For this week’s query exercise, let’s start with a brief query to get a quick preview of what we’re dealing with:
1 2 3 4 5 6 |
SELECT TOP 100 Location, COUNT(*) AS Population, AVG(Reputation) AS AvgReputation FROM dbo.Users GROUP BY Location ORDER BY COUNT(*) DESC; |
That query has a few problems, but hold that thought for a moment. (You’re going to have to solve those problems, but I just wanted to show you the sample data at first to give you a rough idea of what we’re dealing with.)
Our business users are thinking about doing some targeted advertising, and they have two questions.
First, what are the top 10 Locations populated with users who seem to be really helpful, meaning, they write really good answers?
Second, what are the top 10 Locations where people seem to need the most help? Meaning, they ask a lot of questions, but they do not seem to be answering those of their neighbors?
Your first guess would probably be to adapt the query above and simply sort in different directions, but buckle up. The data presents a lot of challenges that we need to consider.
- There are null and empty locations. Those should probably be excluded.
- In the Users table, the Location column’s datatype is NVARCHAR(100). People can type in whatever they want, which leads to a lot of oddball one-off locations. We should filter for a minimum number of people involved, especially since we’re considering paid advertising. Let’s filter for only locations with at least 100 people in them.
- For the sake of this exercise, let’s pretend that each Location value is geographically unique. You’ll notice in the screenshot that there are entries for “India” and for “Bangalore”. You and I know that everyone who lives in Bangalore also lives in India, but for the sake of this question, let’s pretend that’s not the case. In reality, obviously we need to clean up our Location data, but I’ll save that for another challenge.
- High values for Reputation do not necessarily mean that people write really good answers. You can earn Reputation points a lot of ways, including writing good questions.
- You don’t have to join the first and second queries together. Just write one query for the top 10 helpful locations, and another query for the top 10 locations that need the most help. They can be separate queries, and I’m not worried about performance for the sake of this exercise.
Part of your solution is going to involve looking at the Posts table, which stores questions and answers (and other things.) A few columns to consider:
- Posts.PostTypeId = when 1, the Post is a Question. When 2, it’s an Answer.
- Posts.OwnerUserId = the Users.Id who owns the question or answer.
- Posts.Score = the quality of the question or answer.
For this exercise, I’m not expecting a specific “right” or “wrong” answer – instead, for this one, you’re probably going to have a good time sharing your answer in the comments, and comparing your answer to that of others. Feel free to put your queries in a Github Gist, and include that link in your comments. I’ll circle back in a week and write about different ways to solve it. Have fun!
12 Comments. Leave new
Hi, Here is my contribution to this Sister Location game, also had to play with indexes to speed it up a bit !
https://gist.github.com/gregy2k/23e7071e406a701100a963c194bd16d0
Greg – cool! Now, with the data in the screenshot in mind, the top location (India) has about 50,000 people, and the next highest have 15k-30k.
If India has 25,000 people who only answer questions (and do it well),
AND they have 25,000 people who only write questions (never answers),
what would your result set look like?
I would say perfectly empty
Ooo, bad news there. For example, here’s how your query results look in the StackOverflow2010 database: https://imgur.com/a/AxA8d5z Note that some locations show up in both result sets.
I also ran it in the 2018-06 copy of the database and got similar results, locations in both result sets: https://imgur.com/5pqB3au
So, here is a slow solution
https://gist.github.com/arboledasaa/80d360ffdca4ba026344f8a99c808b0d
You are no potato! That’s a good answer because it passes my first sniff test: making sure the same locations don’t show up in both result sets.
I tweaked your query a little to show the number of people, questions, answers involved, and here are the results in the 2018-06 copy of the database: https://imgur.com/a/ws3h6AZ
I don’t have anything against your sorting method – you gotta pick SOMETHING to sort by here, and I wasn’t expecting one “right” answer. I like your approach. Good job!
Here’s what I came up with: https://gist.github.com/zsiv/bffea1a7709076ce076e6017bd603331
Cheers!
Nick – nice work! I made a couple of tweaks in the comments on your gist, but overall it’s great! I like your sorting method on the second query – it didn’t occur to me to compare questions & answers looking for a population that’s the most biased towards answers, and I love it. Nice work!
Database used StackOverflow2013.
Quote:
“That’s a good answer because it passes my first sniff test: making sure the same locations don’t show up in both result sets.”
Since when are writing good answers and asking many questions in any way correlated?
Anyway here is my submission for this weeks exercise:
https://gist.github.com/Montro1981/794b3a2625e0b70b546818e57c42b22d
I went with a bit purist way of determining the locations we want to target.
For the first query I used the total of the score as the way to sort the result, as the averages of the Scores are very close together.
For the second query I had to interpret Brent’s question “but they do not seem to be answering those of their neighbors”, did he mean it literal or figurative? So I wrote a literal or figurative query. To determine the result set of the second I used a “Percentage of users asking questions but not answering any” as the way to sort the result set.
Let’s see what Brent thinks (after he take a sip of his tasty beverage).
As I was driving home, it occurred to me that the average should be used in the first query.
[…] This week’s query exercise asked you to find two kinds of locations in the Stack Overflow database: […]