Finding Sister Locations to Help Each Other: Answers & Discussion

Last Updated 2 months ago

This week’s query exercise asked you to find two kinds of locations in the Stack Overflow database:

Locations populated with users who seem to be really helpful, meaning, they write really good answers
Locations where people seem to need the most help, meaning, they ask a lot of questions, but they do not seem to be answering those of their neighbors

In the challenge post, I gave you the same kinds of requirements that our end users would typically give, but not the kinds of requirements that a business analyst might specify. This challenge was actually inspired by a friend of mine who transitioned into an analyst role, and I was reminded of just how hard that work really is. Translating vague requests into detailed specs is a lot harder than it looks, and the wide variety of this week’s answers reflects that.

Normally when I write these challenges, as soon as I’m done writing the challenge post, I immediately start writing my own solution. I didn’t do that here because I wanted to show your different interpretations of the request. As the answers started coming in, I copy/pasted ’em into SSMS just like everything else I find on the internet, and I used 3 sniff tests:

Did your top locations have a decent number of people in them? I didn’t wanna see answers that said, “This podunk town is a location we should focus on.”
Did your two sets of locations have overlaps? I didn’t wanna see answers that said, “This place has a lot of people who write questions, and a lot of people who write answers.” That’s just a list of big places, which violates the “they do not seem to be answering those of their neighbors” part of the request.
Do the result sets show how they were sorted? Anytime I see a ranked list, I want to know how the data was ranked. (In this particular challenge, I’m really flexible on sorting methods – I just wanna know what it was.)

As an example, I’m showing a picture of one answer’s results here. The person shouldn’t be ashamed by any means – this was a hard question, for sure – but it serves to illustrate an answer that doesn’t pass the sniff test. The top result set are the helpful locations, and the bottom result set are the locations that need help. A single location shouldn’t be in both – and yet the same locations show up in both list. This is just a list of the biggest locations, basically.

This would happen if your queries were just:

The users who posted the most answers (without posting questions), and
The users who posted the most questions (without posting answers themselves)

In any big location, you’ll get a lot of those, so that isn’t going to work. Instead, we’re going to need to get more creative with our filtering & sorting.

I’ll discuss two approaches, but to see more, check out the comments on the challenge post, or my own approach on the Postgres version of the answers.

I am a potato?’s Answer

I had to laugh out loud at their name on their comments, and they most certainly were not a potato. Here’s their answer in a Github Gist, and my edited version is a comment on theirs. The results will take a little explaining:

The first result is the helpful people, and to find it, they sorted by average answer score descending. This does favor places that seem to favor quality over quantity: note how #6 (Sherbrooke) produced only 326 answers, and #7 (Sunnyvale) produced over 64x more questions – but they’re ranked lower.

That sort order is completely fine – we gotta pick something to sort by, since the users didn’t tell us a specific sort order. This answer is just as valid as anybody else’s, and I wanted to pick a variety of sort orders to show how the request could be interpreted.

However, there’s another problem with line 6: population is only 31 people. If you look at both result sets, they’re both riddled with places with less than 100 people. So, how’d that happen? They started by building a list of locations with >100 people, but didn’t consider whether those people had produced over 100 questions or answers. To fix it, we just need to filter the result sets for >100 population, and then we’ll get better quality.

The second result set is the people who need help, and to find that, they sorted by a high ratio of questions to answers. The ratio isn’t shown in the results, and I would have preferred that, but that’s okay. The general idea makes sense. Again, we have small populations showing up here, and row 9 isn’t really better than row 10, but once we get a filter for at least 100 people in the location, it’ll be better.

Nick’s Answer

NotPotato’s answer averaged out Posts.Score across the entire population of the location. Nick’s answer took a different approach: he got the average answer score for each user in the location (because people may have posted multiple answers with varying scores), and then averaged out all of the users’ averages. As a result, this top 10 looks different than NotPotato’s:

But also, his second result set looks wildly different than NotPotato’s. Part of that is because Nick’s answer correctly filters for large populations, and another reason is because he used a different metric for what helpful means. He used the formula of (AnswerCount – QuestionCount), calling that AnswerDelta. This is interesting because places with higher volumes of answers will be favored, not necessarily tiny locations with tiny metrics (like 8 answers vs 1 question).

Both of those are valid approaches!

Sure, we had some bumps along the way, but I like this kind of challenge, especially for testing potential new hires. I don’t really care if their sorting choices match my own ideas – I’m much more interested in seeing if they can spot obvious bugs in their result sets, and how their T-SQL looks.

I hope you enjoyed this week’s challenge! The Query Exercises will be taking a break for the next couple of weeks because I’m traveling to Hong Kong and Shanghai. I’ve still got plenty of blog posts queued up, but the interactive exercises will return in April when I’m back. If you’re starving for homework, check out the prior Query Exercises posts and catch up on the ones you missed so far.

[Video] Office Hours From the Future

What Happens When Multiple Queries Compile at Once?

1 Comment. Leave new

TY
March 7, 2024 12:01 pm

Hi Brent, that was a great exercise, thank you!
Just a note here: both links to NotPotato’s and Nick’s answers lead to the same place (all roads lead to Rome xD) –
https://gist.github.com/arboledasaa/80d360ffdca4ba026344f8a99c808b0d
Maybe subconsciously you like this answer better? 🙂

Reply