Updated, Larger Stack Overflow Demo Database

Stack Overflow publishes a data dump with all user-contributed content, and it’s a fun set of data to use for demos. I took the 2024-April data dump, and imported it into a Microsoft SQL Server database.

It’s a 31GB torrent (magnet) that expands to a ~202GB database. I used Microsoft SQL Server 2016, so you can attach this to anything 2016 or newer. If that’s too big, no worries – for smaller versions and past versions, check out my How to Download the Stack Overflow Database page.

Some quick facts about this latest version:

  • Badges: 51,289,973 rows; 4.7GB
  • Comments: 90,380,323 rows; 26.1GB
  • Posts: 59,819,048 rows; 162.8GB; 32.7GB LOB – this is where you’ll find questions & answers
  • Users: 22,484,235 rows; 2.6GB; 12.5MB LOB
  • Votes: 238,984,011 rows; 5.9GB – a fun candidate for columnstore demos
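
If you want a rough feel for how wide each table is, here’s a minimal Python sketch that divides the sizes above by the row counts. The numbers are copied from the list; treating 1GB as 1024³ bytes is an assumption, and the function name is just for illustration.

```python
# (table, rows, size in GB) copied from the quick facts above
tables = [
    ("Badges", 51_289_973, 4.7),
    ("Comments", 90_380_323, 26.1),
    ("Posts", 59_819_048, 162.8),
    ("Users", 22_484_235, 2.6),
    ("Votes", 238_984_011, 5.9),
]

BYTES_PER_GB = 1024 ** 3  # assumption: "GB" here means 2^30 bytes


def avg_bytes_per_row(rows, size_gb):
    """Average storage per row, using whatever the quoted size includes."""
    return size_gb * BYTES_PER_GB / rows


for name, rows, size_gb in tables:
    print(f"{name:<9}{avg_bytes_per_row(rows, size_gb):>8,.0f} bytes/row")

total_gb = sum(size_gb for _, _, size_gb in tables)
print(f"Total: {total_gb:.1f}GB")
```

The five sizes add up to about 202GB, which lines up with the database size quoted earlier. Votes comes out as the narrowest table per row by a wide margin, and Posts as the widest.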

As with the source data, this database is licensed under cc-by-sa-4.0 (https://creativecommons.org/licenses/by-sa/4.0/). And to be very clear, this is not my data. The data and the licensing explanation below come from the Stack Overflow Data Dump’s page:


But our cc-by-sa 4.0 licensing, while intentionally permissive, does require attribution:

Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). Specifically the attribution requirements are as follows:

  1. Visually display or otherwise indicate the source of the content as coming from the Stack Exchange Network. This requirement is satisfied with a discreet text blurb, or some other unobtrusive but clear visual indication.
  2. Ensure that any Internet use of the content includes a hyperlink directly to the original question on the source site on the Network (e.g., http://stackoverflow.com/questions/12345)
  3. Visually display or otherwise clearly indicate the author names for every question and answer used
  4. Ensure that any Internet use of the content includes a hyperlink for each author name directly back to his or her user profile page on the source site on the Network (e.g., http://stackoverflow.com/users/12345/username), directly to the Stack Exchange domain, in standard HTML (i.e. not through a Tinyurl or other such indirect hyperlink, form of obfuscation or redirection), without any “nofollow” command or any other such means of avoiding detection by search engines, and visible even with JavaScript disabled.

This will probably be the last database update.

Prosus (a tech investment company) acquired Stack Overflow a few years ago for $1.8 billion. When a company’s founders sell their baby for money:

  • The new owners usually want to make a profit on their large investment, and
  • The new owners rarely share the same goals as the original founders, and
  • Sometimes the new owners spent way, way too much (hi, Elon) and are forced to make tough decisions to make their debt payments and keep the company afloat

So now Prosus wants to earn their $1,800,000,000 back, and they’re looking at the actual product they bought. StackOverflow.com has 3 components[1]:

  1. An online app that gives you good-enough answers, quickly
  2. The existing past answers already contributed by the community
  3. The potential of future answers continuing to go into the platform

Can Prosus compete on #1? No. Just no. Companies like OpenAI (ChatGPT), Google (Gemini), and Anthropic (Claude) simply have a better solution for #1, full stop, end of story. A web site – even a free one – can’t beat ChatGPT’s ability to integrate directly with your development environment, review your code & database, and recommend specific answers for the problem you’re facing. Game over.

Can Prosus compete on #2? No. The existing answers (as of April 2, 2024) are available for free with nearly no restrictions. The horse is already out of the barn. Moving on.

Can Prosus compete on #3? If ChatGPT and their friends win on #1 and #2, then the default place for developers to find answers is no longer the web browser. (It’s ChatGPT or Copilot or whatever.) Whatever happens next is going to be intriguing. Today, you and I are conditioned to think, “I’ll post that question on Stack or a forum.” Tomorrow’s developers will not have that same bias:

  • Maybe the dev will prompt ChatGPT, “Can you find me answers online for this?” In that case, the LLM will search the web and summarize – and Prosus won’t stand a chance of convincing the user to post the question at StackOverflow.com.
  • Maybe the dev will open their web browser and ask the question. In that case, the search engine company will try to summarize answers too. These days, both Google and Bing try to avoid landing you on actual web sites, and try to give you the answers on their own pages instead, whether it’s AI-summarized answers or hallucinations or web page summaries next to each site.
  • Maybe the dev will go to the GitHub repo for the related project, and post a question there.

I don’t see an easy way for Stack Overflow to inject themselves into that workflow in the year 2030. I’m sad about that because I have a long personal history with Stack Overflow. At the same time, I’m also kinda glad that the original founders, employees, and advisors (me included) were able to cash out thanks to Prosus’s $1.8B overspending just before the generative AI boom hit.

Prosus needs solutions fast: Stack is now losing roughly $150,000 per day. Prosus’s 2024 annual reports noted that Stack Overflow had $98M in revenue – but lost $57M. I can understand why managers might flail at a company’s switches and dials trying to find a way to stop the financial bleeding.
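
The per-day figure is just the annual loss spread across the year. Here’s a quick sketch of that arithmetic, assuming a 365-day year and using the $98M and $57M figures quoted above; the variable names are only for illustration.

```python
# Figures quoted above; the 365-day year is an assumption.
annual_revenue_usd = 98_000_000
annual_loss_usd = 57_000_000
days_per_year = 365

# Spread the annual loss evenly across the year.
loss_per_day = annual_loss_usd / days_per_year

# How big the loss is relative to what came in.
loss_share_of_revenue = annual_loss_usd / annual_revenue_usd

print(f"Loss per day: ${loss_per_day:,.0f}")
print(f"Loss as a share of revenue: {loss_share_of_revenue:.0%}")
```

That works out to a little over $156,000 per day, so $150,000 is a slightly conservative round number.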

One of the dials they’ve been flailing at is turning down community access to the past answer data, aka business part #2. In their minds, they’re trying to stop OpenAI/Google/Anthropic from making so much money on the back of Stack’s answers. Earlier this year, Prosus tried to pump the brakes on providing the data dumps in XML format on a regular basis, and there was some community outrage, so they relented. However, they’re back: last week, Prosus announced they’re limiting access again.

Based on what Prosus is saying in that post, going forward, I don’t think Prosus will approve of me redistributing new data dumps in a database format. I’m not going to waste time or energy fighting that battle – I’d rather they spent their own energy trying to figure out a way to keep StackOverflow.com a viable business concern going forward. Hopefully they find fun, productive ways to do that, ways that bring the community together onto Prosus’s side rather than turning consumers against Prosus.

However, if Prosus management is willing to limit the data dump, then I have a bad feeling that more barriers are coming over the years. Next, they’ll make answers harder to access for people who have an ad blocker, or who aren’t signed in, or who haven’t paid for a “premium” Stack membership. I’m not mad at them about this, because I don’t have any answers to turn the business around either, and I haven’t heard from anybody who does.

You either die a hero or live long enough to see yourself become the arch-enemy.


[1] Technically the company Stack Overflow has a couple other parts: advertising and Stack Overflow for Teams. Both of those business models are at risk due to AI as well. Their other attempts at diversification, like Articles and Jobs and Developer Story, never caught on.


9 Comments

  • This sort of article is what keeps me coming back to brentozar.com long after I stopped working with SQL Server!

  • Mark Wonsil
    July 16, 2024 4:07 pm

    Without StackOverflow, where will the AI companies train their models?! 🙂

    • I know you wrote it as a sarcastic and/or rhetorical question, but there’s a real answer: documentation and code. In the age of open source, AI can index the contents of Github, read the product’s code, read its documentation, issues, etc.

  • TechnoCaveman
    July 16, 2024 4:22 pm

    About part 1 (AI): that ChatGPT provides a better answer is mind-blowing.
    As for part 3 (future questions): this is where I hope Stack Overflow continues. Why? Human-curated answers make me feel better. As a second effect, it is good to see the “also ran” answers, to see how others tried to solve the problem.
    The old UNIX BSD 4.2 manual had a “See Also” section. People’s first try is not always on the mark, but close. The “See Also” gave pointers to related functions that might be better. Yes, this last feature involves people, who are expensive (unless you can get them for free, i.e. volunteers or community posts).

  • Brian Leach
    July 16, 2024 4:55 pm

    Mr. Ozar suggested trying out ChatGPT so I tried it on a small PowerShell function. ChatGPT comments on the code were spot on. It came up with an alternative function that used an innovative method I hadn’t thought of but returned the WRONG answer. I was able to clean up the revised code and I now use it.

    Just sayin’, test and verify anything AI comes up with. Just sayin’.

    • Yes, just like you should already be well accustomed to when taking advice from web pages, right?

      Surely you wouldn’t copy/paste code directly from Stack Overflow and put it into production without testing that as well, right?

      RIGHT?

    • TechnoCaveman
      July 17, 2024 11:57 am

      Brian, you are 100% right. Test.
      Sometimes people do not state the problem correctly; other times they make assumptions about the range of values or whether there are duplicates. Comments like this are why I like blogs and StackOverflow over ChatGPT.
      Next time you get a new DB, search the name field of sys.indexes for ‘ []’. People will paste the index suggestion from a program or SSMS and not even name it or put a date on it.
