Tag Archive: stackoverflow

StackOverflow VC, SQL Server, and Whuffie

Yesterday StackOverflow announced that they’d accepted $6 million in venture capital funding.  Joel Spolsky held a quick public chat to discuss it, and there were some interesting questions from the audience.  I’m going to paraphrase some of the questions and give my own answers.

Q: Now that StackOverflow is going to be big, will they need to dump SQL Server for NoSQL?

If you were going to write a list of things you should never do to SQL Server that needs to perform fast, StackOverflow would check a lot of the boxes:

  • Using LINQ and Full Text Search heavily for queries
  • Storing data on SATA drives
  • Putting both data and logs on the same drive (not to mention the full text catalogs)
  • Using one server, no fancy replication for load balancing

And SQL Server handles the load just fine.

Q: What? I thought M$$QL was the suxx0rz?

Two things make all the difference.  First, it’s not really all that much load.  StackOverflow is the smallest SQL Server database I work with by an order of magnitude.  Most of my SQL DBA readers manage much bigger databases on a daily basis and yet consider themselves to be junior DBAs.  There’s a disconnect between what programmers see as big data versus what enterprises see as big data.  For example, in the last two weekends, I’ve done performance tuning gigs for two separate companies that had data more than 20x the size of StackOverflow’s, yet didn’t have a full time database administrator.

Second, the staff really knows what they’re doing.  They know you’re supposed to cache frequently reused data in the app tier, for example – sounds obvious, but it’s trickier than it sounds.  If you’re really good – and I don’t mean “I’ve got a blog” good – you can build amazing stuff with just about any tool.  You could build something of StackOverflow’s size on any database platform out there.

If you think the reason your code can’t scale is because of the language or database, you’re probably doin’ it wrong.

Q: How much of that $6 million did you get?

None.  I’ve never been paid by StackOverflow.  If I was Joel and Jeff, I’d give money/stock/cocaine to the community moderators long before I gave it to Brent Ozar.  My work is tiny compared to the moderators, and I’m glad (although a little sad) that they recently revamped the StackOverflow About page to reflect that.

That’s right – I get paid in pixels.

Q: Awesome, here’s a picture of bacon. Now I need your help with…

No.  I help with DBA work at StackOverflow for the same reason you answer questions there.  When you post an answer, add tags, or help clarify questions, your reputation score goes up.  You don’t make money on directly – it’s just fun doing it.

But as your score gets higher, you can use that for things in ways that don’t seem immediately obvious to you yet.  I touched on this in my recent Rock Stars, Normal People, and You post.  Jon Skeet is an extreme example – he can probably walk into any geek gathering, show his ID, and people will start buying him drinks.  If he posted a tweet saying he was looking for work, you’d better not hope you have anything pending at the printer, because an army of programmers will be printing up Jon’s resume immediately to run into their boss’s office and say, “WE GOTTA HIRE THIS GUY!”

The Whuffie Factor

The Whuffie Factor Explains Everything

Your StackOverflow score is your living resume.  It’s like whuffie in Down & Out in the Magic Kingdom – it’s a currency that you can use to get things.  When you get to a high enough score, you can trade it for things – things like consulting gigs.  Companies will look at something like StackOverflow, look up the C# tag, and find the highest rated people.  They’ll review your answers, see the high votes from your peers, and then check your availability for short-term consulting – perhaps even just a single hour.

Not every question can be asked in public, and not every answer can be given without spending time in the client’s systems.  Most importantly, those questions and answers are where the most money changes hands.

Q: But your StackOverflow score sucks.

Yep, I’m part of the old guard.  I’m 36.  My generation had/has a different way of measuring reputation, and frankly, it sucks.  We gauge reputations based on personal relationships with people we’ve met, usually in person but sometimes through social networks.  We’re limited to a smaller group of experts on any given topic.  When I need help with something, I have a fairly limited number of trusted people I can call on.

I work (a little) on StackOverflow for free because it’s the old-school equivalent of a reputation score.  People have come to me and said, “I hear you’re the database guy for StackOverflow – what would it take to get you to help me with ___?”  That’s why I’m quite happy to take my pay in pixels, and why I know that your high StackOverflow or ServerFault reputation score will be worth money down the road.

How much would you pay for one hour of Jon Skeet’s time?

What if you had a tough C# question and you couldn’t show your code in public?  What if you wanted to listen to him do training presentations about what he knows?  Would you pay real money for that?  I know that you would, because people are paying to attend SQLCruise with me and Tim Ford, and people lined up to pay $99 to get into StackOverflow DevDays.  What if you had a really high StackOverflow/ServerFault reputation for a given tag, and you organized an event like SQLCruise or DevDays for your own tech interest?

Reputation is everything.  This is why I get so excited about StackOverflow’s reputation scores – conventional forums failed not just because they’re painful to navigate, but because they didn’t measure things.  When you measure reputation, you enable all kinds of ways to make money.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Querying the StackOverflow Data Dump

StackOverflow, ServerFault, and SuperUser are Q&A sites for IT professionals.  I’ve blogged about why I like ServerFault before, but perhaps one of the coolest reasons for database people is that they make their data public.  Every month, StackOverflow dumps out their data to XML.  You can import the data dump into SQL Server, and the whole thing is less than 10gb as of this writing.

But you haven’t done that because you’re lazy.

You just want to open up SQL Server Management Studio 2008 or Toad for SQL Server and connect to the database.  Alright, you got it:

  • SQL Server: brentozar.dyndns.org (as of this writing, it’s 71.57.120.247 – if the name doesn’t work for you, try the IP)
  • Username: StackOverflow_Reader
  • Password: c0mm0ns
  • Databases you can access: ServerFault, StackOverflow, StackOverflowMeta, SuperUser

The data is not a “live” copy – it’s just the monthly Creative Commons data dump, and the schema is the raw output from Sam Saffron’s data dump tool linked above.  The server is a desktop-class machine, nothing fancy, and it’s using my home internet connection.   (Yes, that’s why I’m posting this halfway through the day on Friday – easing into the load.)  You can get a snapshot of my current desk gear and my servers at Flickr.

If you can’t connect to the SQL Server, there’s probably a firewall blocking port 1433 between your workstation and my lab.  Please don’t leave a note to complain – try accessing it from another location, like from your house instead of your work.

To learn more about the schema and how to query it, check out these articles at SQLServerPedia:

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Designing a Recovery Strategy for StackOverflow

When I design a backup & recovery strategy for a database, I don’t talk to the developers, database administrators, or systems administrators first.  The first people I go to are the business managers, and I ask three questions:

  • How much money would you lose if you lost the data altogether?
  • How much money would you lose if you were down for X hours?
  • How much time & money can you afford to devote to backups?

Each of these questions drive the strategy, and none of these questions are actually answered by the DBA.  In Jeff Atwood’s post about StackOverflow’s backup & recovery strategy this week, there’s been a lot of comments suggesting alternate backup methods.  Curiously, though, none of them seem to actually ask any of those three questions.  Let’s examine these questions one by one, and then look at StackOverflow‘s choices to see if they make sense.

How much is your data worth?

If you had to reconstruct the data from scratch, using other systems and records, how much would it cost you to do it?  And would you actually pursue that objective, or would you consider folding up the business altogether, or maybe living without the data?

If your database holds financial transactions for your customers, with live incoming credit card transactions and debit card withdrawals, it’s obviously very valuable, and you’d be forced to spend a fortune to get back in business.  If you’re owned by Microsoft and you lose all Sidekick customer contact data, you pour money and manpower into getting it back.  Other companies like Magnolia and Journalspace, on the other hand, have decided to pack their bags and call it a day.

The business has to work with IT to come up with a quick back-of-the-envelope calculation as to what complete data loss would cost, and that’s part of the formula that dictates what we spend on data protection.  Sometimes this simple question leads businesses to realize, “That particular data doesn’t really matter – we could rebuild it all from other sources for next to nothing.  Maybe we shouldn’t back it up at all.”

How much money does downtime cost?

While this database is down, can you still sell your products?  If not, then it’s easy to calculate the cost of downtime – it’s your sales metrics.  If you sell an average of $100,000 per hour, then an hour of downtime costs you $100,000.

And furthermore, if you can’t sell products, do your customers hold off on the purchase, or do they switch to another vendor?  If Amazon.com’s databases go down, then their customers probably won’t wait around until the site comes back up.  They’ll head straight over to another web site and spend money with a competitor.  This has a hidden business danger, too – if your customers like that new site better, they might stick with it and bypass you for future orders.

However, if you can still sell products, keep your customers & employees happy, and business moves along unaffected except for a few bumps, then that might guide your backup & recovery strategy too.  Or if your company isn’t making all that much per hour, then maybe you don’t want to dedicate a fortune to having your systems highly available.

How much resources can you devote to backups?

Availability costs time and money.

The more available your system needs to be, the more time and money it costs.  If you’re a global enterprise with a killer cash flow, then you can make more conservative decisions, back up more databases more often, and not be as concerned with the costs.  If you’re a startup with three guys, and all your revenue goes towards paying salaries, then you want to watch those backup costs a little more closely.

In addition, backups cost more than just money.  If you need up-to-the-minute recovery with constant transaction log backups, you have to put your database in full recovery mode – which can slow things down.  If you want the fastest possible response times, and you’re looking for every millisecond edge against your competitors on each page load, backups are going to hit your radar.
stackoverflow

So how does StackOverflow stack up?

Let’s ask the three questions:

  • How much is their data worth? Their data consists of questions and answers from the programming community.  Sure, they’re the #1 programming site in the world, but even the words of Jon Skeet are only worth so much.
  • How much money does downtime cost? This might sound callous to users, but if StackOverflow was down for four hours, the vast majority of users would get over it.  They might post a few questions elsewhere, but for the most part, they’d just sit around on Twitter complaining, refreshing their browser while they waited for StackOverflow to come back up.  They’re addicted, and they’ll tolerate downtime.
  • How much resources can they devote to backups? StackOverflow is a small startup trying to make a living off ad revenue.  Their primary target users are extremely tech-savvy people who are fully aware of tools like Firefox and Adblock Plus, making it even more challenging.  In an ideal world, they’d have a SAN with snapshot sub-second backup & restore technology – but that costs a lot of money, and it’s not realistic.  Frankly, every bit of traffic in and out of their colo servers costs them money, and not an insignificant amount.

With these answers in mind, StackOverflow’s decisions not to do transaction log backups, offsite log shipping, database mirroring, and so on make good business sense. Us geeks in the crowd may not like it, and we might demand the latest and greatest in backup & recovery technology, but at the same time we want StackOverflow to remain free.  As their volunteer DBA, I’d love to do 24×7 log shipping or database mirroring to a secondary server at another colo facility – but I wouldn’t be willing to pay out of my own pocket for expenses like that.

StackOverflow Database Server

StackOverflow Database Server

To drive the resources part home, take a look at the database server as shown in Jeff’s Stack Overflow Rack Glamour Shots post this week.  Count the number of hard drives.  That’s six SATA drives shared by the OS, page file, database files, log files, and full text catalogs to serve over one million pageviews per day.  Many of you out there use a server like this as your development server, and you complain that it’s slow.  Guess what – this is both their production server and development server.  They’re achieving some incredible stuff with a very limited hardware budget, and it’s a testimonial to what you can do if you really, really focus on performance.

And while I’ve got you thinking about backups, now’s a great time to check out some of my other backup articles:

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

My Weekly Bookmarks for October 30th

Here’s my bookmarked links for October 26th through October 30th:

SQL Server Links

#SQLPASS Links

Tech Links

The Junk Drawer

These bookmarks are automatically imported from my bookmarks at Delicious.com. If you’d like to get up-to-the-minute updates on what I’m bookmarking, you can subscribe to my bookmark RSS feed.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

My Weekly Bookmarks for October 9th

Here’s my bookmarked links for October 2nd through October 9th:

SQL Server Links

Tech Links

The Junk Drawer

  • I Love That Game – Brilliant criminal minds at work.
  • Twitter Data Analysis: An Investor’s Perspective – A bunch of oddball stats about Twitter users and their histories.
  • Will Work for Whuffie? – Why you have to charge fees for speaking engagements when you hit a certain level of fame. (No, I’m not there yet, hahaha, but even if I was, my speaking engagements are free because I’m a service of Quest Software. No, not that kind of “service,” buddy.)

These bookmarks are automatically imported from my bookmarks at Delicious.com. If you’d like to get up-to-the-minute updates on what I’m bookmarking, you can subscribe to my bookmark RSS feed.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

My Weekly Bookmarks for October 2nd

Here’s my bookmarked links for September 25th through October 2nd:

SQL Server, Cloud, and Tech Links

Writing, Blogging and Networking Links

The Junk Drawer

These bookmarks are automatically imported from my bookmarks at Delicious.com. If you’d like to get up-to-the-minute updates on what I’m bookmarking, you can subscribe to my bookmark RSS feed.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

SPWho2.com: StackOverflow user and tag statistics

I took the StackOverflow database dump, brought it into SQL Server, and did some slicing and dicing by tags and users.  I wanted to find the answers to questions like:

I’m having a lot of fun with the data, and I thought other people might enjoy it too.

To make it easier, I built SPWho2.com, a site with StackOverflow user and tag statistics.  It’s not terribly attractive yet, unless like me, you find numbers attractive.  It’s a side hobby for me right now, and over time I’ll add in more data visualization with graphs and trends.  If there’s anything you’d like to see added, let me know.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Finding old, unanswered StackOverflow questions

So you wanna start answering questions on StackOverflow, but you’re frustrated because it seems like people are fillin’ out answers even faster than you can type?  Find yourself up at 3am hitting Refresh just hoping a new question comes in?

Necromancer Badge Shirt

Necromancer Badge Shirt

Here’s a new trick:

  1. Import the StackOverflow database into SQL Server
  2. Run this query to find old, unanswered questions
  3. Profit!  Err, no, but watch your reputation go up as you solve old questions.

It turns out there’s a little (and I do mean little) niche of questions ripe for the pickin’.  I spent half an hour around midnight last night answering a handful of old SQL Server questions and I’ve already gained 75 points – pretty surprising for a Friday night.  Even more surprising to me, I’d posted an answer to a question that had been dormant since December (6 months ago), and the questioner already checked my answer and thanked me for it.  See, answering old questions really does help people, not just game the system for reputation points!  (That’s how I rationalize it anyway.)

Another boost: you’ve got a better chance of earning the Necromancer badge, which is awarded when someone answers a question more than 60 days old and then gets 5 upvotes on their answer.

Oh, and the shirt?  Yeah, I’ve set up a few StackOverflow badge shirts, including badges that don’t exist (yet) like Fastest Gun and Bounty Hunter.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

How to Import the StackOverflow XML into SQL Server

Want to play around with the StackOverflow database export?  Here’s how to import the XML files into SQL Server, and some notes about the tables and data schema.

Script to Import StackOverflow XML to SQL Server

This T-SQL script will create six stored procedures:

  • usp_ETL_Load_Badges
  • usp_ETL_Load_Comments
  • usp_ETL_Load_Posts
  • usp_ETL_Load_Users
  • usp_ETL_Load_Votes
  • usp_ETL_Load_PostsTags (which isn’t one of the StackOverflow tables – more on that in a minute)

The XML import code is from an excellent XML tutorial by Denny Cherry.  The scripts create a table (named Badges, Comments, Posts, Users, Votes) for each XML file.  The schema matches the XML file with one exception – I added an identity field to the Badges table.  The rest already had Id fields.  The tables don’t have any indexes to speed querying. I would highly recommend that you not change the schema of any of these tables, because I’ll be giving out more scripts over the coming days and weeks that rely on the base tables.  If you want to add more data, add additional tables.  Plus this will keep your importing clean anyway – you can dump and reload the StackOverflow data repeatedly as long as you keep that data separate.

After importing, the database is about 2gb of data.  Be aware that depending on your database’s recovery model and how you run these stored procs, your log file may be 2gb as well. None of the sentences in this paragraph blend together well, which bothers me but not quite enough to stop publishing the blog entry. Anyway, on we go.

If the table already exists when the stored proc runs, the table contents are deleted using the TRUNCATE TABLE command, which requires hefty permissions.  If you don’t have admin rights on the box, substitute DELETE for the five TRUNCATE TABLE commands.  Using DELETE will take significantly longer to run.  For reference, with TRUNCATE TABLE, the stored procs take around 10 minutes on my faster machines, and around half an hour on my slower virtual machines.

These stored procs only work for the new database dump released on Monday morning, not the one released last week.  If you get invalid XML errors while importing, you’ve got the older database dump.  Go get the fresh hotness.

Now for some schema notes, and I’m going to go out of alphabetical order because everything links back to the Users table.  I’m only going to cover the fields that aren’t immediately obvious:

Users Table

  • Id – primary key, identity field from the original StackOverflow database.  Id 1 is “Community”, which is a special user that denotes community ownership, like wiki questions and answers.
  • LastAccessDate – this is useful because it tells you when the data export was last updated.  If you’re doing queries for things like the last 30 days, check the most recent date here.
  • Age – the user enters this manually, so it’s not terribly reliable as I discovered earlier.
  • AboutMe – I’m using an nvarchar(max) field here, but you can go with a shorter field like nvarchar(2000).
  • UpVotes and DownVotes – the number of votes this user has cast.

Posts Table

In StackOverflow, questions and answers are both considered posts.  If a record has a null ParentId field, then it’s a question.  Otherwise, it’s an answer, and to find the matching question, join the ParentId field up to Posts.Id.

  • Id – primary key, identity field from the original StackOverflow database.
  • Title – the title of the question.  Answer titles will be null.
  • OwnerUserId – joins back to Users.Id.  If OwnerUserId = 1, that’s the community user, meaning it’s a wiki question or answer.
  • AcceptedAnswerId – for questions, this points to the Post.Id of the officially accepted answer.  This isn’t necessarily the highest-voted answer, but the one the questioner accepted.
  • Tags – okay, time to blow out of the bullet points for a second.

StackOverflow limits you to five tags per question (answers aren’t tagged), and all five are stored in this field.  For example, for question 305223, the Tags field is “<offtopic><fun><not-programming-related><jon-skeet>”.  It’s up to you to normalize these.  If you’d like to normalize them out into a child table, check out the usp_ETL_Load_PostsTags stored proc, which creates a PostsTags table with PostId and Tag fields.  Each Posts record (questions only) will then have several child records in PostsTags.

Next, check the contents of the Tag field carefully.  StackOverflow allows periods in the tag, like the .NET tag and ASP.NET tag.  However, in the database, these are stored as “aspûnet”.  Just something to be aware of.

Comments Table

  • Id – primary key, identity field from the original StackOverflow database.
  • PostId – the post parent for this comment.  Joins to the Post.Id field.
  • UserId – who left the comment.  Joins to the User.Id field.

Badges Table

  • Id – an identity field for a primary key.  This number is meaningless – I just added it for some referential integrity.
  • UserId – joins back to Users.Id to show whose badge it is.
  • Name – the name of the Badge, like Teacher or Nice Answer.
  • CreationDate – when the user achieved the badge.

Votes Table

This stores the votes cast on posts, but the key field is VoteTypeId.  The VoteType table wasn’t included in the export, so this table isn’t too useful yet, but if the guys give me the OK I’ll post the contents of that table here.  The Votes table doesn’t include *who* cast the votes, and I’ve got my hands full analyzing the other tables anyway, so I haven’t been interested in the VoteTypes yet.

All of the Id fields except for Badges.Id are from StackOverflow’s original database.  In theory, these numbers will not change, which means if you build your own child table structures like UserBaconPreferences, and you join via User.Id, you should be able to blow away and reload the Users table with every new StackOverflow database dump.  That’s the theory, but in reality, you shouldn’t rely on anybody else’s ID fields, because there’s no reason to believe these won’t completely change down the road.  Who knows – Jeff might switch over to GUIDs as primary keys.

Sample Questions Query

Once you’ve got it all together, you can do some fun stuff. Let’s look at some overall statistics about questions (not answers):

SELECT COALESCE(COUNT(DISTINCT p.ID),0)           AS Questions
       ,COALESCE(AVG(p.Score * 1.00),0)           AS AvgScore
       ,COALESCE(AVG(p.ViewCount * 1.00),0)       AS AvgViewCount
       ,COALESCE(COUNT(DISTINCT p.OwnerUserId),0) AS DistinctQuestioners
       ,COALESCE(AVG(p.AnswerCount * 1.00),0)     AS AvgAnswerCount
       ,COALESCE(AVG(p.CommentCount * 1.00),0)    AS AvgCommentCount
       ,COALESCE(AVG(p.FavoriteCount * 1.00),0)   AS AvgFavoriteCount
       ,COALESCE(COUNT(ClosedDate),0)             AS ClosedQuestions
       ,COALESCE(AVG(u.Reputation * 1.00),0)      AS AvgQuestionerReputation
       ,COALESCE(AVG(u.Age * 1.00),0)             AS AvgQuestionerAge
       ,COALESCE(AVG(u.UpVotes * 1.00),0)         AS AvgQuestionerUpVotes
       ,COALESCE(AVG(u.DownVotes * 1.00),0)       AS AvgQuestionerDownVotes
FROM   dbo.Posts p
       INNER JOIN dbo.Users u
         ON p.OwnerUserId = u.Id
WHERE p.Tags IS NOT NULL

And some of the results are:

  • Questions – 176,137
  • Average Score – 1.89
  • Average View Count – 311
  • Distinct Questioners – 39,795 (meaning anyone who has asked a single question has asked an average of 4.4 questions – there may be some odd stuff in here around anonymous questions though, haven’t looked at that yet)
  • Average Answer Count – 4
  • Average Comment Count – 2.3
  • Closed Questions – 3,656 (or 2% of all questions)
  • Average Questioner Reputation – 1,506
  • Average Questioner Age – 30 (but remember, that’s unreliable)

I’m just getting started playing with it, and I’ll have a fun new StackOverflow statistics toy available for everybody to play with in a couple of days.  In the meantime, you can download the StackOverflow database dump via BitTorrent and download my ETL stored procs.

Update: Sample StackOverflow Queries in the SQLServerPedia Wiki

Jon Skeet had an excellent idea: we need a wiki to store interesting queries.  Wouldn’t you know, I happen to run one!  I added a section in SQLServerPedia for sample StackOverflow database queries.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

StackOverflow Data Mining: Cleansing the Data

The first stage of mining is a dirty, ugly business.

My Datacenter

My Datacenter

Miners don’t emerge from tunnels bearing armfuls of shiny diamonds.  They come out with filthy, misshapen rocks that might be something valuable – but might be worthless junk.  There’s no way to tell what you’ve really got until you’ve spent some time analyzing and polishing.

Take one of my early findings in the StackOverflow database export: the average age of StackOverflow users is 31, but in May, the average age of the person asking a question tagged “hook” was 59.  That’s a serious deviation.  At the other end of the scale, people asking questions tagged “ec2″ had an average age of, uh, zero.  While there is the possibility that RockhardAwesome is hard at work building virtual machines in Amazon Ec2, I’m voting that one down.

That’s what I get for jumping into mining without cleaning off my rocks first.

Out of the 86,110 users in the database export, only 22,747 provided their age – and the key phrase is “provided their age.”  You can’t trust any data you get from human beings, especially these particular folks:

Ed – Age 256
svec – Age 109
deuseldorf – Age 89
Coding the Wheel – Age 89
Keng – Age 89
Will Dean – Age 89
kokos – Age 89
ColinYounger – Age 89
Lars Truijens – Age 89
dydx – Age 89
Confused Computer Guy – Age 89
Ian Kelling – Age 89
davr – Age 89
Smirking Liberal – Age 89
Sam Meldrum – Age 89
DrStalker – Age 89
Frans – Age 89
Mark Bessey – Age 89
Tony Andrews – Age 89
Pat – Age 89
J-P – Age 89
Simon – Age 89
danb – Age 89
dhislop – Age 89
Matt Rogish – Age 89
Josh – Age 89
pozdziemny – Age 89
chinna – Age 89
Alan Storm – Age 89
Joseph Ducreux – Age 89
jamesh – Age 89
toobstar – Age 89
markd – Age 89
Atif Aziz – Age 89
Peter Boughton – Age 89
que que – Age 89
DJ – Age 89
Cliff – Age 89
gaoshan88 – Age 89
King Avitus – Age 89
alden – Age 89
Alan – Age 89
yx – Age 89
ElephantMoss – Age 89
Loki – Age 89
Tautologistics – Age 89
Alkini – Age 89
h_power11 – Age 89
Click Upvote – Age 89
Salty – Age 89
Sean James – Age 89
kenneth – Age 89
ysangkok – Age 89
Pod – Age 89
Edward – Age 89
MedicineMan – Age 89
Heikki Toivonen – Age 89
Stuart – Age 89
ForceMagic – Age 89
Jane Sales – Age 89
hanesjw – Age 89
xx – Age 89
Silfheed – Age 89
noob source – Age 89
Snickers – Age 89
davefb – Age 89
markti – Age 89
sampablokuper – Age 89
afitzpatrick – Age 89
mishac – Age 89
Computer Security – Age 89
oofoe – Age 89
Tyler Egeto – Age 89
jeffa00 – Age 89
Nikola Jevtic – Age 89
Dave – Age 89
monkeysword – Age 89
wowus – Age 89
sgargan – Age 89
saidireddy – Age 89
Bobby Fever – Age 89
Zaakk – Age 88
Gary – Age 88
rlb.usa – Age 88
tan – Age 88
Kieranmaine – Age 88
Ainab – Age 88
Sleep Deprivation Ninja – Age 88
joelhardi – Age 87
Simon H – Age 86
Nick Hildebrant – Age 86
alanl – Age 84
Dustin – Age 81
jeffamaphone – Age 80
molf – Age 80

I applaud these folks for their civil disobedience, and curse them for same.  There’s an interesting underlying correlation: people who ask questions about hooks seem to be more likely to lie about their age.  I’ll leave that as an exercise for the reader.

On the bright side, I’ve found some other interesting bits of data, although these are still very much rocks that haven’t been cleansed yet:

  • Questions tagged beginner get significantly higher upvotes than other questions (avg 391, sitewide avg 120), which might indicate that if you wanted an upvoted question, write one for beginners.
  • Questions tagged routing, resources, video, programming or google are favorited more than twice as often as the average.
  • Questions tagged svn are asked by people who do more downvoting than other users (avg 18, sitewide avg 10).  Conversely, questions tagged vim or interop are asked by people who do more upvoting (avg 324 and 303, sitewide avg 119.)
  • Questions tagged homework are asked by younger users (avg age 24, sitewide question avg 29).  Makes sense.

I’ll dig more into this tomorrow, but now I’m off to see my dad to celebrate his 60th birthday.  Hmmm – you know, come to think of it, I haven’t actually seen his driver’s license…

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts