StackOverflow Data Mining: Cleansing the Data


The first stage of mining is a dirty, ugly business.

My Datacenter

Miners don’t emerge from tunnels bearing armfuls of shiny diamonds.  They come out with filthy, misshapen rocks that might be something valuable – but might be worthless junk.  There’s no way to tell what you’ve really got until you’ve spent some time analyzing and polishing.

Take one of my early findings in the StackOverflow database export: the average age of StackOverflow users is 31, but in May, the average age of the person asking a question tagged “hook” was 59.  That’s a serious deviation.  At the other end of the scale, people asking questions tagged “ec2” had an average age of, uh, zero.  While there is the possibility that RockhardAwesome is hard at work building virtual machines in Amazon EC2, I’m voting that one down.

That’s what I get for jumping into mining without cleaning off my rocks first.

Out of the 86,110 users in the database export, only 22,747 provided their age – and the key phrase is “provided their age.”  You can’t trust any data you get from human beings, especially these particular folks:

Ed – Age 256
svec – Age 109
deuseldorf – Age 89
Coding the Wheel – Age 89
Keng – Age 89
Will Dean – Age 89
kokos – Age 89
ColinYounger – Age 89
Lars Truijens – Age 89
dydx – Age 89
Confused Computer Guy – Age 89
Ian Kelling – Age 89
davr – Age 89
Smirking Liberal – Age 89
Sam Meldrum – Age 89
DrStalker – Age 89
Frans – Age 89
Mark Bessey – Age 89
Tony Andrews – Age 89
Pat – Age 89
J-P – Age 89
Simon – Age 89
danb – Age 89
dhislop – Age 89
Matt Rogish – Age 89
Josh – Age 89
pozdziemny – Age 89
chinna – Age 89
Alan Storm – Age 89
Joseph Ducreux – Age 89
jamesh – Age 89
toobstar – Age 89
markd – Age 89
Atif Aziz – Age 89
Peter Boughton – Age 89
que que – Age 89
DJ – Age 89
Cliff – Age 89
gaoshan88 – Age 89
King Avitus – Age 89
alden – Age 89
Alan – Age 89
yx – Age 89
ElephantMoss – Age 89
Loki – Age 89
Tautologistics – Age 89
Alkini – Age 89
h_power11 – Age 89
Click Upvote – Age 89
Salty – Age 89
Sean James – Age 89
kenneth – Age 89
ysangkok – Age 89
Pod – Age 89
Edward – Age 89
MedicineMan – Age 89
Heikki Toivonen – Age 89
Stuart – Age 89
ForceMagic – Age 89
Jane Sales – Age 89
hanesjw – Age 89
xx – Age 89
Silfheed – Age 89
noob source – Age 89
Snickers – Age 89
davefb – Age 89
markti – Age 89
sampablokuper – Age 89
afitzpatrick – Age 89
mishac – Age 89
Computer Security – Age 89
oofoe – Age 89
Tyler Egeto – Age 89
jeffa00 – Age 89
Nikola Jevtic – Age 89
Dave – Age 89
monkeysword – Age 89
wowus – Age 89
sgargan – Age 89
saidireddy – Age 89
Bobby Fever – Age 89
Zaakk – Age 88
Gary – Age 88
rlb.usa – Age 88
tan – Age 88
Kieranmaine – Age 88
Ainab – Age 88
Sleep Deprivation Ninja – Age 88
joelhardi – Age 87
Simon H – Age 86
Nick Hildebrant – Age 86
alanl – Age 84
Dustin – Age 81
jeffamaphone – Age 80
molf – Age 80

I applaud these folks for their civil disobedience, and curse them for same.  There’s an interesting underlying correlation: people who ask questions about hooks seem to be more likely to lie about their age.  I’ll leave that as an exercise for the reader.
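Here’s a rough sketch of what cleaning off the rocks might look like for the age column.  The cutoffs are my own assumption, not anything StackOverflow defines: I picked 79 as the upper bound only because the suspicious list above starts at age 80.  The users named hypothetical_user_* are made up for illustration.

```python
from statistics import mean

# Assumed bounds for a "plausible" age; neither number comes from the export.
MIN_AGE = 10
MAX_AGE = 79  # the suspicious list above starts at age 80


def clean_ages(users):
    """Return the ages that are present and fall inside the assumed range."""
    return [
        u["age"]
        for u in users
        if u.get("age") is not None and MIN_AGE <= u["age"] <= MAX_AGE
    ]


# Three rows from the list above, plus hypothetical users for contrast.
users = [
    {"name": "Ed", "age": 256},
    {"name": "svec", "age": 109},
    {"name": "Zaakk", "age": 88},
    {"name": "hypothetical_user_1", "age": 31},
    {"name": "hypothetical_user_2", "age": 27},
    {"name": "hypothetical_user_3", "age": None},  # age not provided
    {"name": "hypothetical_user_4", "age": 0},     # the "ec2" problem
]

raw = [u["age"] for u in users if u["age"] is not None]
cleaned = clean_ages(users)

print(f"raw average:     {mean(raw):.1f}")
print(f"cleaned average: {mean(cleaned):.1f}")
```

A hard cutoff like this is blunt: it would throw out any genuine 85-year-old along with the jokers.  For averages per tag, that trade-off is probably acceptable.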

On the bright side, I’ve found some other interesting bits of data, although these are still very much rocks that haven’t been cleansed yet:

  • Questions tagged beginner get significantly more upvotes than other questions (avg 391, sitewide avg 120), which might indicate that if you want an upvoted question, you should write one for beginners.
  • Questions tagged routing, resources, video, programming or google are favorited more than twice as often as the average.
  • Questions tagged svn are asked by people who do more downvoting than other users (avg 18, sitewide avg 10).  Conversely, questions tagged vim or interop are asked by people who do more upvoting (avg 324 and 303, sitewide avg 119).
  • Questions tagged homework are asked by younger users (avg age 24, sitewide question avg 29).  Makes sense.

I’ll dig more into this tomorrow, but now I’m off to see my dad to celebrate his 60th birthday.  Hmmm – you know, come to think of it, I haven’t actually seen his driver’s license…


7 Comments

  • Cory Doctorow wrote a bit of Semantic Web skepticism: ‘Metacrap’. It would seem to apply here:

    # 2. The problems

    * 2.1 People lie
    * 2.2 People are lazy
    * 2.3 People are stupid
    * 2.4 Mission: Impossible — know thyself
    * 2.5 Schemas aren’t neutral
    * 2.6 Metrics influence results
    * 2.7 There’s more than one way to describe something

    But that’s not a reason to give up. Just give up on any meaningful age-related statistics.

    Happy b-day to yr dad!

  • Do you have any idea why so many ages are 89?

    I don’t remember off the top of my head, but is the age control a drop down list starting at 1920?

  • The age control is a date field where you put in your date of birth. Maybe all of those users picked a date of an event in 1920? I’ve got access to the database, but I steadfastly refuse to look, hahaha. That feels like it’s cheating. It’s like reading show recaps on the internet before you see the show on Tivo – it just spoils all the fun. That pretty much sums up my approach to this data – it’s like playing a detective game.

  • “They come out with filthy, misshapen rocks that might be something valuable – but might be worthless junk.”

    Nice metaphor. I’m stealing that one for future use!

  • because age is a stupid and intrusive question to ask. If I am old does that make me more or less right? If I am young does that make me more or less hip and happening?

    Age is just as stupid as asking hair color so that all Brits can decide red-hairs should be ridiculed and all Americans can mock blondes.

  • Haha, I see I am on your list.

    I remember I had originally given it my real age, in order to get the “filled out your profile” badge. Then I decided, eh, how is that anybody’s business. So I tried to change my age to the accepted “I don’t want to give my age” default of 99 years. I think, but can’t say for sure, that the StackOverflow UI prevented me from doing this. This is probably why there are so many 89-year-olds wandering the StackOverflow database…

  • Meanwhile, Brent got his answer at

    (Indeed: 1920 is the minimum year that is allowed.)

