I love living in the city
Blog posts about people’s favorite data sets seem to be popular these days, so I’m throwing my hat in the ring.
NYC has been collecting all sorts of data from all sorts of sources. There’s some really interesting stuff in here.
Another personal favorite of mine is MTA turnstile data. If you’re a developer looking to hone your ETL skills, this is a great dataset, because it’s kind of a mess. I actually had to use PowerShell to fix inconsistencies with the older text files, which I’m still recovering from. I won’t spoil all the surprises for you.
Of course, there’s Stack Overflow.
You can’t go wrong with data from either of these sources. They’re pretty big. The main problem I have with Adventure Works is that it’s a really small database. It really doesn’t mimic the large databases that people deal with in the real world, unless you do some work and run a script to make it bigger. The other problem with Adventure Works is that it went out of business a decade ago because no one wanted to buy yellow bikes. I’ve been learning a bit about Oracle, and their sample data sets are even smaller. If anyone knows of better ones, leave a comment.
Anyway, get downloading! Just don’t ask me about SSIS imports. I still haven’t opened it.
Thanks for reading!