I love living in the city
Blog posts about people’s favorite data sets seem to be popular these days, so I’m throwing my hat in the ring.
NYC has been collecting all sorts of data from all sorts of sources. There’s some really interesting stuff in here.
Another personal favorite of mine is MTA turnstile data. If you’re a developer looking to hone your ETL skills, this is a great dataset, because it’s kind of a mess. I actually had to use PowerShell to fix inconsistencies with the older text files, which I’m still recovering from. I won’t spoil all the surprises for you.
Of course, there’s Stack Overflow.
You can’t go wrong with data from either of these sources. They’re pretty big. The main problem I have with Adventure Works is that it’s a really small database. It doesn’t mimic the large databases that people deal with in the real world, unless you do some work or run a script to make it bigger. The other problem with Adventure Works is that it went out of business a decade ago because no one wanted to buy yellow bikes. I’ve been learning a bit about Oracle, and their sample data sets are even smaller. If anyone knows of better ones, leave a comment.
Anyway, get downloading! Just don’t ask me about SSIS imports. I still haven’t opened it.
Thanks for reading!
4 Comments
There’s plenty of unstructured random gibberish on the UK Government’s data portal, some of it interesting, some not so much…
https://data.gov.uk/
The US version (https://www.data.gov/) is also a good resource for such things.
Datasets? Check your local library’s website. Here in St. Louis the county website has a Research section with a specific area titled “Databases A-Z”. All sorts of different datatypes…
I’m going to be presenting at Code Camp and needed a dataset that would be generally interesting for listeners. I found a web page that can be loaded via Power Query about soccer EuroCup statistics: http://en.wikipedia.org/wiki/uefa_european_football_championship