In my never-ending attempts to distract you from doing real work, I give you something you have absolutely no use for: a SQL Server database backup with about 100k tweets from people I’ve followed over the last couple of months.
I use Tweet-SQL to cache and analyze a lot of things from Twitter. This database isn’t the actual one I use, but it’s just an export of a subset of tables:
- Users – the tweeps. The “id” field is Twitter’s internal number for you, not my own – comes from their API. The cached_* and subscription_* fields are my own, not Twitter’s.
- UsersHistory – whenever I fetch results from the Twitter API and someone’s information has changed, I store the old version of their profile in this table. Typically, the field that’s changing is their followers_count. The “id” field is my own identity number, not from Twitter’s API.
- Statuses – the tweets (and yes, Twitter calls them Statuses). The “id” field is from Twitter’s API.
Sample Queries
This will give you the most loudmouthed tweeps:
1 2 3 4 5 |
SELECT s.user_id, u.screen_name, COUNT(*) AS tweets FROM dbo.Statuses s INNER JOIN dbo.Users u ON s.user_id = u.id GROUP BY s.user_id, u.screen_name ORDER BY COUNT(*) DESC |
Resulting in:

And this query gives you the hours when people tweet the most (in Central time):
1 2 3 4 |
SELECT DATEPART(hh, created_at) AS TweetHour, COUNT(*) AS RECS FROM dbo.Statuses GROUP BY DATEPART(hh, created_at) ORDER BY COUNT(*) DESC |
Resulting in:

Things to Know About the Data
There’s some holes in the data when my server bombed or the Twitter API didn’t return data correctly, and unfortunately, a lot of those holes are around the PASS Summit. I wanted to refetch that data before giving you this database, but I’m running out of time and I’ve got other things on my plate, so I figured I’d just let this loose as is.
The database doesn’t include people with protected tweets, and it only includes things I’d see on my home page. If someone mentioned me but I’m not following them, you won’t see it in this database export.
You can download the SQL Server database backup and restore it onto a SQL 2005 (or newer) server. If you find anything interesting in the backup, post it here in the comments. I’d love to see what you find! And of course, I’d highly recommend Tweet-SQL – it’s a fun little tool if you’d like to analyze Twitter data like who’s following who, who gets retweeted the most, or what you’re missing when you’re gone.
19 Comments. Leave new
I didn’t need data mining to guess that Jorge would be at the top of the list. 🙂
Brent, should it concern us that you, myself and mrdenny are sandwiched between two bot accounts?
HAHAHA, no, @FancyFembot isn’t a bot. She’s a scifi fan in Houston that I ran into in my coworking days.
lol ok that makes me feel a LITTLE better…but still, we’re outtweeting bots. Am I a bot? AM I IN TEH MATRIX??!?
Jorge is Borg. Prepare to be assimilated.
Cool, I’m in the top 50, even without SQLPASS!
I was really disappointed to see PASS wasn’t in there 🙁 I was looking for something specific. If you happen to get the more fuller-est data together, reply in comments or email pretty please.
Oh, and you inspired me to write a beginner level RESTORE WITH MOVE blog: http://midnightdba.itbookworm.com/midnightdba/blog/post/Restore-Database-WITH-MOVE—A-Review.aspx
Smoothies,
Jen
Yeah, I’ll definitely keep trying to capture the historical data. The problem is Twitter’s API – it keeps throwing problems about rate limits even for unlimited accounts. I can’t fetch the old data fast enough. (sigh)
Hey Brent,
How about sticking it up on SQL Azure and handing out a read-only username/password for us all to use before Microsoft start charging for using the service!!
If you don’t want to let me know and I might just do it myself!
-Jamie
Dude, that’s a great idea! If you’ve got the time, go for it – I’m tied up today doing my real job. 😉 If you don’t do it this week, though, I’ll tackle it this weekend.
Okey dokey. I’ll try and tackle it tonight!
OK, done: http://sqlblog.com/blogs/jamie_thomson/archive/2009/12/03/twittercache-now-hosted-on-sql-azure.aspx
I made the top 10? Seriously? Wow. At least I’m not Jorge.
First time connecting to SQL Azure (Thanks Jamie !), impressed by the speed….
Looks like Tweetdeck is by far the source most used (in this data set anyway)
SELECT TOP 20
COUNT(*) #,
CASE
WHEN RIGHT(s.source, 4) = ‘‘
THEN SUBSTRING(source, CHARINDEX(‘nofollow’, source) + 10, LEN(source) – (CHARINDEX(‘nofollow’, source) + 13))
ELSE source
END source
FROM statuses s
GROUP BY source
ORDER BY # DESC
hmm, that query won’t work because I can’t post the html : WHEN RIGHT(s.source,4) = ”
should have this between the quotes : less than / a greater than
(without spaces)
Hi Brent,
Interesting database! I ran this to determine who created Twitter accounts when:
SELECT distinct s.User_Id, u.screen_name
FROM dbo.Statuses s
INNER JOIN dbo.Users u ON s.USER_ID = u.id
order by s.User_Id
:{> Andy
Well, kinda sorta – that also lines up with when I started following them. They may have been tweeting before. But there’s an app for that:
http://www.whendidyoujointwitter.com/
That’s one of those fun chest-beating things. “I joined before you, nyah nyah nyah.” 😉