Blog

In my never-ending attempts to distract you from doing real work, I give you something you have absolutely no use for: a SQL Server database backup with about 100k tweets from people I’ve followed over the last couple of months.

I use Tweet-SQL to cache and analyze a lot of things from Twitter.  This database isn’t the actual one I use, but it’s just an export of a subset of tables:

  • Users – the tweeps.  The “id” field is Twitter’s internal number for you, not my own – comes from their API.  The cached_* and subscription_* fields are my own, not Twitter’s.
  • UsersHistory – whenever I fetch results from the Twitter API and someone’s information has changed, I store the old version of their profile in this table.  Typically, the field that’s changing is their followers_count.  The “id” field is my own identity number, not from Twitter’s API.
  • Statuses – the tweets (and yes, Twitter calls them Statuses).  The “id” field is from Twitter’s API.

Sample Queries

This will give you the most loudmouthed tweeps:

SELECT s.user_id, u.screen_name, COUNT(*) AS tweets
FROM dbo.Statuses s
INNER JOIN dbo.Users u ON s.user_id = u.id
GROUP BY s.user_id, u.screen_name
ORDER BY COUNT(*) DESC

Resulting in:

Top 10 Loudmouths

Top 10 Loudmouths

And this query gives you the hours when people tweet the most (in Central time):

SELECT DATEPART(hh, created_at) AS TweetHour, COUNT(*) AS RECS
FROM dbo.Statuses
GROUP BY DATEPART(hh, created_at)
ORDER BY COUNT(*) DESC

Resulting in:

Lively Times of Day

Lively Times of Day

Things to Know About the Data

There’s some holes in the data when my server bombed or the Twitter API didn’t return data correctly, and unfortunately, a lot of those holes are around the PASS Summit.  I wanted to refetch that data before giving you this database, but I’m running out of time and I’ve got other things on my plate, so I figured I’d just let this loose as is.

The database doesn’t include people with protected tweets, and it only includes things I’d see on my home page.  If someone mentioned me but I’m not following them, you won’t see it in this database export.

You can download the SQL Server database backup and restore it onto a SQL 2005 (or newer) server.  If you find anything interesting in the backup, post it here in the comments.  I’d love to see what you find!  And of course, I’d highly recommend Tweet-SQL – it’s a fun little tool if you’d like to analyze Twitter data like who’s following who, who gets retweeted the most, or what you’re missing when you’re gone.

↑ Back to top
  1. I didn’t need data mining to guess that Jorge would be at the top of the list. :)

  2. Brent, should it concern us that you, myself and mrdenny are sandwiched between two bot accounts?

  3. Jorge is Borg. Prepare to be assimilated.

  4. Cool, I’m in the top 50, even without SQLPASS!

    I was really disappointed to see PASS wasn’t in there :( I was looking for something specific. If you happen to get the more fuller-est data together, reply in comments or email pretty please.

    Oh, and you inspired me to write a beginner level RESTORE WITH MOVE blog: http://midnightdba.itbookworm.com/midnightdba/blog/post/Restore-Database-WITH-MOVE—A-Review.aspx
    Smoothies,
    Jen

    • Yeah, I’ll definitely keep trying to capture the historical data. The problem is Twitter’s API – it keeps throwing problems about rate limits even for unlimited accounts. I can’t fetch the old data fast enough. (sigh)

  5. Hey Brent,
    How about sticking it up on SQL Azure and handing out a read-only username/password for us all to use before Microsoft start charging for using the service!!

    If you don’t want to let me know and I might just do it myself!
    -Jamie

  6. I made the top 10? Seriously? Wow. At least I’m not Jorge.

  7. Pingback: SSIS Junkie : TwitterCache now hosted on SQL Azure

  8. First time connecting to SQL Azure (Thanks Jamie !), impressed by the speed….
    Looks like Tweetdeck is by far the source most used (in this data set anyway)

    SELECT TOP 20
    COUNT(*) #,
    CASE
    WHEN RIGHT(s.source, 4) = ‘‘
    THEN SUBSTRING(source, CHARINDEX(‘nofollow’, source) + 10, LEN(source) – (CHARINDEX(‘nofollow’, source) + 13))
    ELSE source
    END source
    FROM statuses s
    GROUP BY source
    ORDER BY # DESC

  9. Pingback: Weekly Link Post 122 « Rhonda Tipton’s WebLog

  10. Pingback: SSIS Junkie : Tweetpoll and RESTful Northwind go bye-bye

  11. Hi Brent,

    Interesting database! I ran this to determine who created Twitter accounts when:

    SELECT distinct s.User_Id, u.screen_name
    FROM dbo.Statuses s
    INNER JOIN dbo.Users u ON s.USER_ID = u.id
    order by s.User_Id

    :{> Andy

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

css.php