Tag Archive: tweetsql

Twitter historical database of my tweeps

In my never-ending attempts to distract you from doing real work, I give you something you have absolutely no use for: a SQL Server database backup with about 100k tweets from people I’ve followed over the last couple of months.

I use Tweet-SQL to cache and analyze a lot of things from Twitter.  This database isn’t the actual one I use, but it’s just an export of a subset of tables:

  • Users – the tweeps.  The “id” field is Twitter’s internal number for you, not my own – comes from their API.  The cached_* and subscription_* fields are my own, not Twitter’s.
  • UsersHistory – whenever I fetch results from the Twitter API and someone’s information has changed, I store the old version of their profile in this table.  Typically, the field that’s changing is their followers_count.  The “id” field is my own identity number, not from Twitter’s API.
  • Statuses – the tweets (and yes, Twitter calls them Statuses).  The “id” field is from Twitter’s API.

Sample Queries

This will give you the most loudmouthed tweeps:

SELECT s.user_id, u.screen_name, COUNT(*) AS tweets
FROM dbo.Statuses s
INNER JOIN dbo.Users u ON s.user_id = u.id
GROUP BY s.user_id, u.screen_name
ORDER BY COUNT(*) DESC

Resulting in:

Top 10 Loudmouths

Top 10 Loudmouths

And this query gives you the hours when people tweet the most (in Central time):

SELECT DATEPART(hh, created_at) AS TweetHour, COUNT(*) AS RECS
FROM dbo.Statuses
GROUP BY DATEPART(hh, created_at)
ORDER BY COUNT(*) DESC

Resulting in:

Lively Times of Day

Lively Times of Day

Things to Know About the Data

There’s some holes in the data when my server bombed or the Twitter API didn’t return data correctly, and unfortunately, a lot of those holes are around the PASS Summit.  I wanted to refetch that data before giving you this database, but I’m running out of time and I’ve got other things on my plate, so I figured I’d just let this loose as is.

The database doesn’t include people with protected tweets, and it only includes things I’d see on my home page.  If someone mentioned me but I’m not following them, you won’t see it in this database export.

You can download the SQL Server database backup and restore it onto a SQL 2005 (or newer) server.  If you find anything interesting in the backup, post it here in the comments.  I’d love to see what you find!  And of course, I’d highly recommend Tweet-SQL – it’s a fun little tool if you’d like to analyze Twitter data like who’s following who, who gets retweeted the most, or what you’re missing when you’re gone.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Dump Your Twitter Friends with Tweet-SQL

I’ve been using Twitter for a while, and I decided to prune the number of people I’m following.

But before I start going through my list of friends by hand, why not strip out everybody who hasn’t been active in a certain amount of days?

To help me do that, I’ll be using Tweet-SQL.  It’s a set of stored procedures you can use to call the Twitter API from Microsoft SQL Server using plain T-SQL code.  In today’s examples, I’ve got T-SQL code that will:

  • Call the Twitter API to get the list of people I’m following (my “friends” in Twitter terms)
  • For each friend, fetch their last updates
  • Generate a web page to help me unfollow people

Buckle your seatbelts!

Our Twitter Cache Database Schema

If you read my award-winning (not really) blog post on how to use Tweet-SQL to import a Twitter feed, this schema is going to look pretty familiar.  I’ve started with the same two tables for Users and Statuses, but I’m adding a new table called UserFollows to hold the list of users that a user is following.

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Statuses](
	[id] [INT] NOT NULL,
	[created_at] [datetime] NOT NULL,
	[text] [nvarchar](200) NOT NULL,
	[SOURCE] [nvarchar](160) NOT NULL,
	[truncated] [bit] NOT NULL DEFAULT ((0)),
	[in_reply_to_status_id] [INT] NULL,
	[in_reply_to_user_id] [INT] NULL,
	[favorited] [bit] NOT NULL DEFAULT ((0)),
	[user_id] [INT] NOT NULL,
	[cached_date] [datetime] NOT NULL CONSTRAINT [DF_Statuses_cached_date]
        DEFAULT (getutcdate()),
 CONSTRAINT [PK_Statuses] PRIMARY KEY CLUSTERED
(
	[id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
  IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON,
  ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
 
CREATE TABLE [dbo].[Users](
	[id] [INT] NOT NULL,
	[name] [nvarchar](50) NULL,
	[password] [nvarchar](50) NULL,
	[screen_name] [nvarchar](50) NULL,
	[location] [nvarchar](200) NULL,
	[description] [nvarchar](160) NULL,
	[profile_image_url] [nvarchar](200) NULL,
	[url] [nvarchar](200) NULL,
	[protected] [bit] NULL,
	[followers_count] [INT] NULL,
	[cached_date] [datetime] NOT NULL CONSTRAINT [DF_Users_cached_date]  DEFAULT (getutcdate()),
	[cached_friends_status_id] [INT] NULL,
	[cached_status_id] [INT] NULL,  --New For This Demo!
	[cached_status_date] [datetime] NULL,  --New For This Demo!
 CONSTRAINT [PK_Users] PRIMARY KEY CLUSTERED
(
	[id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
  IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON,
  ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
 
CREATE TABLE [dbo].[UsersFriends](
	[id] [INT] NOT NULL,
	[user_id] [INT] NOT NULL,
	[friend_user_id] [INT] NOT NULL,
	[cached_date] [datetime] NOT NULL CONSTRAINT [DF_UsersFriends_cached_date]  DEFAULT (getutcdate()),
 CONSTRAINT [PK_UsersFriends] PRIMARY KEY CLUSTERED
(
	[id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
  IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON,
  ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

Our code is going to be driven by our Twitter ID# – a number Twitter assigns each user – and it will rely on having our information already in the Users table.  The easiest way to do this is to follow the steps in my last Tweet-SQL blog post about fetching your timeline, but if you’re too lazy, here’s a quick script to do it:

EXEC dbo.tweet_cfg_resultset_send 1;
EXEC dbo.tweet_usr_show 'brento', NULL, NULL;

Those two stored procedures will configure Tweet-SQL to return recordsets, and then go fetch your user information.  The first ID# in the first recordset is your Twitter ID#.

Get Our Friends from the Twitter API

Now that we’ve got our schema set up, we can call the Twitter API to get our list of friends and populate it into the UsersFriends table.  Here’s the code to do it, a stored procedure I like to call usp_ImportFriends:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[usp_ImportFriends]') AND TYPE IN (N'P', N'PC'))
DROP PROCEDURE [dbo].[usp_ImportFriends]
GO
CREATE PROCEDURE [dbo].usp_ImportFriends @import_users_id INT
AS
  SET NOCOUNT ON
 
  DECLARE  @xml                                XML
           ,@handle                            INT
           ,@page                              INT
           ,@optional                          NVARCHAR(50)
           ,@RowsReturned                      INT
           ,@new_cached_friends_status_id INT
           ,@cached_friends_status_id     INT
 
  /* Table to join statuses and users */
  DECLARE  @imported_statuses  TABLE(
                                     id                      INT
                                     ,created_at             DATETIME
                                     ,text                   NVARCHAR(200)
                                     ,SOURCE                 NVARCHAR(160)
                                     ,truncated              BIT
                                     ,in_reply_to_status_id  INT
                                     ,in_reply_to_user_id    INT
                                     ,favorited              BIT
                                     ,user_id                INT
                                     ,user_name              NVARCHAR(50)
                                     ,user_screen_name       NVARCHAR(50)
                                     ,user_location          NVARCHAR(200)
                                     ,user_description       NVARCHAR(160)
                                     ,user_profile_image_url NVARCHAR(200)
                                     ,user_url               NVARCHAR(200)
                                     ,user_protected         BIT
                                     ,user_followers_count   INT
                                     )
 
  SET @page = 1;
  SET @optional = '?page=' + CAST(@page AS NVARCHAR(4));
 
  /* Turn off resultsets from Tweet-SQL because we'll be using XML data */
  EXEC dbo.tweet_cfg_resultset_send
    0;
 
  /* Get the first set of statuses */
  WHILE (@page <= 20)
    BEGIN
      EXEC dbo.tweet_usr_friends
        @import_users_id ,
        @optional ,
        @xml OUTPUT;
 
      EXEC sp_xml_preparedocument
        @handle OUTPUT ,
        @xml;
 
      /* Copy the XML data into a single table to make handling easier */
      INSERT INTO @imported_statuses
                 (id
                  ,created_at
                  ,text
                  ,SOURCE
                  ,truncated
                  ,in_reply_to_status_id
                  ,in_reply_to_user_id
                  ,favorited
                  ,user_id
                  ,user_name
                  ,user_screen_name
                  ,user_location
                  ,user_description
                  ,user_profile_image_url
                  ,user_url
                  ,user_protected
                  ,user_followers_count)
      SELECT DISTINCT id
                      ,dbo.tweet_fnc_dateconvert(created_at)
                      ,text
                      ,SOURCE
                      ,truncated
                      ,in_reply_to_status_id
                      ,in_reply_to_user_id
                      ,favorited
                      ,user_id
                      ,user_name
                      ,user_screen_name
                      ,user_location
                      ,user_description
                      ,user_profile_image_url
                      ,user_url
                      ,user_protected
                      ,user_followers_count
      FROM   OPENXML (@handle, '/users/user', 2)
                WITH (created_at             NVARCHAR(50)  'status/created_at',
                      id                     INT           'status/id',
                      text                   NVARCHAR(200) 'status/text',
                      SOURCE                 NVARCHAR(200) 'status/source',
                      truncated              NVARCHAR(5)   'status/truncated',
                      in_reply_to_status_id  INT           'status/in_reply_to_status_id',
                      in_reply_to_user_id    INT           'status/in_reply_to_user_id',
                      favorited              NVARCHAR(5)   'status/favorited',
                      user_id                INT           'id',
                      user_screen_name       NVARCHAR(200) 'screen_name',
                      user_name              NVARCHAR(200) 'name',
                      user_location          NVARCHAR(200) 'location',
                      user_description       NVARCHAR(200) 'description',
                      user_profile_image_url NVARCHAR(200) 'profile_image_url',
                      user_url               NVARCHAR(200) 'url',
                      user_protected         NVARCHAR(5)   'protected',
                      user_followers_count   INT           'followers_count')
 
      EXEC sp_xml_removedocument
        @handle
 
      /* Insert new members that haven't been cached yet.
         The not-exists clause at the end is to avoid adding duplicate users. */
      INSERT INTO dbo.Users
                 (id
                  ,name
                  ,screen_name
                  ,location
                  ,description
                  ,profile_image_url
                  ,url
                  ,protected
                  ,followers_count)
      SELECT DISTINCT imp.user_id
                      ,user_name
                      ,user_screen_name
                      ,user_location
                      ,user_description
                      ,user_profile_image_url
                      ,user_url
                      ,user_protected
                      ,user_followers_count
      FROM   @imported_statuses imp
             LEFT OUTER JOIN dbo.Users u
               ON imp.user_id = u.id
      WHERE  u.id IS NULL
             AND NOT EXISTS (SELECT *
                             FROM   @imported_statuses impNewer
                             WHERE  imp.user_id = impNewer.user_id
                                    AND imp.id < impNewer.id)
 
      /* Import statuses that haven't been cached yet */
      INSERT INTO dbo.Statuses
                 (id
                  ,created_at
                  ,text
                  ,SOURCE
                  ,truncated
                  ,in_reply_to_status_id
                  ,in_reply_to_user_id
                  ,favorited
                  ,user_id)
      SELECT DISTINCT imp.id
                      ,imp.created_at
                      ,imp.text
                      ,imp.SOURCE
                      ,imp.truncated
                      ,imp.in_reply_to_status_id
                      ,imp.in_reply_to_user_id
                      ,imp.favorited
                      ,imp.user_id
      FROM   @imported_statuses imp
             LEFT OUTER JOIN dbo.Statuses s
               ON imp.id = s.id
      WHERE  s.id IS NULL
             AND imp.id IS NOT NULL
 
      SET @RowsReturned = (SELECT COUNT(* )
                           FROM   @imported_statuses)
      SET @page = @page + 1
      SET @optional = '?page=' + CAST(@page AS NVARCHAR(4));
 
      /* If we retrieved less than 100 rows, then we hit the last page. */
      IF @RowsReturned < 100
        BEGIN
          SET @page = 999
        END
 
      /* Empty out the holding table for the next pass */
      DELETE @imported_statuses
    END
 
  /* Update the Users table with the most recently cached record. */
  UPDATE dbo.Users
    SET cached_friends_status_id = @new_cached_friends_status_id
    WHERE id = @import_users_id
 
  SET NOCOUNT OFF

When you execute this stored procedure, call it with your ID# that we fetched earlier.  Here’s how I do it, since my ID# is 495643:

EXEC dbo.usp_ImportFriends 495643;

This fetches all of your friends, but be aware that it will burn one Twitter API request for each 100 of your friends.  If you’ve got 500 friends, that’s 5 API calls.  There’s a limit to how many API calls you can make per hour.  To get around that, you can request to be whitelisted for the Twitter API, meaning you can call it as often as you want. (My first request got denied, putting in my second one now with more info.)

Twitter has implemented a limit of 2,000 friends.  Some users have been grandfathered in with higher friend limits: if you’re one of those folks, you probably don’t care how many friends you have, but be aware that this stored proc will only fetch up to 2,000 friends, and only call the API up to 20 times.

Problem: Private User Updates Aren’t Included

After running that usp_ImportFriends stored procedure, run this query to show what users haven’t been active:

SELECT u.name, u.screen_name, u.protected,
  (SELECT TOP 1 created_at FROM dbo.Statuses s
   WHERE u.id = s.user_id ORDER BY s.created_at DESC) AS last_status_update
  FROM dbo.Users u
  ORDER BY 4

You’ll find that users who have protected their Twitter stream won’t show a recent status update. Dang! Well, for the purpose of this demo, we’re going to assume that protected users are worth following, and we won’t include them in our analysis.

Finding The Dead Wood

Let’s query the database to fetch user information, and we’ll order it by the time the user sent their last update:

SELECT u.name, u.screen_name, u.location, u.description, u.profile_image_url,
    u.url, u.followers_count, ('http://twitter.com/' + screen_name) AS twitter_url,
	(SELECT TOP 1 created_at FROM dbo.Statuses s
     WHERE u.id = s.user_id ORDER BY s.created_at DESC) AS last_status_update
  FROM dbo.Users u
  WHERE u.protected = 0
  ORDER BY 9
Twitter People To Unfollow

Twitter People To Unfollow, SQL Style

My results are shown at right, and just by glancing at them, I can see that I could easily dump a bunch of people.  For example, @HurricaneIke doesn’t really matter to me anymore, and I can see why that account hasn’t been updating lately.

I could write a T-SQL script to simply unfollow people who haven’t updated in X days, but there are some people in the list that I want to keep no matter what.  If @cwgabriel of Penny Arcade starts tweeting again, I wanna know.

Instead, what I really need is a web page that I can click on when I want to unfollow them.

I know Classic ASP, and I could write a page that would query SQL Server to build the result set, but we’re knee-deep in T-SQL here and it’s time to show off a really ugly T-SQL hack.  Let’s build the whole thing entirely with T-SQL.

Building the Web Page

This last query takes the results from our database and formats them HTML style.  This is horrendous code, and you should never ever do anything like this in production.  Don’t generate HTML inside the database.  It’s a bad idea for a million reasons.

Having said that, let’s do it!  Run this query:

SELECT ‘<p><img src=”‘ + u.profile_image_url + ‘” width=”40″ height=”40″ align=”left”>’
+ ‘<a href=”http://twitter.com/’ + u.screen_name + ‘” target=”_blank”>’ + u.screen_name + ‘</a><br />’
+ u.location + ‘<br />’
+ u.description + ‘<br />’
+ ‘URL: <a href=”‘ + u.url + ‘” target=”_blank”>’ + u.url + ‘</a><br />’
+ ‘Followers: ‘ + CAST(u.followers_count AS VARCHAR(20)) + ‘<br />’
+ ‘Last Update: ‘ + COALESCE((SELECT TOP 1 CAST(created_at AS VARCHAR(20)) FROM dbo.Statuses s
WHERE u.id = s.user_id ORDER BY s.created_at DESC),’Never’) + ‘</p><hr>’
FROM dbo.Users u
WHERE u.protected = 0
ORDER BY (SELECT TOP 1 created_at FROM dbo.Statuses s
WHERE u.id = s.user_id ORDER BY s.created_at DESC)

People to Unfollow

People to Unfollow, HTML Style

You’ll notice that this is not cleanly formatted like the rest of my SQL code examples.  That’s because WordPress, my blog software, starts choking when I mix T-SQL code with HTML in my code examples.  Heck, so does my SQL Server!

When you run that query, you’ll get a list of records that don’t mean much to you.

Open your favorite text editor and start a new file called MyFriends.html.  Put <html><body> at the top on a new line, then copy/paste your results from SQL Server into the text file.  Then add </body></html> at the end of the file, save it, and preview it with your web browser by double-clicking on the file in Windows Explorer.  If you’re lucky, it’ll look like my example at right.

Their name links to their Twitter page, and it opens in a new window so that you can race through the list unfollowing people.

Happy pruning!

By remixing the “ORDER BY” clause of the query, you can also sort by u.followers_count descending.  I find that early in my Twittering, I followed some people who seemed cool just because they were huge, like @scobleizer, but as I get more used to Twitter, I find those people less interesting.

Wondering Who’s Unfollowing You?

If you’re curious about who unfollows you, there’s a service called Qwitter that will email you whenever someone unfollows you.  I’ve had mixed luck with it – it hasn’t sent me an email in months, and I refuse to believe I’m that magnetic.

When Qwitter emails you, it includes the tweet that caused them to unfollow you – the straw that broke the camel’s back, if you will.  Don’t put too much faith into that, because it has to do with the frequency that Qwitter runs.  They might have unfollowed you several hours prior to that particular tweet.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Getting Your Twitter Feed into SQL Server with Tweet-SQL

If you’re reading this blog, you’re probably a SQL Server DBA or developer, and you’re probably into Twitter, so you’re probably going to be interested in Rhys Campbell‘s Tweet-SQL.

Tweet-SQL is a set of SQL Server stored procedures that let you work with the Twitter API just by writing T-SQL code. You can do things like:

  • Get the list of people you’re following into a database
  • Retrieve their messages into a table
  • Analyze who’s following who, and figure out who you should be following
  • Compile metrics about what’s going on out there

There are public services like MrTweet that let you do some of this, but being a DBA, I wasn’t quite satisfied with those. There’s also a free SSIS Twitter Suite by Andy Leonard and Jessica Moss, and I started playing around with that, but I suck at SSIS. I’m pretty good with T-SQL, though, and because I didn’t want to spread my skill set any thinner, I just ate the cost of Tweet-SQL and got started.

Download and Install Tweet-SQL

To get started, you’ll need a 30-day eval copy of Tweet-SQL. It’s free to evaluate, and costs £25 at the time of this writing – roughly $37 US.

Evaluate it before you buy it though, because I’ve had some difficulties getting it up and running on some workstations, especially those where I’m running SQL 2008 under a non-admin domain account on Windows 2008. I’ve ended up running Tweet-SQL in its own virtual machine with Windows 2003 and SQL Server 2005 just to keep things simple.

During setup, Tweet-SQL asks for a username and password – that’s your Twitter information, which is used to connect to Twitter. Before you click OK, click Tweet-SQL at the top of the config screen, and click Deploy. That actually deploys the Tweet-SQL stored procedures to the database of your choosing.

Calling the Twitter API with T-SQL

Start a new query in the database where you deployed Tweet-SQL and execute this code:

EXEC dbo.tweet_cfg_resultset_send 1;
EXEC dbo.tweet_sts_friends_timeline NULL, NULL;
TweetSQL Status Timeline

TweetSQL Status Timeline

The first line enables recordset outputs (instead of XML) and the second line fetches our friends’ timeline from Twitter. You should see something like the window pictured at right – you can click on that to zoom in.

Ignore the first recordset that only has one line in it. The good stuff is in the second two recordsets.

The second recordset lists the status updates. Each line is a single status update. Note the “status_Id” column, second to last one – it starts at 0 and goes up.

The third recordset lists the users who made each status update. It joins to the second recordset on the “status_Id” column. Yes, in a perfect world, these would already be joined together for us, but we’re big bad database people, and that’s not a problem for us. Let’s create some tables to store these results.

Our Twitter Cache Database Schema

In order to pull our data into a database, we’re going to need some tables. For today’s lesson, we’re only going to create two tables:

  • Users – stores the list of people we’ve seen on Twitter
  • Statuses – stores the things they’ve said.

The Twitter API has a call that gets our status timeline, and it includes some basic information about each user. We’re going to kill two birds with one stone: whenever someone tweets, we’re going to store their user profile data in the User table too. There are separate Twitter API calls that will fetch our followers and our friends, but let’s learn to crawl first.

Here’s the script to create the tables. Be aware that I have zero clue what the actual field types are – these may need to be changed over time from nvarchar(200) to nvarchar(240), for example. (Yes, believe it or not, Twitter fields can be longer than 140 characters, as we’ll show later.)

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Statuses](
	[id] [INT] NOT NULL,
	[created_at] [datetime] NOT NULL,
	[text] [nvarchar](200) NOT NULL,
	[SOURCE] [nvarchar](160) NOT NULL,
	[truncated] [bit] NOT NULL DEFAULT ((0)),
	[in_reply_to_status_id] [INT] NULL,
	[in_reply_to_user_id] [INT] NULL,
	[favorited] [bit] NOT NULL DEFAULT ((0)),
	[user_id] [INT] NOT NULL,
	[cached_date] [datetime] NOT NULL CONSTRAINT [DF_Statuses_cached_date]
        DEFAULT (getutcdate()),
 CONSTRAINT [PK_Statuses] PRIMARY KEY CLUSTERED
(
	[id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
  IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON,
  ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
 
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Users](
	[id] [INT] NOT NULL,
	[name] [nvarchar](50) NULL,
	[password] [nvarchar](50) NULL,
	[screen_name] [nvarchar](50) NULL,
	[location] [nvarchar](200) NULL,
	[description] [nvarchar](160) NULL,
	[profile_image_url] [nvarchar](200) NULL,
	[url] [nvarchar](200) NULL,
	[protected] [bit] NULL,
	[followers_count] [INT] NULL,
	[cached_date] [datetime] NOT NULL 
	  CONSTRAINT [DF_Users_cached_date]  DEFAULT (getutcdate()),
	[cached_friends_status_id] [INT] NULL,
 CONSTRAINT [PK_Users] PRIMARY KEY CLUSTERED
(
	[id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
  IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON,
  ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

I’m not doing any fancy indexing here, just the bare minimum to get started.

The Logistics of Importing Twitter Data into SQL Server

Before we go wild and crazy with the code, let’s talk through it in pseudocode. Here’s what we need to accomplish:

  • Call the Twitter API to get our most recent messages in XML format
  • Store that data in a temporary table for faster juggling
  • Insert users into our Users table if they don’t already exist
  • Insert statuses into our Status table if they don’t already exist

We could just write this as a stored procedure, and then call it as a SQL Server Agent job every minute. However, the Twitter API has a couple of limitations that we need to watch out for.

The API only sends up to 200 results back at a time. If we call this every 15 minutes for users who are following a lot of people, we could easily lose messages if more than 200 come through in 15 minutes. We could work around that by calling the API every minute or so, but that leads into our second challenge: we’re only allowed to call the API a certain number of times per hour.

To help work around this, the Twitter API has an extra parameter: we can specify a status ID#, and Twitter can return just the new messages since that ID#. This way, I can call the API once per hour, specify my most recent cached status #, and I’ll only get the new messages. I can also call it with a “page” parameter, so we can get several groups of 200 results one page at a time until we’ve exhausted all of the data.

On our Users table, there’s a cached_friends_status_id field: that’s where we’re going to store the most recent status ID. Why wouldn’t we just query for the max id in the Statuses table? I’m glad you asked, because that shows that you’re smart enough to keep an eye out for Lesson 2, where we’ll get into more advanced uses of these tables.

With that in mind, here’s what our pseudocode will look like:

  • Query the Users table to find the most recent cached status ID for our user
  • Call the Twitter API to get our most recent statuses since the most recent cached status ID
  • Store that data in a temporary table for faster juggling
  • Insert users into our Users table if they don’t already exist
  • Insert statuses into our Status table if they don’t already exist
  • If we retrieved 200 results, there’s probably more, so go back to step 2, but ask for the next page of results.
  • Update the Users table with the most recent cached status ID

The Stored Procedure to Import Twitter Data

With no further ado, here’s the stored procedure:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF  EXISTS (SELECT * FROM sys.objects
  WHERE object_id = OBJECT_ID(N'[dbo].[usp_ImportStatusesFromFriends]')
  AND TYPE IN (N'P', N'PC'))
DROP PROCEDURE [dbo].[usp_ImportStatusesFromFriends]
GO
CREATE PROCEDURE [dbo].[usp_ImportStatusesFromFriends] @import_users_id INT = NULL
AS
  SET NOCOUNT ON
 
  DECLARE  @xml                                XML
           ,@handle                            INT
           ,@page                              INT
           ,@optional                          NVARCHAR(50)
           ,@RowsReturned                      INT
           ,@new_cached_friends_status_id INT
           ,@cached_friends_status_id     INT
 
  /* Table to join statuses and users */
  DECLARE  @imported_statuses  TABLE(
                                     id                      INT
                                     ,created_at             DATETIME
                                     ,text                   NVARCHAR(200)
                                     ,SOURCE                 NVARCHAR(160)
                                     ,truncated              BIT
                                     ,in_reply_to_status_id  INT
                                     ,in_reply_to_user_id    INT
                                     ,favorited              BIT
                                     ,user_id                INT
                                     ,user_name              NVARCHAR(50)
                                     ,user_screen_name       NVARCHAR(50)
                                     ,user_location          NVARCHAR(200)
                                     ,user_description       NVARCHAR(160)
                                     ,user_profile_image_url NVARCHAR(200)
                                     ,user_url               NVARCHAR(200)
                                     ,user_protected         BIT
                                     ,user_followers_count   INT
                                     )
 
  /* Get the last cached status ID if it's available */
  SELECT @cached_friends_status_id = cached_friends_status_id
    FROM dbo.Users
    WHERE id = @import_users_id
  SET @page = 1;
  SET @optional = '?count=200&amp;page=' + CAST(@page AS NVARCHAR(4));
 
  IF @cached_friends_status_id IS NOT NULL
    BEGIN
      SET @optional = @optional + '&amp;since_id='
        + CAST(@cached_friends_status_id AS NVARCHAR(30))
    END
 
  /* Turn off resultsets from Tweet-SQL because we'll be using XML data */
  EXEC dbo.tweet_cfg_resultset_send
    0;
 
  /* Get the first set of statuses */
  WHILE (@page &lt;= 10)
    BEGIN
      EXEC dbo.tweet_sts_friends_timeline
        @optional ,
        @xml OUTPUT;
 
      EXEC sp_xml_preparedocument
        @handle OUTPUT ,
        @xml;
 
      /* Copy the XML data into a single table to make handling easier */
      INSERT INTO @imported_statuses
                 (id
                  ,created_at
                  ,text
                  ,SOURCE
                  ,truncated
                  ,in_reply_to_status_id
                  ,in_reply_to_user_id
                  ,favorited
                  ,user_id
                  ,user_name
                  ,user_screen_name
                  ,user_location
                  ,user_description
                  ,user_profile_image_url
                  ,user_url
                  ,user_protected
                  ,user_followers_count)
      SELECT DISTINCT id
                      ,dbo.tweet_fnc_dateconvert(created_at)
                      ,text
                      ,SOURCE
                      ,truncated
                      ,in_reply_to_status_id
                      ,in_reply_to_user_id
                      ,favorited
                      ,user_id
                      ,user_name
                      ,user_screen_name
                      ,user_location
                      ,user_description
                      ,user_profile_image_url
                      ,user_url
                      ,user_protected
                      ,user_followers_count
      FROM   OPENXML (@handle, '/statuses/status/user', 2)
                WITH (created_at             NVARCHAR(50)  '../created_at',
                      id                     INT           '../id',
                      text                   NVARCHAR(200) '../text',
                      SOURCE                 NVARCHAR(200) '../source',
                      truncated              NVARCHAR(5)   '../truncated',
                      in_reply_to_status_id  INT           '../in_reply_to_status_id',
                      in_reply_to_user_id    INT           '../in_reply_to_user_id',
                      favorited              NVARCHAR(5)   '../favorited',
                      user_id                INT           'id',
                      user_screen_name       NVARCHAR(200) 'screen_name',
                      user_name              NVARCHAR(200) 'name',
                      user_location          NVARCHAR(200) 'location',
                      user_description       NVARCHAR(200) 'description',
                      user_profile_image_url NVARCHAR(200) 'profile_image_url',
                      user_url               NVARCHAR(200) 'url',
                      user_protected         NVARCHAR(5)   'protected',
                      user_followers_count   INT           'followers_count')
 
      EXEC sp_xml_removedocument
        @handle
 
      /* Insert new members that haven't been cached yet.
         The not-exists clause at the end is to avoid adding duplicate users. */
      INSERT INTO dbo.Users
                 (id
                  ,name
                  ,screen_name
                  ,location
                  ,description
                  ,profile_image_url
                  ,url
                  ,protected
                  ,followers_count)
      SELECT DISTINCT imp.user_id
                      ,user_name
                      ,user_screen_name
                      ,user_location
                      ,user_description
                      ,user_profile_image_url
                      ,user_url
                      ,user_protected
                      ,user_followers_count
      FROM   @imported_statuses imp
             LEFT OUTER JOIN dbo.Users u
               ON imp.user_id = u.id
      WHERE  u.id IS NULL
             AND NOT EXISTS (SELECT *
                             FROM   @imported_statuses impNewer
                             WHERE  imp.user_id = impNewer.user_id
                                    AND imp.id &lt; impNewer.id)
 
      /* Import statuses that haven't been cached yet */
      INSERT INTO dbo.Statuses
                 (id
                  ,created_at
                  ,text
                  ,SOURCE
                  ,truncated
                  ,in_reply_to_status_id
                  ,in_reply_to_user_id
                  ,favorited
                  ,user_id)
      SELECT DISTINCT imp.id
                      ,imp.created_at
                      ,imp.text
                      ,imp.SOURCE
                      ,imp.truncated
                      ,imp.in_reply_to_status_id
                      ,imp.in_reply_to_user_id
                      ,imp.favorited
                      ,imp.user_id
      FROM   @imported_statuses imp
             LEFT OUTER JOIN dbo.Statuses s
               ON imp.id = s.id
      WHERE  s.id IS NULL
 
      /* Update the Groups with the most recent messageid */
      SET @RowsReturned = (SELECT COUNT(* )
                           FROM   @imported_statuses)
      IF @page = 1
         AND @RowsReturned &gt;= 1
        BEGIN
          SET @new_cached_friends_status_id = (SELECT   TOP 1 id
                                                    FROM     @imported_statuses
                                                    ORDER BY id DESC)
        END
 
      SET @page = @page + 1
      SET @optional = '?count=200&amp;page=' + CAST(@page AS NVARCHAR(4));
      IF @cached_friends_status_id IS NOT NULL
        BEGIN
          SET @optional = @optional + '&amp;since_id='
            + CAST(@cached_friends_status_id AS NVARCHAR(30))
        END
 
      /* If we retrieved less than 200 rows, then we hit the last page. */
      IF @RowsReturned &lt; 200
        BEGIN
          SET @page = 999
        END
 
      /* Empty out the holding table for the next pass */
      DELETE @imported_statuses
    END
 
  /* Update the Users table with the most recently cached record. */
  UPDATE dbo.Users
    SET cached_friends_status_id = @new_cached_friends_status_id
    WHERE id = @import_users_id
 
  SET NOCOUNT OFF

Are you still with me? Whew, good. I think we lost a few people there.

The stored proc takes an optional @import_users_id parameter – that’s your Twitter numeric user_id. You won’t know it yet, but after you run the stored proc, you can query your Users table to find yourself, assuming you posted a status update recently.

When you run the stored procedure like this:

EXEC [dbo].[usp_ImportStatusesFromFriends]

you should get these messages:

200 OK: everything went awesome.
200 OK: everything went awesome.
200 OK: everything went awesome.
200 OK: everything went awesome.
200 OK: everything went awesome.

That means, well, everything went awesome. Your Users and Statuses tables should now be populated.

At this point, you can get your user ID:

SELECT * FROM dbo.Users WHERE name = 'YourUsernameHere'

And then from that point forward, you can call the stored proc with your user id, thereby enabling the last-cached-id code.

When Things Don’t Go Awesome

I’ve been running this code for a while, and it’s very much alpha code. It may deflower your laptop at a moment’s notice.

When things break, it’s usually because the Twitter API is down or because my database schema doesn’t fully handle Twitter’s fields. For example, I’ve had several cases where they’ve delivered data that’s wider than I expected – those NVARCHAR(200) fields didn’t start out quite that wide. Over time, I’ll come back and revise the table schemas to be more closely aligned with whatever Twitter’s actually storing.

Now that I’m thinking of it, I didn’t test this code on a case sensitive server. I’m usually really good about that. Doh!

Stuff To Do When You’ve Got Twitter Data In Your Database

I’ve just scratched the surface of what you can do with Tweet-SQL. The Twitter API is pretty darned powerful, and it enables you to do things like:

  • Find out who your followers are following, which might be people you also find interesting
  • Find influential people in your Twitter network
  • Get recap reports of what links have been sent out recently, or what hot topics are out there
  • Get peoples’ updates without actually following them directly (aka stalking, hee hee ho ho)
  • Build a web-based equivalent to TweetDeck, with multiple user functionality

For sample scripts, check out SQLCLRNews, Rhys’s blog where he posts how-to examples for Tweet-SQL.

If you found this interesting, I’d suggest following rhyscampbell on Twitter, and of course, me – I’m BrentO.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts