Tag Archive: tsql

Check All Your MSDB Cleanup Jobs With One Query

Are all of your servers set up to clean out their backup history periodically?

Are you sure?

If you looked back at the monitor with “WTF?” written all over your face, I’ll bring you up to speed.  Here’s the problem: SQL Server stores information about completed backups in the MSDB database.  In older versions of SQL Server, that database isn’t indexed very well, and it can become a performance bottleneck during backups and restores.  I’ve written about one particular case where I cut the company’s backup times by 2/3 just by cleaning out their backup history.

To make sure your servers are keeping their MSDB clean, you can use SQL Server Management Studio 2008′s new multi-server-query-execute feature to quickly query all of your servers.  Go read my article on how to query multiple servers at once in SSMS, and then run this query against your SQL Servers:

SELECT TOP 1 backup_start_date
  FROM msdb.dbo.backupset WITH (NOLOCK)
  ORDER BY backup_set_id ASC

The query runs almost instantaneously since it’s sorted by the primary key of the table.  The results of the query are shown here:

Consolidated MSDB Backup History

Consolidated MSDB Backup History

In that screenshot, I can see right away that three of the servers in my lab have problems.  They don’t have cleanup tasks properly configured.

If you see results like this in your own environment, drop what you’re doing and go add MSDB cleanup tasks to your maintenance plans or add sp_delete_backuphistory to your T-SQL maintenance scripts.  The first time it runs, you may run into some horribly long runtimes and locks, unfortunately.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

SQL Server Index Tuning Tip: Identify Overlaps

Performance Tuning 101 - Add More Spoilers

Performance Tuning 101 - Add More Spoilers

If you’ve got performance troubles with an application that stores data in SQL Server, and especially if it’s a home-grown application (not a store-bought app), you can get dramatic performance improvements simply by focusing on some basic indexing techniques.  These tips and tricks pay off more than pouring money into hardware that might look good sitting in the datacenter, but doesn’t really make the application significantly faster.

When I go into a shop to speed up an application I’ve never seen before, two of my favorite quick-hits are from the index performance tuning queries from SQLServerPedia:

  • Find unused indexes – these are indexes the SQL Server engine says it’s not using.  Unused indexes incur a speed penalty because SQL Server still has to add/update the indexes as records change, so they make writes slower.
  • Find missing indexes – these are indexes SQL Server wishes it had available.

I’m not going to cover those in detail this week because I’ve already recorded tutorial videos over at SQLServerPedia for those, but I do want to focus on something these queries won’t pick up.  Sometimes a table has two nearly-identical indexes, and they’re both being used for reads.  Take these two:

CREATE NONCLUSTERED INDEX [IX_RunID_SiteID_DataSource_OutputType_PeriodType] ON [dbo].[MyTable]
([RunID] ASC,
[SiteID] ASC,
[DataSource] ASC,
[OutputType] ASC,
[PeriodType] ASC
)
 
CREATE NONCLUSTERED INDEX [IX_RunID_SiteID_DataSource_OutputType_PeriodType_QuotaItemDriverID] ON [dbo].[MyTable]
([RunID] ASC,
[SiteID] ASC,
[DataSource] ASC,
[OutputType] ASC,
[PeriodType] ASC,
[QuotaItemDriverID] ASC,
)
Performance Tuning 201 - Even Wood 2x4s Can Be Spoilers

Performance Tuning 201 - Even Wood 2x4s Can Be Spoilers

They’re two different indexes, and they’re both getting used – but does that mean we need them both?

They’re very nearly identical – but the second index has one extra field.  When the SQL Server engine gets a query that needs RunID, SiteID, DataSource, OutputType, and PeriodType – but not QuotaItemDriverID – then it will use the first index.  When it gets a query that needs all six fields, then it’ll use the second index.

In cases like this, I prefer to drop that first index and let the slightly bigger index pick up the slack.  Reading a slightly larger index will take slightly more time: if a query didn’t need that QuotaItemDriverID field, it still has to pull it off the disk in order to perform the query.  However, dropping the index pays off during inserts/updates/deletes, because it’s one less index SQL Server has to manage.  It also makes the database smaller, thereby making database maintenance tasks smaller/faster.

If:

  • I have two indexes with the exact same fields in the same order, but
  • One has 1-2 extra fields, and
  • There aren’t include fields, or the include fields are the same

Then I’ll drop the shorter index with extreme prejudice.

When Indexes Have Include Fields

If they have “include” fields, then I’ll merge the include fields between the two indexes to make one index to serve both needs.  Say we have these two indexes:

CREATE NONCLUSTERED INDEX [IX_RunID_SiteID_DataSource_OutputType_PeriodType_Includes] ON [dbo].[MyTable]
([RunID] ASC,
[SiteID] ASC,
[DataSource] ASC,
[OutputType] ASC,
[PeriodType] ASC
) INCLUDE ( [YTDRevenue], [MTDRevenue] )
 
CREATE NONCLUSTERED INDEX [IX_RunID_SiteID_DataSource_OutputType_PeriodType_QuotaItemDriverID_Includes] ON [dbo].[MyTable]([RunID] ASC,
[SiteID] ASC,
[DataSource] ASC,
[OutputType] ASC,
[PeriodType] ASC,
[QuotaItemDriverID] ASC,
) INCLUDE ( [SalespersonID], [MTDRevenue] )

The first index includes the YTDRevenue field, but the second index doesn’t.  If I just drop the first index, then queries that needed that field won’t get the full speed benefits from the second index.  To merge the two indexes, I need to drop both indexes and recreate the second one with the YTDRevenue field included, like this:

CREATE NONCLUSTERED INDEX [IX_RunID_SiteID_DataSource_OutputType_PeriodType_QuotaItemDriverID_Includes] ON [dbo].[MyTable]([RunID] ASC,
[SiteID] ASC,
[DataSource] ASC,
[OutputType] ASC,
[PeriodType] ASC,
[QuotaItemDriverID] ASC,
) INCLUDE ( [SalespersonID], [MTDRevenue], [YTDRevenue] )

In this example, I tacked the YTDRevenue field on to the end of the include field list.  The order of the included fields doesn’t matter, since SQL Server doesn’t sort by those.

Performance Tuning 301 - Beauty Just Adds Weight

Performance Tuning 301 - Beauty Just Adds Weight

Things to Watch Out For

In my examples, I kept things simple by omitting all of the extra indexing options like partitioning and sorting in TempDB.  When doing index tuning in real life, though, you’ll want to check those options to make sure they’re consistent from index to index.

Field order matters in indexes; if two indexes have the same fields but in different order, that doesn’t mean you can drop one of them.

Ideally, after making index changes, we would restart the SQL Server instance to reset the DMV counters that monitor index use.  In reality, though, that’s not so easy to pull off, so we need to log our changes to understand what the changes have been.  After making index changes, log the changes somewhere. I keep the output of the index performance tuning DMV queries in Excel spreadsheets because it’s easier to email those back and forth from machine to machine, especially when I’m consulting. The next time you do index performance tuning on the same database, you can use the historical spreadsheets to determine whether or not your changes worked the way you’d planned.

SQL Server 2008 Query Performance Tuning Distilled

SQL Server 2008 Query Performance Tuning Distilled

Learn More About SQL Server Index Tuning

I really like Grant Fritchey’s book SQL Server 2008 Query Performance Tuning Distilled, and I wrote a book review about it.  I can’t recommend it highly enough, and I’d start there.

If you don’t have the patience to wait for a book, here’s a few more blog posts about performance tuning:

Performance Tuning with Perfmon – how to set up Perfmon, what SQL Server Perfmon counters to track what the indicators mean.

Data Mining Your SQL Server Perfmon Counters – want to take your Performance Monitor statistics to a new level? I wrote an article on SQLServerPedia explaining how to use Microsoft’s free Table Analysis Tools for the Cloud plugins to dive deeply into your data.

Primary Keys and Indexes – I explain the concepts behind keys and indexes using phone books as an example. Indexes have huge impacts on performance, and if you master these you can make your server go a whole lot faster without spending more money.

SQL Server 2005 Setup Checklist – some simple configuration tweaks can get you 20-30% performance increases right from the start without spending any extra money.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

How to Import the StackOverflow XML into SQL Server

Want to play around with the StackOverflow database export?  Here’s how to import the XML files into SQL Server, and some notes about the tables and data schema.

Script to Import StackOverflow XML to SQL Server

This T-SQL script will create six stored procedures:

  • usp_ETL_Load_Badges
  • usp_ETL_Load_Comments
  • usp_ETL_Load_Posts
  • usp_ETL_Load_Users
  • usp_ETL_Load_Votes
  • usp_ETL_Load_PostsTags (which isn’t one of the StackOverflow tables – more on that in a minute)

The XML import code is from an excellent XML tutorial by Denny Cherry.  The scripts create a table (named Badges, Comments, Posts, Users, Votes) for each XML file.  The schema matches the XML file with one exception – I added an identity field to the Badges table.  The rest already had Id fields.  The tables don’t have any indexes to speed querying. I would highly recommend that you not change the schema of any of these tables, because I’ll be giving out more scripts over the coming days and weeks that rely on the base tables.  If you want to add more data, add additional tables.  Plus this will keep your importing clean anyway – you can dump and reload the StackOverflow data repeatedly as long as you keep that data separate.

After importing, the database is about 2gb of data.  Be aware that depending on your database’s recovery model and how you run these stored procs, your log file may be 2gb as well. None of the sentences in this paragraph blend together well, which bothers me but not quite enough to stop publishing the blog entry. Anyway, on we go.

If the table already exists when the stored proc runs, the table contents are deleted using the TRUNCATE TABLE command, which requires hefty permissions.  If you don’t have admin rights on the box, substitute DELETE for the five TRUNCATE TABLE commands.  Using DELETE will take significantly longer to run.  For reference, with TRUNCATE TABLE, the stored procs take around 10 minutes on my faster machines, and around half an hour on my slower virtual machines.

These stored procs only work for the new database dump released on Monday morning, not the one released last week.  If you get invalid XML errors while importing, you’ve got the older database dump.  Go get the fresh hotness.

Now for some schema notes, and I’m going to go out of alphabetical order because everything links back to the Users table.  I’m only going to cover the fields that aren’t immediately obvious:

Users Table

  • Id – primary key, identity field from the original StackOverflow database.  Id 1 is “Community”, which is a special user that denotes community ownership, like wiki questions and answers.
  • LastAccessDate – this is useful because it tells you when the data export was last updated.  If you’re doing queries for things like the last 30 days, check the most recent date here.
  • Age – the user enters this manually, so it’s not terribly reliable as I discovered earlier.
  • AboutMe – I’m using an nvarchar(max) field here, but you can go with a shorter field like nvarchar(2000).
  • UpVotes and DownVotes – the number of votes this user has cast.

Posts Table

In StackOverflow, questions and answers are both considered posts.  If a record has a null ParentId field, then it’s a question.  Otherwise, it’s an answer, and to find the matching question, join the ParentId field up to Posts.Id.

  • Id – primary key, identity field from the original StackOverflow database.
  • Title – the title of the question.  Answer titles will be null.
  • OwnerUserId – joins back to Users.Id.  If OwnerUserId = 1, that’s the community user, meaning it’s a wiki question or answer.
  • AcceptedAnswerId – for questions, this points to the Post.Id of the officially accepted answer.  This isn’t necessarily the highest-voted answer, but the one the questioner accepted.
  • Tags – okay, time to blow out of the bullet points for a second.

StackOverflow limits you to five tags per question (answers aren’t tagged), and all five are stored in this field.  For example, for question 305223, the Tags field is “<offtopic><fun><not-programming-related><jon-skeet>”.  It’s up to you to normalize these.  If you’d like to normalize them out into a child table, check out the usp_ETL_Load_PostsTags stored proc, which creates a PostsTags table with PostId and Tag fields.  Each Posts record (questions only) will then have several child records in PostsTags.

Next, check the contents of the Tag field carefully.  StackOverflow allows periods in the tag, like the .NET tag and ASP.NET tag.  However, in the database, these are stored as “aspûnet”.  Just something to be aware of.

Comments Table

  • Id – primary key, identity field from the original StackOverflow database.
  • PostId – the post parent for this comment.  Joins to the Post.Id field.
  • UserId – who left the comment.  Joins to the User.Id field.

Badges Table

  • Id – an identity field for a primary key.  This number is meaningless – I just added it for some referential integrity.
  • UserId – joins back to Users.Id to show whose badge it is.
  • Name – the name of the Badge, like Teacher or Nice Answer.
  • CreationDate – when the user achieved the badge.

Votes Table

This stores the votes cast on posts, but the key field is VoteTypeId.  The VoteType table wasn’t included in the export, so this table isn’t too useful yet, but if the guys give me the OK I’ll post the contents of that table here.  The Votes table doesn’t include *who* cast the votes, and I’ve got my hands full analyzing the other tables anyway, so I haven’t been interested in the VoteTypes yet.

All of the Id fields except for Badges.Id are from StackOverflow’s original database.  In theory, these numbers will not change, which means if you build your own child table structures like UserBaconPreferences, and you join via User.Id, you should be able to blow away and reload the Users table with every new StackOverflow database dump.  That’s the theory, but in reality, you shouldn’t rely on anybody else’s ID fields, because there’s no reason to believe these won’t completely change down the road.  Who knows – Jeff might switch over to GUIDs as primary keys.

Sample Questions Query

Once you’ve got it all together, you can do some fun stuff. Let’s look at some overall statistics about questions (not answers):

SELECT COALESCE(COUNT(DISTINCT p.ID),0)           AS Questions
       ,COALESCE(AVG(p.Score * 1.00),0)           AS AvgScore
       ,COALESCE(AVG(p.ViewCount * 1.00),0)       AS AvgViewCount
       ,COALESCE(COUNT(DISTINCT p.OwnerUserId),0) AS DistinctQuestioners
       ,COALESCE(AVG(p.AnswerCount * 1.00),0)     AS AvgAnswerCount
       ,COALESCE(AVG(p.CommentCount * 1.00),0)    AS AvgCommentCount
       ,COALESCE(AVG(p.FavoriteCount * 1.00),0)   AS AvgFavoriteCount
       ,COALESCE(COUNT(ClosedDate),0)             AS ClosedQuestions
       ,COALESCE(AVG(u.Reputation * 1.00),0)      AS AvgQuestionerReputation
       ,COALESCE(AVG(u.Age * 1.00),0)             AS AvgQuestionerAge
       ,COALESCE(AVG(u.UpVotes * 1.00),0)         AS AvgQuestionerUpVotes
       ,COALESCE(AVG(u.DownVotes * 1.00),0)       AS AvgQuestionerDownVotes
FROM   dbo.Posts p
       INNER JOIN dbo.Users u
         ON p.OwnerUserId = u.Id
WHERE p.Tags IS NOT NULL

And some of the results are:

  • Questions – 176,137
  • Average Score – 1.89
  • Average View Count – 311
  • Distinct Questioners – 39,795 (meaning anyone who has asked a single question has asked an average of 4.4 questions – there may be some odd stuff in here around anonymous questions though, haven’t looked at that yet)
  • Average Answer Count – 4
  • Average Comment Count – 2.3
  • Closed Questions – 3,656 (or 2% of all questions)
  • Average Questioner Reputation – 1,506
  • Average Questioner Age – 30 (but remember, that’s unreliable)

I’m just getting started playing with it, and I’ll have a fun new StackOverflow statistics toy available for everybody to play with in a couple of days.  In the meantime, you can download the StackOverflow database dump via BitTorrent and download my ETL stored procs.

Update: Sample StackOverflow Queries in the SQLServerPedia Wiki

Jon Skeet had an excellent idea: we need a wiki to store interesting queries.  Wouldn’t you know, I happen to run one!  I added a section in SQLServerPedia for sample StackOverflow database queries.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

The Cost of Log Shipping SQL Server to Amazon S3

Want to know how much it’ll cost to back up your database transaction logs to Amazon S3?  I’ll show you with a snippet of T-SQL.

Note the parameters at the top for retention periods.  The defaults assume one full backup weekly.  DBAs would normally want to do full backups more often, but when you’re paying by the upload, you may want to consider only uploading one full backup per week and then uploading transaction log backups the rest of the time.  You can still take full and differential backups locally for faster local recovery times as long as you set your backup parameters correctly so that you don’t break your log chain.

The @months_to_retain_full_backups parameter says we’re storing two months of full backups in S3.  This gives you some offsite flexibility while minimizing your upload costs.

The upload cost is set at $.10 per gig because that’s Amazon’s long-term price.  There’s a temporary sale going on for $.03 per gig for uploads, but you can’t set your budget by that.

With no further ado, here’s the query.  If it doesn’t work on your system, let me know – this is a really rough draft that I’ve only used on a dozen or so systems, and it wouldn’t surprise me if it’s got bugs.

DECLARE  @upload_cost_per_gb              DECIMAL(6,2),
         @storage_cost_per_gb             DECIMAL(6,2),
         @days_to_retain_log_backups      INT,
         @full_backups_uploaded_per_month INT,
         @months_to_retain_full_backups   INT
 
SET @upload_cost_per_gb = .10
SET @storage_cost_per_gb = .15
SET @days_to_retain_log_backups = 14
SET @full_backups_uploaded_per_month = 4
SET @months_to_retain_full_backups = 2
 
SELECT   bs.database_name,
         COUNT(* ) AS backups,
         SUM(bs.backup_size / 1000000000) AS transaction_backup_size_gb,
         SUM(bs.backup_size / 1000000000 * @upload_cost_per_gb) AS transaction_backup_upload_cost,
         SUM(bs.backup_size / 1000000000 * @storage_cost_per_gb / 30 * @days_to_retain_log_backups)
                   AS transaction_backup_storage_cost,
         (SELECT   TOP 1 (fbs.backup_size / 1000000000)
          FROM     msdb.dbo.backupset fbs
          WHERE    bs.database_name = fbs.database_name
                   AND fbs.type = 'D'
          ORDER BY backup_start_date DESC) AS last_full_backup_size_gb,
         (SELECT   TOP 1 (fbs.backup_size / 1000000000 * @upload_cost_per_gb
                          * @full_backups_uploaded_per_month)
          FROM     msdb.dbo.backupset fbs
          WHERE    bs.database_name = fbs.database_name
                   AND fbs.type = 'D'
          ORDER BY backup_start_date DESC) AS full_backup_upload_cost,
         (SELECT   TOP 1 (fbs.backup_size / 1000000000 * @storage_cost_per_gb
                          * @full_backups_uploaded_per_month * @months_to_retain_full_backups)
          FROM     msdb.dbo.backupset fbs
          WHERE    bs.database_name = fbs.database_name
                   AND fbs.type = 'D'
          ORDER BY backup_start_date DESC) AS full_backup_storage_cost
FROM     msdb.dbo.backupset bs
WHERE    bs.type = 'L' /* log backups only, not diffs or fulls */
         AND bs.backup_start_date &gt;= DATEADD(dd,-30,GETDATE())
GROUP BY bs.database_name
ORDER BY bs.database_name

This query revolves around transaction log backups, but if you only do incrementals, you could change that bs.type = “L” filter in the query to be “I” instead, and you’ll get the cost on incrementals.  I don’t have a query for databases that only do full backups, but frankly, if you’re only doing fulls, your data probably isn’t important enough to get internet-based online backups available.

This query is good for more than just gauging costs, too: keep in mind that your internet connection will need to be able to handle uploading your backups out to Amazon S3.

This is only a ballpark estimate. There are other costs that this query does not cover.  For example, Amazon charges you for some types of requests, like checking directory contents.  These costs can vary, but they’re generally very small relative to the bandwidth and storage costs.  This query also doesn’t check for backup compression, but if you’re paying by the byte for bandwidth and storage, you should seriously consider using backup compression.  The ROI is extremely high: software like Quest LiteSpeed pays for itself here in no time.

If this topic interests you, I’ll be discussing it in much more detail during my “Log Shipping To The Cloud” session at the SSWUG Virtual Conference on Thursday.  (Use VIP code SPVBOZSP09 for $10 off the $125 entry price.)  If you register for the V-Conference, you can watch the sessions whenever it’s convenient for you – not just Thursday.  However, if you happen to watch it during the first showing on Thursday, I’ll be online in the chat room for the session and you can ask me questions live.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Getting the most recent record

Sounds simple, right?  Just grab the max.  But what if you want to use a single T-SQL operation to fetch the most recent record and some of its attributes?

StorminSpank asked:

@BrentO SQL Question. 4 columns, ID, Date, Info1, Info2. ID has multiple entries. Want Latest Date for Distinct ID, but also info1 and 2.

Let’s say our table schema is:

CREATE TABLE [dbo].[TestTable](
	[id] [int] NOT NULL,
	[create_date] [date] NOT NULL,
	[info1] [varchar](50) NOT NULL,
	[info2] [varchar](50) NOT NULL,
 CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED
(
	[id] ASC, [create_date] ASC
) ON [PRIMARY]
) ON [PRIMARY]
GO

And let’s populate it with some data:

INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (1, '1/1/09', 'Blue', 'Green')
INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (1, '1/2/09', 'Red', 'Yellow')
INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (1, '1/3/09', 'Orange', 'Purple')
INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (2, '1/1/09', 'Yellow', 'Blue')
INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (2, '1/5/09', 'Blue', 'Orange')
INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (3, '1/2/09', 'Green', 'Purple')
INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (3, '1/8/09', 'Red', 'Blue')

Now let’s get out the most recent record for each ID:

SELECT tt.*
  FROM dbo.TestTable tt
    LEFT OUTER JOIN dbo.TestTable ttNewer
    ON tt.id = ttNewer.id AND tt.create_date &lt; ttNewer.create_date
  WHERE ttNewer.id IS NULL

That left outer join is looking for any newer TestTable records, but the where clause makes sure there aren’t any. If you switch it around to “IS NOT NULL”, you’d get the exact opposite – you’d get all of the older records.

And remember, ladies and gentlemen, do as I say and not as I do – never SELECT *.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Using aliases in multi-table queries

When joining multiple tables together in a query, use aliases every time on every field.

Say we’ve got two tables, Customers and Salespersons.  The Customers table has a PreferredSalespersonID field that identifies who their normal sales rep is, and that lets us quickly grab the right salesperson when a customer calls in.  We want to display basic information about the customer and their salesperson:

SELECT CompanyName, Address, City, StateID, ZipCode, SalespersonName
FROM dbo.Customers
INNER JOIN dbo.Salespersons ON Customers.PreferredSalespersonID = Salespersons.ID
WHERE Customers.ID = 12345

That works great at first, so we put it in production and off we go.  Months later, someone decides to add an Address field to the Salespersons table.  Boom goes the dynamite – suddenly this query starts failing because SQL Server isn’t sure which City field we mean. We get errors like this:

Msg 209, Level 16, State 1, Line 3
Ambiguous column name 'Address'.

To prevent this problem, always alias every single field.  My personal preference is to use an alias as short as possible, but long enough to explain what the alias points to.  Never alias tables with single letters starting with A, B, C, etc – it’s painful to read those queries.  Instead, consider aliasing the query like this:

SELECT cust.CompanyName, cust.Address, cust.City, cust.StateID, cust.ZipCode, sls.SalespersonName
FROM dbo.Customers cust
INNER JOIN dbo.Salespersons sls ON cust.PreferredSalespersonID = sls.ID
WHERE cust.ID = 12345

If I used one-letter aliases like “c” and “s”, it would still work, but I try to design every query as if I’m going to have to come back to it tomorrow and add three more tables.  If I come back and join more tables in, then the “c” and “s” might be confusing if those new tables also start with C or S.

Finally, when picking aliases, check other stored procedures and views to see if there’s already an aliasing standard in the database.  If the Customers table is already being aliased with “cmr” in other T-SQL code, then reuse that same alias even if it doesn’t make perfect sense.  The more consistent the code, the easier it is to jump from one T-SQL script to another with ease.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

When to Modify Somebody Else’s Code

Tim Ford, aka SQLAgentman, posed a question: when you think you know why something’s not working, when do you modify somebody else’s code, and when do you refuse?  I know, I know!  I’ve got the perfect answer – “It Depends.”

Vendor Apps That Are Under Paid Support

I never touch vendor applications when I’m paying maintenance contracts.  It’s not that I want them to do their job – although I do.  It’s not because I’m lazy – although I am.  The reason is simple: I don’t want to be thrown under the bus later.

If you’re working with an application and you think the database schema was designed by monkeys banging on typewriters, don’t even think about changing the database schema yourself.  If you do, guess what happens when the next version or service pack arrives?  The guys who designed the schema are probably the same guys who designed the installer.  I’ve seen installers that run T-SQL scripts that absolutely depend on the schema being completely identical to the way it was last time.  If there’s any variations, the installer throws an error, and you’d better believe they’re not handling errors correctly either.  It leaves your database in a completely messy state, totally unsupported by the vendor, and then the blame game starts.  You’ll get the shaft, because you tweaked their schema.

Nobody will remember how you made the database faster.  Everybody will remember the DBA who couldn’t keep his hands out of somebody else’s code, and borked it.  Doesn’t matter if you’re right: you’re only as good as your customers think you are.

Vendor Apps That Aren’t Under Maintenance Contracts

Maybe the vendor’s gone belly-up, and you’re just trying to keep the product running.  In that case, sure, modify the schema.  You’re going to own the support yourself anyway. Just make a rollback script right then and there before you do it, and save the rollback script in case something else goes wrong later.

Maybe your company’s about to go belly-up, and you’re not paying maintenance on anything.  In that case, sure, modify the schema, because you gotta do whatever it takes to keep the trains running, or you’ll be laid off for not playing along.

In-House Applications

If the development team consists of people on your payroll, then this is your chance to show value within the company.  Quickly document:

  • What needs to be done
  • Why it will improve performance
  • What you’d do if this DOESN’T improve performance
  • How long it’ll take to do
  • When it can be done (right now, after hours, only at month end, etc)
  • Who can do it (but not who SHOULD do it)

Lay these points out as quickly as possible, in as short of an email as possible, and send it to the product managers.  For that last bullet point, I usually say, “I’m totally fine with doing this myself, but I wanted to communicate it to the team in case someone else would like to do it or wants to sit with me while I do it in order to learn more.”

You want to be as friendly and upbeat as possible, because you’re basically pointing the finger at somebody’s baby and calling it ugly.  Be a part of the solution, not a part of the precipitate.

Open Source Applications

This process is almost guaranteed to fail long term, but it works short term:

  1. Modify open source code
  2. Put it into production in your environment
  3. Try to get it into the trunk branch (the public repository of modifications)

This process is almost guaranteed to work long term, but it can take a while:

  1. Modify open source code
  2. Try to get it into the trunk branch
  3. After it’s released, put it into production in your environment

Try the first route, and you might fork the code and be forever patching your stuff.  I’m a fan of set-it-and-forget-it.  Mostly the forget-it part – I’m a really big fan of that.  So I’ve been burned by tweaking open source code, forgetting about my changes, and then upgrading to the latest version.  Surprise, my stuff is borked, and there’s nobody to blame but me.

Who I’m Tagging

I’m curious to hear what StatisticsIO and Kevin3NF think, especially Jason since he’s done open source work.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Dump Your Twitter Friends with Tweet-SQL

I’ve been using Twitter for a while, and I decided to prune the number of people I’m following.

But before I start going through my list of friends by hand, why not strip out everybody who hasn’t been active in a certain amount of days?

To help me do that, I’ll be using Tweet-SQL.  It’s a set of stored procedures you can use to call the Twitter API from Microsoft SQL Server using plain T-SQL code.  In today’s examples, I’ve got T-SQL code that will:

  • Call the Twitter API to get the list of people I’m following (my “friends” in Twitter terms)
  • For each friend, fetch their last updates
  • Generate a web page to help me unfollow people

Buckle your seatbelts!

Our Twitter Cache Database Schema

If you read my award-winning (not really) blog post on how to use Tweet-SQL to import a Twitter feed, this schema is going to look pretty familiar.  I’ve started with the same two tables for Users and Statuses, but I’m adding a new table called UserFollows to hold the list of users that a user is following.

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Statuses](
	[id] [INT] NOT NULL,
	[created_at] [datetime] NOT NULL,
	[text] [nvarchar](200) NOT NULL,
	[SOURCE] [nvarchar](160) NOT NULL,
	[truncated] [bit] NOT NULL DEFAULT ((0)),
	[in_reply_to_status_id] [INT] NULL,
	[in_reply_to_user_id] [INT] NULL,
	[favorited] [bit] NOT NULL DEFAULT ((0)),
	[user_id] [INT] NOT NULL,
	[cached_date] [datetime] NOT NULL CONSTRAINT [DF_Statuses_cached_date]
        DEFAULT (getutcdate()),
 CONSTRAINT [PK_Statuses] PRIMARY KEY CLUSTERED
(
	[id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
  IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON,
  ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
 
CREATE TABLE [dbo].[Users](
	[id] [INT] NOT NULL,
	[name] [nvarchar](50) NULL,
	[password] [nvarchar](50) NULL,
	[screen_name] [nvarchar](50) NULL,
	[location] [nvarchar](200) NULL,
	[description] [nvarchar](160) NULL,
	[profile_image_url] [nvarchar](200) NULL,
	[url] [nvarchar](200) NULL,
	[protected] [bit] NULL,
	[followers_count] [INT] NULL,
	[cached_date] [datetime] NOT NULL CONSTRAINT [DF_Users_cached_date]  DEFAULT (getutcdate()),
	[cached_friends_status_id] [INT] NULL,
	[cached_status_id] [INT] NULL,  --New For This Demo!
	[cached_status_date] [datetime] NULL,  --New For This Demo!
 CONSTRAINT [PK_Users] PRIMARY KEY CLUSTERED
(
	[id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
  IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON,
  ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
 
CREATE TABLE [dbo].[UsersFriends](
	[id] [INT] NOT NULL,
	[user_id] [INT] NOT NULL,
	[friend_user_id] [INT] NOT NULL,
	[cached_date] [datetime] NOT NULL CONSTRAINT [DF_UsersFriends_cached_date]  DEFAULT (getutcdate()),
 CONSTRAINT [PK_UsersFriends] PRIMARY KEY CLUSTERED
(
	[id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
  IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON,
  ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

Our code is going to be driven by our Twitter ID# – a number Twitter assigns each user – and it will rely on having our information already in the Users table.  The easiest way to do this is to follow the steps in my last Tweet-SQL blog post about fetching your timeline, but if you’re too lazy, here’s a quick script to do it:

EXEC dbo.tweet_cfg_resultset_send 1;
EXEC dbo.tweet_usr_show 'brento', NULL, NULL;

Those two stored procedures will configure Tweet-SQL to return recordsets, and then go fetch your user information.  The first ID# in the first recordset is your Twitter ID#.

Get Our Friends from the Twitter API

Now that we’ve got our schema set up, we can call the Twitter API to get our list of friends and populate it into the UsersFriends table.  Here’s the code to do it, a stored procedure I like to call usp_ImportFriends:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[usp_ImportFriends]') AND TYPE IN (N'P', N'PC'))
DROP PROCEDURE [dbo].[usp_ImportFriends]
GO
CREATE PROCEDURE [dbo].usp_ImportFriends @import_users_id INT
AS
  SET NOCOUNT ON
 
  DECLARE  @xml                                XML
           ,@handle                            INT
           ,@page                              INT
           ,@optional                          NVARCHAR(50)
           ,@RowsReturned                      INT
           ,@new_cached_friends_status_id INT
           ,@cached_friends_status_id     INT
 
  /* Table to join statuses and users */
  DECLARE  @imported_statuses  TABLE(
                                     id                      INT
                                     ,created_at             DATETIME
                                     ,text                   NVARCHAR(200)
                                     ,SOURCE                 NVARCHAR(160)
                                     ,truncated              BIT
                                     ,in_reply_to_status_id  INT
                                     ,in_reply_to_user_id    INT
                                     ,favorited              BIT
                                     ,user_id                INT
                                     ,user_name              NVARCHAR(50)
                                     ,user_screen_name       NVARCHAR(50)
                                     ,user_location          NVARCHAR(200)
                                     ,user_description       NVARCHAR(160)
                                     ,user_profile_image_url NVARCHAR(200)
                                     ,user_url               NVARCHAR(200)
                                     ,user_protected         BIT
                                     ,user_followers_count   INT
                                     )
 
  SET @page = 1;
  SET @optional = '?page=' + CAST(@page AS NVARCHAR(4));
 
  /* Turn off resultsets from Tweet-SQL because we'll be using XML data */
  EXEC dbo.tweet_cfg_resultset_send
    0;
 
  /* Get the first set of statuses */
  WHILE (@page &lt;= 20)
    BEGIN
      EXEC dbo.tweet_usr_friends
        @import_users_id ,
        @optional ,
        @xml OUTPUT;
 
      EXEC sp_xml_preparedocument
        @handle OUTPUT ,
        @xml;
 
      /* Copy the XML data into a single table to make handling easier */
      INSERT INTO @imported_statuses
                 (id
                  ,created_at
                  ,text
                  ,SOURCE
                  ,truncated
                  ,in_reply_to_status_id
                  ,in_reply_to_user_id
                  ,favorited
                  ,user_id
                  ,user_name
                  ,user_screen_name
                  ,user_location
                  ,user_description
                  ,user_profile_image_url
                  ,user_url
                  ,user_protected
                  ,user_followers_count)
      SELECT DISTINCT id
                      ,dbo.tweet_fnc_dateconvert(created_at)
                      ,text
                      ,SOURCE
                      ,truncated
                      ,in_reply_to_status_id
                      ,in_reply_to_user_id
                      ,favorited
                      ,user_id
                      ,user_name
                      ,user_screen_name
                      ,user_location
                      ,user_description
                      ,user_profile_image_url
                      ,user_url
                      ,user_protected
                      ,user_followers_count
      FROM   OPENXML (@handle, '/users/user', 2)
                WITH (created_at             NVARCHAR(50)  'status/created_at',
                      id                     INT           'status/id',
                      text                   NVARCHAR(200) 'status/text',
                      SOURCE                 NVARCHAR(200) 'status/source',
                      truncated              NVARCHAR(5)   'status/truncated',
                      in_reply_to_status_id  INT           'status/in_reply_to_status_id',
                      in_reply_to_user_id    INT           'status/in_reply_to_user_id',
                      favorited              NVARCHAR(5)   'status/favorited',
                      user_id                INT           'id',
                      user_screen_name       NVARCHAR(200) 'screen_name',
                      user_name              NVARCHAR(200) 'name',
                      user_location          NVARCHAR(200) 'location',
                      user_description       NVARCHAR(200) 'description',
                      user_profile_image_url NVARCHAR(200) 'profile_image_url',
                      user_url               NVARCHAR(200) 'url',
                      user_protected         NVARCHAR(5)   'protected',
                      user_followers_count   INT           'followers_count')
 
      EXEC sp_xml_removedocument
        @handle
 
      /* Insert new members that haven't been cached yet.
         The not-exists clause at the end is to avoid adding duplicate users. */
      INSERT INTO dbo.Users
                 (id
                  ,name
                  ,screen_name
                  ,location
                  ,description
                  ,profile_image_url
                  ,url
                  ,protected
                  ,followers_count)
      SELECT DISTINCT imp.user_id
                      ,user_name
                      ,user_screen_name
                      ,user_location
                      ,user_description
                      ,user_profile_image_url
                      ,user_url
                      ,user_protected
                      ,user_followers_count
      FROM   @imported_statuses imp
             LEFT OUTER JOIN dbo.Users u
               ON imp.user_id = u.id
      WHERE  u.id IS NULL
             AND NOT EXISTS (SELECT *
                             FROM   @imported_statuses impNewer
                             WHERE  imp.user_id = impNewer.user_id
                                    AND imp.id &lt; impNewer.id)
 
      /* Import statuses that haven't been cached yet */
      INSERT INTO dbo.Statuses
                 (id
                  ,created_at
                  ,text
                  ,SOURCE
                  ,truncated
                  ,in_reply_to_status_id
                  ,in_reply_to_user_id
                  ,favorited
                  ,user_id)
      SELECT DISTINCT imp.id
                      ,imp.created_at
                      ,imp.text
                      ,imp.SOURCE
                      ,imp.truncated
                      ,imp.in_reply_to_status_id
                      ,imp.in_reply_to_user_id
                      ,imp.favorited
                      ,imp.user_id
      FROM   @imported_statuses imp
             LEFT OUTER JOIN dbo.Statuses s
               ON imp.id = s.id
      WHERE  s.id IS NULL
             AND imp.id IS NOT NULL
 
      SET @RowsReturned = (SELECT COUNT(* )
                           FROM   @imported_statuses)
      SET @page = @page + 1
      SET @optional = '?page=' + CAST(@page AS NVARCHAR(4));
 
      /* If we retrieved less than 100 rows, then we hit the last page. */
      IF @RowsReturned &lt; 100
        BEGIN
          SET @page = 999
        END
 
      /* Empty out the holding table for the next pass */
      DELETE @imported_statuses
    END
 
  /* Update the Users table with the most recently cached record. */
  UPDATE dbo.Users
    SET cached_friends_status_id = @new_cached_friends_status_id
    WHERE id = @import_users_id
 
  SET NOCOUNT OFF

When you execute this stored procedure, call it with your ID# that we fetched earlier.  Here’s how I do it, since my ID# is 495643:

EXEC dbo.usp_ImportFriends 495643;

This fetches all of your friends, but be aware that it will burn one Twitter API request for each 100 of your friends.  If you’ve got 500 friends, that’s 5 API calls.  There’s a limit to how many API calls you can make per hour.  To get around that, you can request to be whitelisted for the Twitter API, meaning you can call it as often as you want. (My first request got denied, putting in my second one now with more info.)

Twitter has implemented a limit of 2,000 friends.  Some users have been grandfathered in with higher friend limits: if you’re one of those folks, you probably don’t care how many friends you have, but be aware that this stored proc will only fetch up to 2,000 friends, and only call the API up to 20 times.

Problem: Private User Updates Aren’t Included

After running that usp_ImportFriends stored procedure, run this query to show what users haven’t been active:

SELECT u.name, u.screen_name, u.protected,
  (SELECT TOP 1 created_at FROM dbo.Statuses s
   WHERE u.id = s.user_id ORDER BY s.created_at DESC) AS last_status_update
  FROM dbo.Users u
  ORDER BY 4

You’ll find that users who have protected their Twitter stream won’t show a recent status update. Dang! Well, for the purpose of this demo, we’re going to assume that protected users are worth following, and we won’t include them in our analysis.

Finding The Dead Wood

Let’s query the database to fetch user information, and we’ll order it by the time the user sent their last update:

SELECT u.name, u.screen_name, u.location, u.description, u.profile_image_url,
    u.url, u.followers_count, ('http://twitter.com/' + screen_name) AS twitter_url,
	(SELECT TOP 1 created_at FROM dbo.Statuses s
     WHERE u.id = s.user_id ORDER BY s.created_at DESC) AS last_status_update
  FROM dbo.Users u
  WHERE u.protected = 0
  ORDER BY 9
Twitter People To Unfollow

Twitter People To Unfollow, SQL Style

My results are shown at right, and just by glancing at them, I can see that I could easily dump a bunch of people.  For example, @HurricaneIke doesn’t really matter to me anymore, and I can see why that account hasn’t been updating lately.

I could write a T-SQL script to simply unfollow people who haven’t updated in X days, but there are some people in the list that I want to keep no matter what.  If @cwgabriel of Penny Arcade starts tweeting again, I wanna know.

Instead, what I really need is a web page that I can click on when I want to unfollow them.

I know Classic ASP, and I could write a page that would query SQL Server to build the result set, but we’re knee-deep in T-SQL here and it’s time to show off a really ugly T-SQL hack.  Let’s build the whole thing entirely with T-SQL.

Building the Web Page

This last query takes the results from our database and formats them HTML style.  This is horrendous code, and you should never ever do anything like this in production.  Don’t generate HTML inside the database.  It’s a bad idea for a million reasons.

Having said that, let’s do it!  Run this query:

SELECT ‘<p><img src=”‘ + u.profile_image_url + ‘” width=”40″ height=”40″ align=”left”>’
+ ‘<a href=”http://twitter.com/’ + u.screen_name + ‘” target=”_blank”>’ + u.screen_name + ‘</a><br />’
+ u.location + ‘<br />’
+ u.description + ‘<br />’
+ ‘URL: <a href=”‘ + u.url + ‘” target=”_blank”>’ + u.url + ‘</a><br />’
+ ‘Followers: ‘ + CAST(u.followers_count AS VARCHAR(20)) + ‘<br />’
+ ‘Last Update: ‘ + COALESCE((SELECT TOP 1 CAST(created_at AS VARCHAR(20)) FROM dbo.Statuses s
WHERE u.id = s.user_id ORDER BY s.created_at DESC),’Never’) + ‘</p><hr>’
FROM dbo.Users u
WHERE u.protected = 0
ORDER BY (SELECT TOP 1 created_at FROM dbo.Statuses s
WHERE u.id = s.user_id ORDER BY s.created_at DESC)

People to Unfollow

People to Unfollow, HTML Style

You’ll notice that this is not cleanly formatted like the rest of my SQL code examples.  That’s because WordPress, my blog software, starts choking when I mix T-SQL code with HTML in my code examples.  Heck, so does my SQL Server!

When you run that query, you’ll get a list of records that don’t mean much to you.

Open your favorite text editor and start a new file called MyFriends.html.  Put <html><body> at the top on a new line, then copy/paste your results from SQL Server into the text file.  Then add </body></html> at the end of the file, save it, and preview it with your web browser by double-clicking on the file in Windows Explorer.  If you’re lucky, it’ll look like my example at right.

Their name links to their Twitter page, and it opens in a new window so that you can race through the list unfollowing people.

Happy pruning!

By remixing the “ORDER BY” clause of the query, you can also sort by u.followers_count descending.  I find that early in my Twittering, I followed some people who seemed cool just because they were huge, like @scobleizer, but as I get more used to Twitter, I find those people less interesting.

Wondering Who’s Unfollowing You?

If you’re curious about who unfollows you, there’s a service called Qwitter that will email you whenever someone unfollows you.  I’ve had mixed luck with it – it hasn’t sent me an email in months, and I refuse to believe I’m that magnetic.

When Qwitter emails you, it includes the tweet that caused them to unfollow you – the straw that broke the camel’s back, if you will.  Don’t put too much faith into that, because it has to do with the frequency that Qwitter runs.  They might have unfollowed you several hours prior to that particular tweet.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Getting Your Twitter Feed into SQL Server with Tweet-SQL

If you’re reading this blog, you’re probably a SQL Server DBA or developer, and you’re probably into Twitter, so you’re probably going to be interested in Rhys Campbell‘s Tweet-SQL.

Tweet-SQL is a set of SQL Server stored procedures that let you work with the Twitter API just by writing T-SQL code. You can do things like:

  • Get the list of people you’re following into a database
  • Retrieve their messages into a table
  • Analyze who’s following who, and figure out who you should be following
  • Compile metrics about what’s going on out there

There are public services like MrTweet that let you do some of this, but being a DBA, I wasn’t quite satisfied with those. There’s also a free SSIS Twitter Suite by Andy Leonard and Jessica Moss, and I started playing around with that, but I suck at SSIS. I’m pretty good with T-SQL, though, and because I didn’t want to spread my skill set any thinner, I just ate the cost of Tweet-SQL and got started.

Download and Install Tweet-SQL

To get started, you’ll need a 30-day eval copy of Tweet-SQL. It’s free to evaluate, and costs £25 at the time of this writing – roughly $37 US.

Evaluate it before you buy it though, because I’ve had some difficulties getting it up and running on some workstations, especially those where I’m running SQL 2008 under a non-admin domain account on Windows 2008. I’ve ended up running Tweet-SQL in its own virtual machine with Windows 2003 and SQL Server 2005 just to keep things simple.

During setup, Tweet-SQL asks for a username and password – that’s your Twitter information, which is used to connect to Twitter. Before you click OK, click Tweet-SQL at the top of the config screen, and click Deploy. That actually deploys the Tweet-SQL stored procedures to the database of your choosing.

Calling the Twitter API with T-SQL

Start a new query in the database where you deployed Tweet-SQL and execute this code:

EXEC dbo.tweet_cfg_resultset_send 1;
EXEC dbo.tweet_sts_friends_timeline NULL, NULL;
TweetSQL Status Timeline

TweetSQL Status Timeline

The first line enables recordset outputs (instead of XML) and the second line fetches our friends’ timeline from Twitter. You should see something like the window pictured at right – you can click on that to zoom in.

Ignore the first recordset that only has one line in it. The good stuff is in the second two recordsets.

The second recordset lists the status updates. Each line is a single status update. Note the “status_Id” column, second to last one – it starts at 0 and goes up.

The third recordset lists the users who made each status update. It joins to the second recordset on the “status_Id” column. Yes, in a perfect world, these would already be joined together for us, but we’re big bad database people, and that’s not a problem for us. Let’s create some tables to store these results.

Our Twitter Cache Database Schema

In order to pull our data into a database, we’re going to need some tables. For today’s lesson, we’re only going to create two tables:

  • Users – stores the list of people we’ve seen on Twitter
  • Statuses – stores the things they’ve said.

The Twitter API has a call that gets our status timeline, and it includes some basic information about each user. We’re going to kill two birds with one stone: whenever someone tweets, we’re going to store their user profile data in the User table too. There are separate Twitter API calls that will fetch our followers and our friends, but let’s learn to crawl first.

Here’s the script to create the tables. Be aware that I have zero clue what the actual field types are – these may need to be changed over time from nvarchar(200) to nvarchar(240), for example. (Yes, believe it or not, Twitter fields can be longer than 140 characters, as we’ll show later.)

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Statuses](
	[id] [INT] NOT NULL,
	[created_at] [datetime] NOT NULL,
	[text] [nvarchar](200) NOT NULL,
	[SOURCE] [nvarchar](160) NOT NULL,
	[truncated] [bit] NOT NULL DEFAULT ((0)),
	[in_reply_to_status_id] [INT] NULL,
	[in_reply_to_user_id] [INT] NULL,
	[favorited] [bit] NOT NULL DEFAULT ((0)),
	[user_id] [INT] NOT NULL,
	[cached_date] [datetime] NOT NULL CONSTRAINT [DF_Statuses_cached_date]
        DEFAULT (getutcdate()),
 CONSTRAINT [PK_Statuses] PRIMARY KEY CLUSTERED
(
	[id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
  IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON,
  ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
 
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Users](
	[id] [INT] NOT NULL,
	[name] [nvarchar](50) NULL,
	[password] [nvarchar](50) NULL,
	[screen_name] [nvarchar](50) NULL,
	[location] [nvarchar](200) NULL,
	[description] [nvarchar](160) NULL,
	[profile_image_url] [nvarchar](200) NULL,
	[url] [nvarchar](200) NULL,
	[protected] [bit] NULL,
	[followers_count] [INT] NULL,
	[cached_date] [datetime] NOT NULL 
	  CONSTRAINT [DF_Users_cached_date]  DEFAULT (getutcdate()),
	[cached_friends_status_id] [INT] NULL,
 CONSTRAINT [PK_Users] PRIMARY KEY CLUSTERED
(
	[id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF,
  IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON,
  ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

I’m not doing any fancy indexing here, just the bare minimum to get started.

The Logistics of Importing Twitter Data into SQL Server

Before we go wild and crazy with the code, let’s talk through it in pseudocode. Here’s what we need to accomplish:

  • Call the Twitter API to get our most recent messages in XML format
  • Store that data in a temporary table for faster juggling
  • Insert users into our Users table if they don’t already exist
  • Insert statuses into our Status table if they don’t already exist

We could just write this as a stored procedure, and then call it as a SQL Server Agent job every minute. However, the Twitter API has a couple of limitations that we need to watch out for.

The API only sends up to 200 results back at a time. If we call this every 15 minutes for users who are following a lot of people, we could easily lose messages if more than 200 come through in 15 minutes. We could work around that by calling the API every minute or so, but that leads into our second challenge: we’re only allowed to call the API a certain number of times per hour.

To help work around this, the Twitter API has an extra parameter: we can specify a status ID#, and Twitter can return just the new messages since that ID#. This way, I can call the API once per hour, specify my most recent cached status #, and I’ll only get the new messages. I can also call it with a “page” parameter, so we can get several groups of 200 results one page at a time until we’ve exhausted all of the data.

On our Users table, there’s a cached_friends_status_id field: that’s where we’re going to store the most recent status ID. Why wouldn’t we just query for the max id in the Statuses table? I’m glad you asked, because that shows that you’re smart enough to keep an eye out for Lesson 2, where we’ll get into more advanced uses of these tables.

With that in mind, here’s what our pseudocode will look like:

  • Query the Users table to find the most recent cached status ID for our user
  • Call the Twitter API to get our most recent statuses since the most recent cached status ID
  • Store that data in a temporary table for faster juggling
  • Insert users into our Users table if they don’t already exist
  • Insert statuses into our Status table if they don’t already exist
  • If we retrieved 200 results, there’s probably more, so go back to step 2, but ask for the next page of results.
  • Update the Users table with the most recent cached status ID

The Stored Procedure to Import Twitter Data

With no further ado, here’s the stored procedure:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF  EXISTS (SELECT * FROM sys.objects
  WHERE object_id = OBJECT_ID(N'[dbo].[usp_ImportStatusesFromFriends]')
  AND TYPE IN (N'P', N'PC'))
DROP PROCEDURE [dbo].[usp_ImportStatusesFromFriends]
GO
CREATE PROCEDURE [dbo].[usp_ImportStatusesFromFriends] @import_users_id INT = NULL
AS
  SET NOCOUNT ON
 
  DECLARE  @xml                                XML
           ,@handle                            INT
           ,@page                              INT
           ,@optional                          NVARCHAR(50)
           ,@RowsReturned                      INT
           ,@new_cached_friends_status_id INT
           ,@cached_friends_status_id     INT
 
  /* Table to join statuses and users */
  DECLARE  @imported_statuses  TABLE(
                                     id                      INT
                                     ,created_at             DATETIME
                                     ,text                   NVARCHAR(200)
                                     ,SOURCE                 NVARCHAR(160)
                                     ,truncated              BIT
                                     ,in_reply_to_status_id  INT
                                     ,in_reply_to_user_id    INT
                                     ,favorited              BIT
                                     ,user_id                INT
                                     ,user_name              NVARCHAR(50)
                                     ,user_screen_name       NVARCHAR(50)
                                     ,user_location          NVARCHAR(200)
                                     ,user_description       NVARCHAR(160)
                                     ,user_profile_image_url NVARCHAR(200)
                                     ,user_url               NVARCHAR(200)
                                     ,user_protected         BIT
                                     ,user_followers_count   INT
                                     )
 
  /* Get the last cached status ID if it's available */
  SELECT @cached_friends_status_id = cached_friends_status_id
    FROM dbo.Users
    WHERE id = @import_users_id
  SET @page = 1;
  SET @optional = '?count=200&amp;page=' + CAST(@page AS NVARCHAR(4));
 
  IF @cached_friends_status_id IS NOT NULL
    BEGIN
      SET @optional = @optional + '&amp;since_id='
        + CAST(@cached_friends_status_id AS NVARCHAR(30))
    END
 
  /* Turn off resultsets from Tweet-SQL because we'll be using XML data */
  EXEC dbo.tweet_cfg_resultset_send
    0;
 
  /* Get the first set of statuses */
  WHILE (@page &lt;= 10)
    BEGIN
      EXEC dbo.tweet_sts_friends_timeline
        @optional ,
        @xml OUTPUT;
 
      EXEC sp_xml_preparedocument
        @handle OUTPUT ,
        @xml;
 
      /* Copy the XML data into a single table to make handling easier */
      INSERT INTO @imported_statuses
                 (id
                  ,created_at
                  ,text
                  ,SOURCE
                  ,truncated
                  ,in_reply_to_status_id
                  ,in_reply_to_user_id
                  ,favorited
                  ,user_id
                  ,user_name
                  ,user_screen_name
                  ,user_location
                  ,user_description
                  ,user_profile_image_url
                  ,user_url
                  ,user_protected
                  ,user_followers_count)
      SELECT DISTINCT id
                      ,dbo.tweet_fnc_dateconvert(created_at)
                      ,text
                      ,SOURCE
                      ,truncated
                      ,in_reply_to_status_id
                      ,in_reply_to_user_id
                      ,favorited
                      ,user_id
                      ,user_name
                      ,user_screen_name
                      ,user_location
                      ,user_description
                      ,user_profile_image_url
                      ,user_url
                      ,user_protected
                      ,user_followers_count
      FROM   OPENXML (@handle, '/statuses/status/user', 2)
                WITH (created_at             NVARCHAR(50)  '../created_at',
                      id                     INT           '../id',
                      text                   NVARCHAR(200) '../text',
                      SOURCE                 NVARCHAR(200) '../source',
                      truncated              NVARCHAR(5)   '../truncated',
                      in_reply_to_status_id  INT           '../in_reply_to_status_id',
                      in_reply_to_user_id    INT           '../in_reply_to_user_id',
                      favorited              NVARCHAR(5)   '../favorited',
                      user_id                INT           'id',
                      user_screen_name       NVARCHAR(200) 'screen_name',
                      user_name              NVARCHAR(200) 'name',
                      user_location          NVARCHAR(200) 'location',
                      user_description       NVARCHAR(200) 'description',
                      user_profile_image_url NVARCHAR(200) 'profile_image_url',
                      user_url               NVARCHAR(200) 'url',
                      user_protected         NVARCHAR(5)   'protected',
                      user_followers_count   INT           'followers_count')
 
      EXEC sp_xml_removedocument
        @handle
 
      /* Insert new members that haven't been cached yet.
         The not-exists clause at the end is to avoid adding duplicate users. */
      INSERT INTO dbo.Users
                 (id
                  ,name
                  ,screen_name
                  ,location
                  ,description
                  ,profile_image_url
                  ,url
                  ,protected
                  ,followers_count)
      SELECT DISTINCT imp.user_id
                      ,user_name
                      ,user_screen_name
                      ,user_location
                      ,user_description
                      ,user_profile_image_url
                      ,user_url
                      ,user_protected
                      ,user_followers_count
      FROM   @imported_statuses imp
             LEFT OUTER JOIN dbo.Users u
               ON imp.user_id = u.id
      WHERE  u.id IS NULL
             AND NOT EXISTS (SELECT *
                             FROM   @imported_statuses impNewer
                             WHERE  imp.user_id = impNewer.user_id
                                    AND imp.id &lt; impNewer.id)
 
      /* Import statuses that haven't been cached yet */
      INSERT INTO dbo.Statuses
                 (id
                  ,created_at
                  ,text
                  ,SOURCE
                  ,truncated
                  ,in_reply_to_status_id
                  ,in_reply_to_user_id
                  ,favorited
                  ,user_id)
      SELECT DISTINCT imp.id
                      ,imp.created_at
                      ,imp.text
                      ,imp.SOURCE
                      ,imp.truncated
                      ,imp.in_reply_to_status_id
                      ,imp.in_reply_to_user_id
                      ,imp.favorited
                      ,imp.user_id
      FROM   @imported_statuses imp
             LEFT OUTER JOIN dbo.Statuses s
               ON imp.id = s.id
      WHERE  s.id IS NULL
 
      /* Update the Groups with the most recent messageid */
      SET @RowsReturned = (SELECT COUNT(* )
                           FROM   @imported_statuses)
      IF @page = 1
         AND @RowsReturned &gt;= 1
        BEGIN
          SET @new_cached_friends_status_id = (SELECT   TOP 1 id
                                                    FROM     @imported_statuses
                                                    ORDER BY id DESC)
        END
 
      SET @page = @page + 1
      SET @optional = '?count=200&amp;page=' + CAST(@page AS NVARCHAR(4));
      IF @cached_friends_status_id IS NOT NULL
        BEGIN
          SET @optional = @optional + '&amp;since_id='
            + CAST(@cached_friends_status_id AS NVARCHAR(30))
        END
 
      /* If we retrieved less than 200 rows, then we hit the last page. */
      IF @RowsReturned &lt; 200
        BEGIN
          SET @page = 999
        END
 
      /* Empty out the holding table for the next pass */
      DELETE @imported_statuses
    END
 
  /* Update the Users table with the most recently cached record. */
  UPDATE dbo.Users
    SET cached_friends_status_id = @new_cached_friends_status_id
    WHERE id = @import_users_id
 
  SET NOCOUNT OFF

Are you still with me? Whew, good. I think we lost a few people there.

The stored proc takes an optional @import_users_id parameter – that’s your Twitter numeric user_id. You won’t know it yet, but after you run the stored proc, you can query your Users table to find yourself, assuming you posted a status update recently.

When you run the stored procedure like this:

EXEC [dbo].[usp_ImportStatusesFromFriends]

you should get these messages:

200 OK: everything went awesome.
200 OK: everything went awesome.
200 OK: everything went awesome.
200 OK: everything went awesome.
200 OK: everything went awesome.

That means, well, everything went awesome. Your Users and Statuses tables should now be populated.

At this point, you can get your user ID:

SELECT * FROM dbo.Users WHERE name = 'YourUsernameHere'

And then from that point forward, you can call the stored proc with your user id, thereby enabling the last-cached-id code.

When Things Don’t Go Awesome

I’ve been running this code for a while, and it’s very much alpha code. It may deflower your laptop at a moment’s notice.

When things break, it’s usually because the Twitter API is down or because my database schema doesn’t fully handle Twitter’s fields. For example, I’ve had several cases where they’ve delivered data that’s wider than I expected – those NVARCHAR(200) fields didn’t start out quite that wide. Over time, I’ll come back and revise the table schemas to be more closely aligned with whatever Twitter’s actually storing.

Now that I’m thinking of it, I didn’t test this code on a case sensitive server. I’m usually really good about that. Doh!

Stuff To Do When You’ve Got Twitter Data In Your Database

I’ve just scratched the surface of what you can do with Tweet-SQL. The Twitter API is pretty darned powerful, and it enables you to do things like:

  • Find out who your followers are following, which might be people you also find interesting
  • Find influential people in your Twitter network
  • Get recap reports of what links have been sent out recently, or what hot topics are out there
  • Get peoples’ updates without actually following them directly (aka stalking, hee hee ho ho)
  • Build a web-based equivalent to TweetDeck, with multiple user functionality

For sample scripts, check out SQLCLRNews, Rhys’s blog where he posts how-to examples for Tweet-SQL.

If you found this interesting, I’d suggest following rhyscampbell on Twitter, and of course, me – I’m BrentO.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Triggers Need to Handle Multiple Records

On the StackOverflow podcast this week, I caught the guys talking about a SQL Server trigger question and I immediately panicked.  I think nine out of ten times when I see a trigger, it’s written incorrectly, and sure enough, this one had dangerously incorrect answers.

Triggers: You’re Doing It Wrong

When writing a trigger, it’s easy to think in terms of individual records being inserted, updated or deleted.  Developers and DBAs say to themselves, “I’m going to write a trigger to handle the record that just got inserted.”  They write triggers like this below one, which is supposed to only insert records if SomeNumericField is actually numeric:

CREATE TRIGGER mytrigger ON sometable
INSTEAD OF INSERT
AS BEGIN
  DECLARE @isnum TINYINT;
  SELECT @isnum = ISNUMERIC(SomeNumericField) FROM inserted;
  IF (@isnum = 1)
    INSERT INTO sometable SELECT * FROM inserted;
  ELSE
    RAISERROR('SomeNumericField must be numeric', 16, 1)
      WITH SETERROR;
END

Examine that row in bold.  What will happen in the trigger if we insert multiple records simultaneously, like if we run the following T-SQL statement?

INSERT INTO sometable SELECT * FROM someothertable

And what if some rows are valid, and some rows are not?  Will we be able to predict the ISNUMERIC(SomeNumericField) result?  Probably not.  We can rewrite this two ways: either letting the good rows in, or not letting any rows in at all.

Trigger that Rejects Only the Bad Rows

Let’s rewrite it so that we insert the good rows, but reject the bad ones.  I’m just using pseudocode here – you would want to clean this up for production:

CREATE TRIGGER mytrigger ON sometable
INSTEAD OF INSERT
AS BEGIN
  --First we insert the good rows
  INSERT INTO sometable SELECT * FROM inserted WHERE ISNUMERIC(SomeNumericField) = 1

  --Then we check to see if we had any bad rows
  IF EXISTS (SELECT TOP 1 * FROM inserted WHERE ISNUMERIC(SomeNumericField) = 0)
    RAISERROR('SomeNumericField must be numeric.  Some rows were not inserted.', 16, 1)
      WITH SETERROR;
END

So this trigger inserted our good rows where SomeNumericField is numeric, and then passed a message back to our code if any records were not inserted.  This is kinda dangerous, because our application won’t know which rows succeeded and which ones failed.  Ideally, we’d track which ones failed, so let’s add code to handle that:

CREATE TRIGGER mytrigger ON sometable
INSTEAD OF INSERT
AS BEGIN
  --First we insert the good rows
  INSERT INTO sometable SELECT * FROM inserted WHERE ISNUMERIC(SomeNumericField) = 1

  --Then we insert the bad rows into a tracking table
  INSERT INTO someRejectTable SELECT * FROM inserted WHERE ISNUMERIC(SomeNumericField) = 0

  --Then we check to see if we had any bad rows
  IF EXISTS (SELECT TOP 1 * FROM inserted WHERE ISNUMERIC(SomeNumericField) = 0)
    RAISERROR('SomeNumericField must be numeric.  Some rows were not inserted.', 16, 1)
      WITH SETERROR;
END

When you design someRejectTable, all of the fields should be nvarchars, because we can’t predict what kind of data we’re going to be inserting.  After all, the whole reason we’re dumping data into here is because SomeNumericField, which was supposed to be numeric, isn’t actually numeric.

This technique works well for data warehouses where we bring data in from text files and we can’t predict what will come in.  Sometimes people change input file formats, and what used to be a numeric field now has text in it – item numbers are a great example.

Trigger that Rejects Everything if One Row is Bad

If we need to know absolutely everything is okay before we insert our data, here’s how to do it (again, this is pseudocode):

CREATE TRIGGER mytrigger ON sometable
INSTEAD OF INSERT
AS BEGIN
  --Firsts we check to see if we had any bad rows
  IF EXISTS (SELECT TOP 1 * FROM inserted WHERE ISNUMERIC(SomeNumericField) = 0)
    RAISERROR('SomeNumericField must be numeric.  Nothing was inserted.', 16, 1)
      WITH SETERROR;
  ELSE
    --We insert everything, since we know it's all good
    INSERT INTO sometable SELECT * FROM inserted

END


In this code, we check all of the rows first, then if it succeeds, we insert the data.

More Reading on Trigger Development

If this stuff comes as a surprise to you, don’t panic: there’s a really good resource on getting started with T-SQL programming that covers this important issue and a lot of other gotchas.  Check out Beginning SQL Server 2005 Programming by Robert Vieira.  You can preview it in Google Books, and buy it from Amazon.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts