Remember when you were in school, and you had a crush on someone? You would write a note asking them to be your valentine, and you’d ask a mutual friend to pass the note over to them.
The adult equivalent is linked server queries.
When your query needs to get data that’s actually stored on a totally different SQL Server, it’s tempting to use linked server queries. They’re really easy to write: after setting up the linked server, just put the server and database name at the front of your query, and SQL Server takes care of everything for you by connecting to the server that has the data you want.
I’ll demo an example using the open source Stack Overflow database. In this example, the Users table is stored locally, but I need to fetch data from the much larger Posts table, which is stored on another server.
SELECT TOP 10 u.DisplayName, p.*
FROM dbo.Users u
INNER JOIN [RemoteServer].StackOverflow.dbo.Posts p ON u.Id = p.OwnerUserId
WHERE u.UpVotes + u.DownVotes < 0
ORDER BY p.CreationDate DESC;
I’m looking for the most recent posts written by the users with a low number of votes. Performance is catastrophic, taking minutes of query runtime – and no rows are even returned. What two problems are causing this?
Problem #1: linked server queries can copy entire tables across the network.
In this query, SQL Server believes that LOTS of users are going to match, so it decides to just go start fetching rows from the Posts table across the network. Read the query plan from right to left, top to bottom – the thing at the top right was the first thing SQL Server chose to do:
The first thing SQL Server decided to do was a Remote Query – scanning the Posts table across the network, starting from the most recent posts first. The remote SQL Server sends rows, and for each row, the local SQL Server checks to see if the matching user has a low number of votes.
This plan would work great – if SQL Server’s assumptions about a large number of matching Users rows were correct. In some cases, when SQL Server’s assumptions match, queries perform just fine. (That’s really a good chunk of query tuning right there, and as I talk about in my Fundamentals of Query Tuning class, large variances between estimated and actual rows are usually the place to focus on when you’re tuning.)
Unfortunately, that assumption is not correct.
In fact, no users match our criteria.
So the local SQL Server keeps on dragging rows across the network from the remote server, checking its owner one at a time, and eventually exhausts the entire content of the Posts table. Eventually, the plan finishes, and here’s the actual plan:
The linked server query copied the entire Posts table across the network wire, one row at a time.
Yes, that’s 33 minutes and 29 seconds.
Problem #2: linked servers don’t cache data.
If you run the same linked server query a dozen times – even if the rows aren’t changing, even if the database is read-only, even if all twelve queries run at the same time, SQL Server makes a dozen different connections over to the linked server and fetches the data from scratch, every single freakin’ time.
This is another great example of a development pattern that works fine in development, especially with small database sizes – but then falls down dramatically at scale, with larger databases and more concurrent queries.
Even worse, it penalizes both servers involved with the linked server query. It’s hard on the local server, and it’s hard on the remote server that holds the single source of truth for the table.
There are many more reasons linked server queries perform poorly – but those two alone are showstoppers.
So when are linked servers okay?
They’re fine for one-off utility queries, things you only have to do a couple of times. For example, say someone really screws up a table, and you need to restore that table’s contents. Since SQL Server still can’t restore a freakin’ table, a common approach is to:
- Restore the database backup onto a different server
- Pull or push just the rows you need (or all of them) across a linked server connection
That way, you don’t risk accidentally restoring the backup over the entire production database, and you can take your time picking and choosing the specific rows you want.
Otherwise, if you need data from a different SQL Server, I’ve got some tough advice: connect to the server that has the data you want. If you need fast results, that’s the most surefire game in town.
And for the record, about that note your friend passed to me, the answer is no, I do not want to be your valentine. But if maybe you asked me directly, the answer might be different. There’s a lesson in there.