How to Tell If You’re Good At SQL Server Transactional Replication (Quiz)

Let’s play the quiz! We’re using the honor system: be honest with yourself.

Five Simple Questions

  1. When I say the first name “Hilary”, you immediately know his last name is ____.
  2. Have you successfully set up replication so that you can add and remove articles without running a snapshot for everything, and done this before? (Yes or No). _____
  3. Do you have monitoring set up that notifies you immediately if a subscriber is behind? (Yes or No) ____
  4. How many books on SQL Server replication do you own (which you have read)? _____
  5. Do you know exactly how long it would take to re-initialize all your publications? (Yes or No) ____
Lego builder shutterstock_119038624

This is totally stable!

Score Your Answers

  1. One point if you thought of this man.
  2. One point for Yes
  3. One point for Yes
  4. One point if the answer is 1 or higher.
  5. One point if your answer is “I know exactly how long, and I wish I didn’t know.”

Here’s how to interpret your score:

  • 1-3 points: You’re just beginning, but you cared enough to take the quiz. You can get there.
  • 4 points: You’re good at replication, AND you’re honest ;)
  • 5 points: You’re an expert

Learn More

Ready to learn more about replication? We’ve written a lot about it.

Should I Run sp_recompile After I Create An Index?

Making index changes in SQL Server is tricky: you immediately want to know if the new index helped your performance and if it’s being used, but SQL Server execution plans and their related statistics can be are insanely confusing.

It can be useful to run sp_recompile after you create an index, but not necessarily for the reason you might think.

It’s easier to show this than just write it out.

Let’s Fix a Slow Query By Creating an Index

Let’s say the biggest, baddest query on our SQL Server is from a stored procedure called dbo.kl_RecentlyCreatedAnswers. Our free tool, sp_BlitzCache, calls this procedure out for being our #1 CPU user:

exec sp_BlitzCache @top=5, @results='narrow';

blitzcache before

I can fix that. I design and test this index, and I deploy it to production. It’s totally going to fix this bad execution plan!

ON [dbo].[Posts] ([CreationDate],[PostTypeId],[Score])

I’m excited to see how awesome things are, so I immediately run sp_BlitzCache again. But here’s what I see:

blitzcache after

Wait a second. But I just… It hasn’t even changed… What the heck, SQL Server?

Why Is My Terrible Plan Still In Cache?

Creating the index doesn’t cause SQL Server to find related plans in the cache right away and flush them out. My execution plans will hang out in the cache until one of these things happens:

  • The query runs again. SQL Server sees that the schema on the table has changed and decides it needs to reconsider what to do, it recompiles the execution plan*.
  • The old plan gets “aged” out of cache if it isn’t used again (maybe pretty fast if there’s a bunch of memory pressure)
  • The plan is cleared by DBCC FREEPROCACHE, taking the database offline, a SQL Server restart, or a settings change that impacts the plan cache

*The fine print: Books Online lists a list of causes of recompilation here– note that creating an index on the table isn’t necessarily guaranteed by the list. However, the amazing Nacho Portillo recently blogged on this after looking at the source code and indicates that creating an index does flip a ‘schema changed’ bit that should reliably always trigger a recompile. He also mentions that there’s really no way to query all the plans that are still in the cache but are basically ‘invalidated’ due to the metadata change. Sorry, rocket scientists.

But My Plan Is STILL In Cache. Sort Of. Remember When I Said This Was Confusing?

Once the query runs again, I see something different. It did automatically decide to use my new index!

blitzcache after after

Wait a second. Something’s weird. Compare the average executions and CPU for the stored procedure (line 1) and the statement in it (line 2). They don’t seem to match up, do they?

Here’s what happened: the stored procedure ran again. The statement detected the schema change and recompiled. But the *whole* stored procedure didn’t recompile, and it’s showing me stats for 13 executions (not just the 10 since the index change). So my old performance metrics are all mixed up with my new performance metrics. I’m not loving that.

sp_recompile Can Help

Confusing, right? Because of this issue, you might want to run sp_recompile against the stored procedure after making an index change, even if it decided to use it. This forces the whole procedure to get a fresh plan and start collecting fresh execution statistics the next time it runs.

You could also take a heavier hand and run sp_recompile against the whole table, but do that with care: it requires schema level locks and can cause long blocking changes if lots of queries are reading and writing from that table.

Remember: even with sp_recompile, the execution plan stays in cache until it runs again (or is evicted for other reasons). The benefit is just that it will give you a “fresher” view of the execution stats for the whole stored procedure.

Fact: It’s a Little Messy

The main thing to know here is that creating indexes won’t drop or flush plans out, so don’t be surprised if you see old plans in execution plan analysis after you add indexing changes. This isn’t a completely tidy process, sometimes things are a little bit messy.

If you’re actively looking at execution plans in your cache, then running sp_recompile after you create an index can help ensure you’re looking at consistent data. But use it with care and monitor for blocking– don’t leave it unattended.

This example used a downloaded copy of the StackOverflow database. Learn how to get your own here.

Watch This Week’s Webcast Today (and Win a Prize Tomorrow)

We always like to innovate — not just with the solutions we design for our consulting customers and in how we teach, but in our free videos, too.

Our YouTube channel has become super popular. Lots of folks watch the recordings of our live webcasts. We stopped recently and asked, “How can we make this even better for the folks who attend our live event?” And we realized: we can give you more time to ask questions about that week’s training topic!

Here’s your mission:

  1. Watch the video below today. We won’t be presenting this live this week or re-covering the material from the video, we’re doing more advanced QA for the folks who’ve already watched it.
  2. Note down questions or comments you have on this post. (This is totally optional, but it means you won’t forget your question and it’s more likely we have time to talk about it with you.)
  3. Attend the live webcast on Tuesday at the normal time (11:30 am Central). Register here.
  4. During the first 10 minutes of the webcast, we’ll give away a prize– but you must be present to win!

The live discussion of the video and Q&A won’t be recorded and published, and you also need to be present to win the prize. See you on Tuesday!

Changing SQL Server Editions: Standard, Enterprise, Evaluation and More

“Enterprise Edition was installed for SQL Server, but it turns out that we only have a license for Standard Edition. Is that an easy change?”

I see this question a lot. The answer is well documented by Microsoft, but it seems to be really hard for folks to find! If you’d like to go straight to the source, everything I’m going to highlight here comes from the MSDN page Supported Version and Edition Upgrades.

Sometimes Edition Upgrades (SKUUPGRADES) are simple

If you want to make a supported edition change, it takes a little downtime but isn’t all that tricky. You run SQL Server Setup and just follow the steps in the Procedure section here.

Edition Upgrade SQL Server 2012

“How complicated can this be?” … Famous Last Words

Protip: The Edition Upgrade GUI lets you see and copy the current license key for that instance of SQL Server. (No, I’m not showing a screenshot with my key in it!)

You can also do this from the command line using the SKUUPGRADE parameter (and back in SQL Server 2005 and prior, that was your only option).

Changing the edition causes some downtime, but it’s a simple procedure. The fact that it’s relatively simple isn’t an excuse to skip testing: always run through this outside of production first so you know exactly what to expect. And always, always, always take your backups and make sure they’re on separate storage before you start. Document everything you need to know about your configuration just in case something goes wrong and you’ve got to reinstall.

It’s pretty simple. Except when it’s not supported.

What Goes Up Does Not Necessarily Come Down

The way I usually remember the rules is that you can typically change from a cheaper version to a more expensive version. But you can’t necessarily go from a more expensive version to a cheaper version.

So if you have SQL Server Enterprise Edition and you want to change to Standard Edition, a simple SKUUPGRADE isn’t going to work for you. (If you have the “Evaluation” Enterprise Edition, you’re probably OK though!) Check the chart for what you want to do to make sure.

Clusters are Special. (Not in a good way in this case.)

A lot of the confusion is around SQL Servers installed on failover clusters. You have to scroll waaaaay down on that page to see this:

The fine print on upgrading failover clusters

The fine print on upgrading failover clusters

Ouch! Changing the edition of a clustered SQL Server is not a simple thing.

While I’ve made you uncomfortable, check out KB 2547273, “You cannot add or remove features to a SQL Server 2008, SQL Server 2008 R2, or SQL Server 2012 failover cluster”.

What if I Don’t Know What Edition I Need?

Typically the answer here is to use Evaluation Edition. But if you’re running a failover cluster, be careful– as you can see above, you can’t easily change from Enterprise Evaluation to Standard Edition.

Will CHANGING THE EDITION Reset My Service Packs?

I believe this used to be true on SQL Server 2005– if you changed editions, you’d have to reapply service packs and cumulative updates afterward.

I just ran a test on SQL Server 2012 and upgraded from Developer Edition to Enterprise Edition on a test instance, and I still had version 11.0.3431 (Service Pack 1, Cumulative Update 10) after the upgrade.

But like I said, test this out with your version, even if it’s using a quick virtual environment that you don’t keep after the change has been completed successfully. There’s other real perks to doing this as well, such as making sure that your license keys really work and are the edition you think they are!

What If My Change Isn’t Supported By the GUI / Upgrade Installer?

In this case, you need to uninstall and reinstall SQL Server. It’s going to take longer and cause more downtime. You’ll have to reconfigure everything and reinstall service packs. (It’s not actually that much extra work, because you were going to take those backups and document all the special configuration just in case, right?)

What if I Can’t Take Much Downtime?

If downtime is a real issue, don’t make this change in place. Set up a new SQL instance, test the heck out of it, get it into monitoring and plan a way to migrate to it with limited downtime using something like Database Mirroring. (If you’re considering this, read that link– it mentions that mixing editions between database mirroring partners isn’t supported by Microsoft. You can’t set it up through the GUI, you have to use TSQL. If that makes you paranoid, you could do the migration with log shipping.)

The Top 3 Trickiest Features in SQL Server (Twitter Poll)

I’ve worked with a lot of features in SQL Server. I know what I think is tricky and more difficult than it looks like at first. But experiences vary, right?

So I asked the Twitterverse, “What are the Top 3 Trickiest Features in SQL Server?”  Here’s what I heard back.

#1: Replication

SQL Server Replication “wins” the top spot for being mentioned by the most people. Maybe it won because it’s touched the hearts of the most people since it works with Standard Edition. Maybe it’s just been in the product long enough to have tricked lots of us?

#2: Availability Groups

Coming in second is SQL Server Availability Groups. These may have only been with us since SQL Server 2012, but their complexity has impressed quite a few people already.

#3: DBAs

The number three place goes to a feature I hadn’t thought of myself… database administrators themselves. I laughed out loud when I saw these tweets, but, well, there’s some truth in it. We are a tricksy bunch!

Honorable Mentions

Other top tricky features that came up:

  • Service Broker (guessing they worked at MySpace)
  • Security
  • SSIS (oh, the clicking! the clicking!)
  • SQL Server Clustering
  • XML
  • CLR
  • Active Directory (ah, Kerberos authentication, you devil you)
  • Resource Governor (someone actually used Resource Governor!?!?!)
  • Extended Events
  • SAN Administrators
  • Enterprise Architects

My Personal Top 3

Turns out I’m not so different from the Twitter community. My personal top three trickiest features are: Availability Groups, Replication, and Service Broker. (I’m not really all that into queues in SQL Server, but I do like Event Notifications, which use the Broker framework.)

What are yours?

Learning By Doing: How We’re Innovating In Our Seattle Course, “Make SQL Apps Go Faster”

A team challenge in our Chicago classes in 2013

A team challenge in our training in Chicago, 2013

As teachers, we’re always working to maximize the skills that students can learn in a given amount of time.

Many students “learn by doing.” But what’s the best way to do this?

You may immediately think of lab exercises as an option. But labs are treacherous to implement in a classroom environment: ten minutes after you’ve begun, the person next to you is suffering from unexpected reboots, you’re not sure why the scripts aren’t working properly for you, and the fellow behind you is somehow already done. Even under the best of circumstances, labs don’t move at the same pace for everyone. By lunchtime, you’re bored and frustrated.

We’ve got a better way.

We’re building two very cool Challenges in our two day Seattle course, Make SQL Apps Go Faster, which will be held just prior to the SQL PASS Summit in 2014.

Challenge 1: Learn to Diagnose a Server

We work with SQL Server in two different ways: we’re consultants and teachers. The two parts of our business enrich one another. Through consulting we constantly expand our experience and understanding of the real challenges people face. In the classroom we’re always refining our methods to help people learn faster. Now we’re bringing these two things even closer together.

In our upcoming two day course in Seattle, “Make SQL Apps Go Faster”, we’re using challenges to help our students learn more and really engage with the class. For the first day, students will get dynamic management view information for a SQL Server with a problematic workload. It will contain information just like you can gather with our free tools in the real world:

Your challenge: write down what you’d do to tackle this environment.

We won’t give away all the answers. We’ll train you on these topics using slightly different examples so that you still get to figure things out on your own! Then at the end of the day we’ll go over the challenge, talk through your solutions, and compare them to our suggestions and experience working with SQL Servers around the world.

Challenge 2: Digging into TSQL, Queries, and Execution Plans

On the second day of the training, you get to specialize on query tuning challenges: you have 3 queries that you need to make faster. You’ll get the TSQL, schema, indexes, and execution plan information for all three queries and your challenge is to figure out how YOU would make these faster.

Always get involved in your learning and question everything. (Isn't that what my face says here?)

Always get involved in your learning and question everything. (Isn’t that what my face says here?)

On this day you’ll watch Kendra, Brent, and Jeremiah dig in deep to tune queries and indexes and delve into more advanced features of execution plans in SQL Server.

At the end of the day you’ll revisit the challenge. Would you make any choices differently? How do your ideas compare to the solutions we came up with? Can you make the queries even faster than we can!?!

Get involved in your training

Consulting and teaching have taught me a huge lesson: people learn better and faster when they’ve got something fun to work on. We have a blast teaching people about SQL Server — join us in Seattle and take the challenge.

Video: Test Your Index Design Skills (with poll results)

You know how to design indexes, but you’re not sure how good your skills really are. In this quiz-packed session you’ll get a chance to test your skills! Kendra Little will walk you through a set of index design challenges. You’ll have time to answer each problem on your own, then find out whether the SQL Server engine thinks you made the right choice. At the end of the session you’ll get to tally your score (but you can keep it a secret if you like).

Take the quiz while you watch the video. If you’d like to dive straight into the questions, skip to 2:30.

Compare your answers to the group

Our webcast had 416 attendees and 80% of people answered each question. To find out the right answers (and why) watch the video above, but here’s the stats on how webcast viewers answered each question.

This is NOT the answer key– don’t assume the majority of people got every question right! This just lets you compare what you thought to what others thought.

Q1) A NonClustered Index is…

A) 4.39%
B) 19.26%
C) 73.65%
D) 2.70%

Q2) What will this query probably due given this table definition?

A) 13.92%
B) 70.55%
C) 9.06%
D) 6.47%

Q3) What is most likely the biggest problem with this table?

A) 4.35%
B) 69.93%
C) 17.39%
D) 8.33%

Q4) INCLUDED columns are…

A) 14.33%
B) 77.88%
C) 5.30%
D) 2.49%

Q5) A unique constraint is…

A only) 11%
A and B) 11%
A, B, and C) 16%
A, B, C, and D) 14%
A and C) 41%
A, C and D) 8%
A and D) 4%

(There were some other combos, but they all got low percentages)

Q6) Which one of these statements is ALWAYS true?

A) 16.77%
B) 36.13%
C) 14.52%
D) 32.58%

Q7) How many indexes were written to?

0) 2.33%
1) 17.00%
2) 46.67%
3) 34.00%

Q8) Why wouldn’t SQL Server Ask for an Index?

A) 22.73%
B) 45.45%
C) 26.22%
D) 5.59%


Performance Tuning SQL Server Transactional Replication: A Checklist


Swim faster!

SQL Server transactional replication is a blessing and a curse. It’s a great developer tool that lets you scale out data to multiple servers, even using Standard Edition. But as your business picks up, your datasets get larger, and your customers grow more demanding, replication can start to fall behind. You need to learn how to tune it to keep up.

Before You Start Tuning Replication….

Make sure you can prove when changes you make to replication improve performance. Or find out quickly if you make something worse. (It’s gonna happen.)

If I’m going to make changes with something as complex as replication, here are my basic requirements:

  • Monitoring must alert the DBA team replication latency exceeds allowed thresholds
  • Monitoring needs to track historical latency to show if my changes reduce latency
  • I need a production-like staging environment to test my changes.

If you haven’t configured monitoring for transactional replication, read how to do it here. The “easy” and “medium” steps are a small amount of work and are incredibly useful.

Don’t Skip “Normal” SQL Server Performance Tuning!

I’m going to give you a lot of transactional replication specific performance tuning tips in this post. But don’t skip other elements of SQL Server performance tuning! Wait statistics, virtual file stats, and identifying bottlenecks are still important. Get started with SQL Server performance tuning here.

1) Are You Using the right versions of Windows and SQL Server for Replication?

For replication performance, you want to be on Windows Server 2008 and SQL Server 2008 minimum. It really makes a difference.

2) Have You Scaled up your distributor?

When replication performance is important, use a dedicated distributor server so that your distributor doesn’t have to fight with a publisher or subscriber for CPU, memory, network, or storage resources. If you need high availability for the distribution database, you have limited options: failover clustering is pretty much the only way to go.

3) Is Replication really what you need?

Check your assumptions. Could you get by with Transaction Log Shipping? Are you a good candidate for Availability Groups?

4) Are You Using the right type of subscriptions and tools for replication over the WAN?

Argenis Fernandez shares what he learned from tuning transactional replication over wide area networks. This is a great use of “pull” subscriptions.

5) Have You Made sure Articles are NOT in more than one publication?

Kendal Van Dyke shows that having articles in multiple publications can also magnify the number of commands in your distribution database. That bloats your distribution database and will slow you way down as activity picks up.

(Note: If you’re using row-filtering on your articles, you may be the exception to this rule.)

6) Do You Only run Snapshots when you need them?

I’ve come across cases where the Snapshot agents for a publication were set to run on a schedule, even when replication wasn’t being initialized. I believe that when someone was setting replication up, they had checked off the option to “Schedule the Snapshot Agent to run at the following times” without realizing that it wasn’t needed. Don’t run snapshots on a schedule, it will lock up the publishing database. (When you open “Job Activity Monitor” these jobs show up with the category “REPL-Snapshot”.)

7) Are you Using “Immediate Sync” to your Advantage?

The immediate sync option is hard to spot when you first set up replication. Setting this to false can help minimize the impact of running a replication snapshot if you need to add new articles, or even remove and re-add a few articles. Learn more about it here. As always, test your changes outside of production first! (I personally have a fear of having immediate sync set to true because of this old bug from SQL Server 2005.)

For information on how the immediate_sync can also impact your distribution database, read more here.

Thanks to Allen McGuire for his comment reminding us on the benefits of this setting!

8) Are You Replicating only the articles and columns you need on the Subscriber?

Don’t just “add all.” For scalability, replicate only the articles that must be in replication, and only the columns that need to be replicated. This not only helps overall performance, this reduces the impact of times when you may need to re-initialize replication. This goes especially for large object / LOB data types. If you must replicate LOB types, Microsoft recommends that you use newer types such as nvarchar(max), varbinary(max), etc.

9) Do You Set the ‘Replicate Schema Changes’ subscription option to false when needed?

New columns being added to a published article shouldn’t be replicated to the subscriber unless they really need to be there. You can turn off the replication of schema changes by setting the ‘Replicate Schema Changes’ subscription option to ‘false’. (It defaults to ‘true’.)

10) Have You Considered static row filters?

“Static row filters” allow you to include only certain rows in a given publication. One gotcha: the row-filter is only evaluated when the row is inserted, not when the row is updated, so you really want this to be a value that doesn’t change. There is overhead to applying the row filter itself: Microsoft only recommends you use the row filters if your replication setup can’t handle replicating the full dataset. Be careful with indexing if you use row filters.

11) Have You Optimized your subscriber databases for re-initialization?

Face it: re-initialization happens. Unfortunately it usually happens when you least expect it and had plans to be doing something else. There are a few things you can do to keep re-initialization from making your publication database unusuable for long periods.

  • Isolate your replication subscriber tables into their own database, and only keep replicated articles in there. This also typically helps you use use recovery models that are optimized for minimal logging in that database to speed up bulk inserts. Consider using synonyms to quickly “repoint” to replicated articles to give you flexibility.
  • Evaluate whether initializing replication from backup could help.

12) Have you Considered Using multiple publications?

There’s pros and cons to splitting out publications. Here’s the pros:

  • You can isolate large tables that are the biggest problems for snapshotting into their own publications so that they get their own snapshots. That can be helpful if there are other tables you might need to remove and re-add to replication more frequently. (The immediate_sync setting can also help with this, see #7 above.)
  • This will give you multiple distribution agents so changes can be applied to your subscribers in parallel. (This one’s a pro and a con, as you’ll see.)


  • This is more work to manage. You should be checking in scripts for your entire replication into source and have documentation on everything. More publications makes that whole process harder.
  • All those distribution agents can backfire if you don’t have the resources to support them working on the subscriber at the same time.
  • Be mindful not to put non row-filtered articles in more than one publication as noted above.

13) Are “Subscription Streams” Right for You? (or not?)

This option allows you to raise the number of connections that the distribution agents use to apply changes to the subscriber. But there’s overhead to managing all these threads, and you can get into situations where transactions aren’t fully consistent if you hit problems.

This feature is primarily recommended for use on situations where you have high network latency and are not changing the setting often. Keep in mind that if you’re splitting your articles into multiple publications for other reasons, you’ve already got multiple distribution agents running in parallel.

14) Are You replicating non-clustered indexes blindly?

Confession: I did this wrong for years. It’s very easy to set up transactional replication and send all the indexes over to the subscriber: you just set “Copy nonclustered indexes” in the articles property to “true”. But you’re only required to replicate the Primary Key and unique indexes. There’s two big problems with replicating all the nonclustered indexes:

  • It can make re-initialization slower. By default the subscriber will have objects created, bulk load the articles, create “extra” nonclustered indexes, then “catch up” on any changes that came in after the snapshot was pushed. You definitely want to make sure that all “extra” nonclustered indexes are disabled or don’t exist while that bulk load is happening. But if a lot may have changed since the snapshot ran, you may not want the indexes to be created until the very end, anyway. Handling the nonclustered index creation outside of replication gives you that flexibility.
  • It’s very rare for the exact same queries to run on the publisher and subscriber. You usually want nonclustered indexes that are specific to the workload on the subscriber, anyway.

Identify the “extra” nonclustered indexes specific to the queries that run on the subscriber. Script them out, check them into your source control, and have a process to deploy them whenever replication needs to be re-initialized.

15) Could publishing stored procedure execution work?

If your workload is run entirely by stored procedures, this can be a great option. (FYI, there is a bug/hotfix for this in SQL Server 2012/2014 listed below.)

16) Are You Using Read Committed Snapshot Isolation on replication subscribers?

It’s common for the distribution agent to have to fight with other processes while it tries to insert, update, and delete rows in the subscriber database. One DBA that I worked with removed a huge amount of blocking and speeded up processing by using RCSI on a subscriber database. Learn more about RCSI here.

17) Have You Ruled Out Virtual Log File fragmentation on the publisher database?

If you’ve got more than 10K virtual log files on your publication database, it could slow down replication. I’ve particularly seen this cause replication to get behind when a large operation like an index rebuild was run. Our free sp_Blitz® script will diagnose high numbers of VLFs for you.

18) You Haven’t Been Randomly Fiddling with Settings on the Distribution Agent, Have You?

There’s a lot of little settings you can change on the Agent Profiles in replication. I’m not a huge fan of changing them unless you can prove they helped your performance, though. These are typically “fine tuning” settings after you have the right architecture in place, in my experience.

19) Have You Looked Out for Replication Bugs / Fixes?

Man, these can get you. Like any other complicated tool, things can go wrong. Here’s a few highlights:

  • KB 2674882 – Deadlocking distribution agents OR replication queries with very high memory grants. This can occur on SQL Server 2005, 2008, or 2008R2.
  • Unexpectedly inactive subscriptions. There are many performance reasons to upgrade from  SQL Server 2005, but if you must be on it then you shouldn’t run anything less than SP4.
  • KB 2958429 - Service Packs Matter. SQL Server 2012 SP2 added some features to replication logging (and even a few improvements to Peer to Peer replication, oddly enough). Scroll to “Highlights for SQL Server 2012 SP2″ and expand “Replication” to see the list. If you apply this service pack, you may also want to apply KB 2969896.
  • KB 2897221 - Stack dumps/ non-yielding schedulers if you’re replicating stored procedures. SQL Server 2012 or 2014.
  • KB 949296 – Replication Agents cannot run when you have many agents and the Windows desktop heap is “used up”. (Thanks to Michael Bourgon for suggesting we link to this one.)
  • The Log Reader Agent may fail to start on SQL Server 2012 Service Pack 1.

Public Safety ANNOUNCEMENT: Replication Needs Special Treatment for Hotfixes and Upgrades

With any hotfixes, it’s always good to review KB 941232, “How to apply a hotfix for SQL Server in a replication topology.” (You’ll need special steps if you’re using Database Mirroring or AlwaysOn Availability Groups.)

Be careful with the steps you take to upgrade and ‘drain’ replicated transactions. If you don’t do this, in some cases you may have big problems during an upgrade.

Got a Transactional Replication Horror Story or Recommendation?

Or do you have a favorite setting I haven’t mentioned here, or something you disagree with? Replication is full of controversy! Share it with us in the comments.

Stabilizing Execution Plans: Plan Guides and NORECOMPUTE

Plan Guides are like duct tape

What could go wrong?

Sometimes you end up in a good plan / bad plan situation: an important query runs just fine most of the time. The query is parameterized, a good execution plan gets re-used, everything is cool.

But sometimes, a “bad plan” gets compiled and starts to be reused. This is “bad” parameter sniffing. “Bad plans” can come in a few varieties: maybe it’s slow some parameter combinations and can cause page timeouts sometimes. Maybe the “bad” query plan has a very large workspace memory grant that just isn’t needed, and it causes problems because lots of different queries are using it — then you get all sorts of nasty Resource Semaphore waits and everything gets slow.

Whatever the situation is, sometimes you want to stabilize a particular execution plan that’s “good” for all the different parameters that the query can run with.

Option 1: Change the code

The very best option is changing the code so you don’t have to resort to shenanigans behind the scenes. You can rewrite the TSQL, change indexes, or use hints to get a specific plan. But sometimes this is difficult to do: maybe it’s vendor code you can’t change. Maybe there’s a long code release process and it will take a very long time to get the code changed.

If you can tune the code, absolutely do it. If you can’t, at least get the request to fix the code noted by the vendor or software development team. Don’t skip it altogether, because the options I describe below aren’t all that fantastic.

Option 2: Plan guide that thing

Plan guides are like duct tape: it’s something you want to have on hand for emergency quick fixes, but you don’t want to rely on it long term as a building material. It’s also not suited for every kind of fix.

Plan guides let you do a few things:

  • Apply query hints like “optmize for value”, “optimize for unknown”, “recompile”, and “maxdop” to a query
  • Selectively turn on trace flags for a query, such as TF 4199 (performance optimizer changes), TF 9481(Older cost-based optimizer if running on SQL Server 2014), TF 2312 (newer cost-based optimizer if running on SQL Server 2014)
  • Add selected table hints, like forceseek and specific index hints. (You cannot add NOLOCK or change isolation levels, don’t get too excited.)
  • “Template” plan guides let you “force parameterize” a specific query, or enforce “simple parameterization” for a certain query if the database is using forced parameterization

But duct tape isn’t perfect. Here’s the biggest gotchas that I’ve found with plan guides:

  • I’ve found that trying to use an index hint in a plan guide can cause queries to silently fail. That’s awkward.
  • Plan guides don’t work with all types of queries. I haven’t been able to get them to work with temporary tables or table variables referenced in the query, for example.
  • Making sure that a plan guide is working and is picking up your query is tricky. Plan guides are very picky about matching query text exactly.
  • Plan guides can make code releases fail. If you’ve got a plan guide that references a stored procedure and something goes to alter it, SQL Server’s going to stop the ALTER with an error.

My biggest advice for plan guides: test them out on a non production system first. Verify that the plan guide is working and that the queries are doing exactly what you want before deploying to production. Treat the plan guide like real code as much as possible: put it into all environments, check it into source, use change control, and document it.

Option 3: Manually Force the “Right” Parameter Sniffing

If you can’t change the code and a plan guide doesn’t work, you can get a little creative. (By “get a little creative”, I mean that everything I’m about to describe can go horribly wrong.)

First, find the bad plan in cache. In SQL Server 2008 and higher, we get a nifty plan_hash for query plans. You can identify the “bad plan hash” that you don’t want to keep in cache. When it’s in cache, you then:

  1. Remove it from the cache. You can do this using DBCC FREEPROCCACHE and the plan_handle value (you can get this by running: sp_BlitzCache @results=’expert’). Or if it’s a stored procedure, you can use the sp_recompile procedure against the stored procedure to blow away the plan. (Thanks @DBArgenis for this tip!)
  2. Run a version of the query that puts the plan you want to be in cache. Usually this means running a version of the query with the parameters that give you the “right” plan.

You’ve got the burden of doing a some testing. Be careful with how you remove plans from cache: sp_recompile requires a schema level (exclusive) lock, so I don’t recommend running that against tables.

You need to generate the plan in a way that’s going to be re-used and make sure it works. You need to make sure that the plan you’re putting in cache really is good for re-use!

Stabilizing execution plans with NORECOMPUTE

NORECOMPUTE - flimsy tape

NORECOMPUTE – flimsy tape

If this doesn’t sound quite crazy enough for you, you can go a little farther and try to increase the chances of your “good” plan staying in cache longer. One tool you can use for this is NORECOMPUTE.

When you update statistics on a table with NORECOMPUTE, you tell SQL Server not to automatically update statistics on the table as data changes. This will automatically happen when approximately 20% of the rows in the table have been modified (the algorithm is more complicated, but 20% is pretty easy to remember). Updated statistics will cause the optimizer to consider a new execution plan for your query. So NORECOMPUTE reduces the chance of the “good” query being bumped out.

If you use NORECOMPUTE, be aware that this could have a negative effect on some queries and cause them to get a terrible estimate on queries that they’re running. You probably want to manually update statistics for the table at least once a day if data changes in it. You can do this using a built in command like sp_updatestats, custom code you write yourself, or Ola Hallengren’s free index maintenance scripts (see Examples B & D for two options). Just please don’t do it with a maintenance plan.

Document the heck out of this. It’s easy for people to have no idea it’s in place, or find it and change it without knowing what it impacts. If plan guides are duct tape, this is more like Scotch Tape.

These Are EMergency Fixes – Don’t Start Your Performance Tuning Process With These Techniques

A lot of times you don’t need any of this crazy stuff. Remember: many times you can fix these issues with good indexing or simple code changes. Safety first!

Learn More in Our Execution Plan Training

Our How to Read Execution Plans Training explains how to get an execution plan, how to read the operators, and learn solutions to common query problems.

Monitoring SQL Server Transactional Replication


“It will catch up eventually. I hope.”

It was the best of times, it was the worst of times. I was a SQL Server DBA, and if something went wrong in Transactional Replication I needed to find out about it right away and help keep things healthy, day or night. Here’s what I learned from that experience about monitoring replication.

If you’re just getting started and need an introduction to transactional replication, head over here.

Tracer Tokens Aren’t Really Your Friend

“Tracer Tokens” were introduced in SQL Server 2005. They sound awfully good. Books Online explains that you can automate them using sys.sp_posttracertoken and report on them using sp_helptracertokenhistory.

There’s a big problem: tracer tokens are too patient.

Let’s say my replication is incredibly overwhelmed and I send out a tracer token. I won’t hear back until it reaches its destination or definitively fails. That could be a very, very long time. The fact that it’s potentially unknown means I don’t want to rely heavily on it for monitoring.

Don’t Rely Too Much on Replication Monitor (REPLMON.exe)

When replication is behind, it’s natural to turn to Replication Monitor. The first five links in “Monitoring Replication” in Books Online point to it, after all.

Replication Monitor isn’t all bad. But don’t depend on it too much, either.

  • Replication Monitor is a tool to help you answer the question “how are things doing right now?” It doesn’t baseline or give the kind of historical info that your manager wants to see.
  • Replication Monitor may run queries to count the number of undistributed commands that may take a while to run and be performance intensive (particularly when things get backed up in the distributor).

I’ve personally seem some cases where running more than one instance of Replication Monitor while a publication snapshot was being taken also caused blocking. Too many people checking to see “how much longer will this take?” actually caused things to take longer. It’s not just me, Microsoft recommends you avoid running multiple instances of Replication Monitor.

ReplMon protip: You can disable automatic refreshing for the Replication Monitor UI, and just refresh the data when you need it. More info in Books Online here. (Thanks to John Samson for this tip.)

Replication Monitor is useful, but you’re better off if people can get information on replication health without everyone having to run Replmon. You can do this fairly easily by using simpler tools to create dashboards to chart replication latency.

Easy Replication Monitoring: Alert on Latency with Canary Tables

It’s easy to build your own system for tracking replication latency for each publication. Here are the ingredients for the simplest version:

  • Add a table named dbo.Canary_PubName to each publication
  • dbo.Canary_PubName has a single row with a datetime column in it
  • A SQL Server Agent job on the publisher updates the datetime to the current timestamp every minute
  • A SQL Server Agent job on the subscriber checks dbo.Canary_PubName every minute and alerts if the difference between the current time and the timestamp is greater than N minutes

It’s very simple to extend this to a simple dashboard using a third party monitoring tool or SQL Server Reporting Services: you simply poll all the dbo.Canary tables and report on the number of minutes of latency on each server.

This simple process gets around the weaknesses of tracer tokens, and also gives you immediate insight into how much latency you have on each subscriber. Bonus: this exact same technique also works well with logshipping and AlwaysOn Availability Groups. Tastes great, less filling.

Medium Replication Monitoring: Notify when Undistributed Commands Rise in the Distribution Database

The distribution database is a special place for Transactional Replication. The log reader agent pulls information on what’s changed from the transaction log of the publication database and translates it into commands that hang out in the distribution database before the changes go out to subscribers.

If you have a lot of data modification occurring on the publisher, you can get a big backup of commands in the distribution database.

If replication performance is important, set up a SQL Server Agent job on your distribution server to regularly check the amount of undistributed commands with a script like Robert Davis provides here. Have it alert you when the commands go above a given threshold.

Real world example: When I was the DBA for an environment with mission-critical replication, we would warn when undistributed commands rose above 500K and create a severity-1 ticket when they rose above 1 million. We did this after setting up dashboards to baseline replication latency and also baselining the amount of undistributed commands in distribution, so that we knew what our infrastructure could recover from and what might need DBA attention to recover in time.

Difficult Replication Monitoring: Alert When Individual Articles are Unhealthy

Here’s where things get tricky. It’s very difficult to prove that all articles in replication are healthy. The steps up to this point have tracked latency for the entire publication and bottlenecks in the distribution database.Things get pretty custom if you need to prove that individual tables are all up to date.

I once had a situation where a code release removed some articles from replication, modified the tables and data significantly, then re-added the articles to replication.

There was an issue with the scripts and one of the articles didn’t get put back into replication properly at the end of the process. Replication was working just fine. No script had explicitly dropped the table from the subscriber, so it just hung out there with stale data. The problem wasn’t discovered for a few days, and it was a bit difficult to track down. Unfortunately, the next week was kind of a downer because a lot of data had to be re-processed after that article was fixed.

Here’s what’s tricky: typically some articles change much more often than others. Monitoring individual articles typically requires baselining “normal” latency per article, then writing custom code that checks each article against the allowed latency. This is significantly more difficult for any large articles that don’t have a “Last Modified Date” style column.

(Disclaimer: in the case that you don’t have a “Last Modified” date on your subscriber, I do not suggest layering Change Tracking on top of the replication subscriber. If you are tempted to do that, first read my post on Performance Tuning Change Tracking, then go through all the steps that you would do if you needed to re-initialize replication or make schema changes on articles. You’ll change your mind by the end.)

Special Cases: The “Desktop Heap” is Used Up

This is a special case for replication. If you have a large amount of replication agents on a single server (such as 200 or more), you may run into issues where things just silently stop working due to desktop heap exhaustion. This is an issue that can be hard to identify because the agents just stop working!

Canary tables can help monitor for this, but you’ll need a lot of them since this can happen on an agent-by-agent basis. Read more about fixing desktop heap problem in replication in KB 949296. (Thanks to Michael Bourgon for suggesting we include this.)

Test Your Monitoring out in Staging

The #1 mistake I find with transactional replication is ignoring the staging environment. This is critical to supporting replication and creating effective monitoring for it.

The staging environment isn’t the same thing as development or QA. It’s a place where you have the same number of SQL Server instances as production, and the same replication setup as production. You test changes against staging before they go to production. You can also use it to test replication changes.

Staging is also where you confirm that your replication monitoring works. Data probably doesn’t constantly change in your staging environment, but that’s OK. Use canary tables and get creative to simulate load for test purposes.

Do You Have a Technique for Monitoring Replication Not Listed Here?

Tell us about it in the comments!