
[Image: a salmon swimming upstream. Caption: “Swim faster!”]

SQL Server transactional replication is a blessing and a curse. It’s a great developer tool that lets you scale out data to multiple servers, even using Standard Edition. But as your business picks up, your datasets get larger, and your customers grow more demanding, replication can start to fall behind. You need to learn how to tune it to keep up.

Before You Start Tuning Replication…

Make sure you can prove when changes you make to replication improve performance. Or find out quickly if you make something worse. (It’s gonna happen.)

If I’m going to make changes with something as complex as replication, here are my basic requirements:

  • Monitoring must alert the DBA team when replication latency exceeds allowed thresholds
  • Monitoring needs to track historical latency to show whether my changes reduce latency
  • I need a production-like staging environment to test my changes

If you haven’t configured monitoring for transactional replication, read how to do it here. The “easy” and “medium” steps are a small amount of work and are incredibly useful.
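
Tracer tokens are a handy built-in way to measure latency while you get full monitoring in place. Here’s a minimal sketch, assuming a transactional publication named SalesPub (a hypothetical name); post the token in the publication database on the publisher, then read the results in the distribution database:

    -- Run in the publication database on the publisher.
    EXEC sys.sp_posttracertoken @publication = N'SalesPub';

    -- Run in the distribution database to see how far each token got.
    SELECT t.publisher_commit,
           h.distributor_commit,
           h.subscriber_commit
    FROM dbo.MStracer_tokens AS t
    JOIN dbo.MStracer_history AS h
        ON h.parent_tracer_id = t.tracer_id
    ORDER BY t.publisher_commit DESC;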

Don’t Skip “Normal” SQL Server Performance Tuning!

I’m going to give you a lot of transactional replication specific performance tuning tips in this post. But don’t skip other elements of SQL Server performance tuning! Wait statistics, virtual file stats, and identifying bottlenecks are still important. Get started with SQL Server performance tuning here.

1) Are You Using the Right Versions of Windows and SQL Server for Replication?

For replication performance, you want to be on Windows Server 2008 and SQL Server 2008 minimum. It really makes a difference.

2) Have You Scaled Up Your Distributor?

When replication performance is important, use a dedicated distributor server so that your distributor doesn’t have to fight with a publisher or subscriber for CPU, memory, network, or storage resources. If you need high availability for the distribution database, you have limited options: failover clustering is pretty much the only way to go.

3) Is Replication Really What You Need?

Check your assumptions. Could you get by with Transaction Log Shipping? Are you a good candidate for Availability Groups?

4) Are You Using the Right Type of Subscriptions and Tools for Replication Over the WAN?

Argenis Fernandez shares what he learned from tuning transactional replication over wide area networks. This is a great use of “pull” subscriptions.

5) Have You Made Sure Articles Are NOT in More Than One Publication?

Kendal Van Dyke shows that having articles in multiple publications can magnify the number of commands in your distribution database. That bloats your distribution database and will slow you way down as activity picks up.

(Note: If you’re using row-filtering on your articles, you may be the exception to this rule.)
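
To get a rough sense of whether duplicated articles are bloating your distribution database, you can count pending commands per article. A sketch to run in the distribution database (the join is simplified; on topologies with multiple publisher databases you’d also match on the publisher database):

    -- Approximate pending command counts per article.
    SELECT a.publication_id,
           a.article,
           COUNT_BIG(*) AS pending_commands
    FROM dbo.MSrepl_commands AS c
    JOIN dbo.MSarticles AS a
        ON a.article_id = c.article_id
    GROUP BY a.publication_id, a.article
    ORDER BY pending_commands DESC;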

6) Do You Only Run Snapshots When You Need Them?

I’ve come across cases where the Snapshot Agents for a publication were set to run on a schedule, even when replication wasn’t being initialized. I believe that when someone set replication up, they checked the option to “Schedule the Snapshot Agent to run at the following times” without realizing it wasn’t needed. Don’t run snapshots on a schedule: generating a snapshot locks up the publishing database. (When you open “Job Activity Monitor,” these jobs show up with the category “REPL-Snapshot”.)
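
To spot Snapshot Agent jobs that are still on a schedule, you can check msdb directly. A minimal sketch:

    -- Run in msdb: Snapshot Agent jobs with an enabled schedule attached.
    SELECT j.name AS job_name,
           s.name AS schedule_name
    FROM dbo.sysjobs AS j
    JOIN dbo.syscategories AS c ON c.category_id = j.category_id
    JOIN dbo.sysjobschedules AS js ON js.job_id = j.job_id
    JOIN dbo.sysschedules AS s ON s.schedule_id = js.schedule_id
    WHERE c.name = N'REPL-Snapshot'
      AND s.enabled = 1;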

7) Are You Using “Immediate Sync” to Your Advantage?

The immediate sync option is hard to spot when you first set up replication. Setting this to false can help minimize the impact of running a replication snapshot if you need to add new articles, or even remove and re-add a few articles. Learn more about it here. As always, test your changes outside of production first! (I personally have a fear of having immediate sync set to true because of this old bug from SQL Server 2005.)

For information on how immediate_sync can also impact your distribution database, read more here.

Thanks to Allen McGuire for his comment reminding us of the benefits of this setting!
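
For reference, here’s the shape of the change, as a sketch against a hypothetical publication named SalesPub. Note that allow_anonymous has to be turned off before immediate_sync can be:

    -- Run in the publication database. Test outside production first!
    EXEC sp_changepublication
        @publication = N'SalesPub',
        @property    = N'allow_anonymous',
        @value       = N'false';

    EXEC sp_changepublication
        @publication = N'SalesPub',
        @property    = N'immediate_sync',
        @value       = N'false';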

8) Are You Replicating Only the Articles and Columns You Need on the Subscriber?

Don’t just “add all.” For scalability, replicate only the articles that must be in replication, and only the columns that need to be replicated. This not only helps overall performance, it also reduces the impact of the times when you need to re-initialize replication. This goes especially for large object (LOB) data types. If you must replicate LOB types, Microsoft recommends that you use the newer types such as nvarchar(max) and varbinary(max).
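
If an article is already published, you can trim a column with sp_articlecolumn instead of dropping and re-adding the whole article. A sketch with hypothetical publication, article, and column names:

    -- Run in the publication database: stop replicating one column.
    -- Depending on your setup you may also need @force_reinit_subscription = 1.
    EXEC sp_articlecolumn
        @publication = N'SalesPub',
        @article     = N'Orders',
        @column      = N'InternalNotes',
        @operation   = N'drop',
        @force_invalidate_snapshot = 1;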

9) Do You Set the ‘Replicate Schema Changes’ Subscription Option to False When Needed?

New columns being added to a published article shouldn’t be replicated to the subscriber unless they really need to be there. You can turn off the replication of schema changes by setting the ‘Replicate Schema Changes’ subscription option to ‘false’. (It defaults to ‘true’.)
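
Under the covers this is the publication’s replicate_ddl property. A sketch with a hypothetical publication name:

    -- Run in the publication database: stop replicating schema changes.
    EXEC sp_changepublication
        @publication = N'SalesPub',
        @property    = N'replicate_ddl',
        @value       = N'0';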

10) Have You Considered Static Row Filters?

“Static row filters” allow you to include only certain rows in a given publication. One gotcha: the row filter is only evaluated when the row is inserted, not when the row is updated, so you really want this to be a value that doesn’t change. There is overhead to applying the row filter itself: Microsoft recommends you use row filters only if your replication setup can’t handle replicating the full dataset. Be careful with indexing if you use row filters.
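
A static row filter is just a WHERE clause on the article. A sketch with hypothetical publication, table, and filter values:

    -- Run in the publication database when adding the article.
    EXEC sp_addarticle
        @publication   = N'SalesPub',
        @article       = N'Orders',
        @source_owner  = N'dbo',
        @source_object = N'Orders',
        @filter_clause = N'[Region] = ''West''';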

11) Have You Optimized Your Subscriber Databases for Re-initialization?

Face it: re-initialization happens. Unfortunately, it usually happens when you least expect it and had planned to be doing something else. There are a few things you can do to keep re-initialization from making your publication database unusable for long periods.

  • Isolate your replication subscriber tables into their own database, and keep only replicated articles in there. This typically lets you use a recovery model optimized for minimal logging in that database to speed up bulk inserts. Consider using synonyms to quickly “repoint” to replicated articles for flexibility (see the sketch after this list).
  • Evaluate whether initializing replication from backup could help.
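
Here’s a minimal sketch of the synonym trick, with hypothetical database and table names. Application queries reference the synonym, so a freshly initialized copy can be swapped in with a quick metadata change:

    -- Point application code at a synonym rather than the replicated table.
    CREATE SYNONYM dbo.Orders FOR ReplSubscriber.dbo.Orders;

    -- During re-initialization, repoint to a freshly loaded copy.
    DROP SYNONYM dbo.Orders;
    CREATE SYNONYM dbo.Orders FOR ReplSubscriber_Staging.dbo.Orders;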

12) Have You Considered Using Multiple Publications?

There are pros and cons to splitting out publications. Here are the pros:

  • You can isolate large tables that are the biggest problems for snapshotting into their own publications so that they get their own snapshots. That can be helpful if there are other tables you might need to remove and re-add to replication more frequently. (The immediate_sync setting can also help with this; see #7 above.)
  • This will give you multiple distribution agents so changes can be applied to your subscribers in parallel. (This one’s a pro and a con, as you’ll see.)

Cons:

  • This is more work to manage. You should be checking scripts for your entire replication setup into source control and documenting everything. More publications make that whole process harder.
  • All those distribution agents can backfire if you don’t have the resources to support them working on the subscriber at the same time.
  • Be mindful not to put non row-filtered articles in more than one publication as noted above.

13) Are “Subscription Streams” Right for You? (Or Not?)

This option allows you to raise the number of connections that the distribution agents use to apply changes to the subscriber. But there’s overhead to managing all these threads, and you can get into situations where transactions aren’t fully consistent if you hit problems.

This feature is primarily recommended for situations where you have high network latency and aren’t changing the setting often. Keep in mind that if you’re splitting your articles into multiple publications for other reasons, you’ve already got multiple distribution agents running in parallel.
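
Subscription streams are set with the Distribution Agent’s -SubscriptionStreams parameter on the agent job step’s command line. A sketch of the relevant fragment (server and database names are hypothetical, and 4 is an arbitrary example value):

    -Subscriber [SUB01] -SubscriberDB [SalesSub] -SubscriptionStreams 4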

14) Are You Replicating Nonclustered Indexes Blindly?

Confession: I did this wrong for years. It’s very easy to set up transactional replication and send all the indexes over to the subscriber: you just set “Copy nonclustered indexes” in the article properties to “true”. But you’re only required to replicate the primary key and unique indexes. There are two big problems with replicating all the nonclustered indexes:

  • It can make re-initialization slower. By default the subscriber will have objects created, bulk load the articles, create “extra” nonclustered indexes, then “catch up” on any changes that came in after the snapshot was pushed. You definitely want all “extra” nonclustered indexes disabled or absent while that bulk load is happening. And if a lot has changed since the snapshot ran, you may not want the indexes created until the very end anyway. Handling nonclustered index creation outside of replication gives you that flexibility.
  • It’s very rare for the exact same queries to run on the publisher and subscriber. You usually want nonclustered indexes that are specific to the workload on the subscriber, anyway.

Identify the “extra” nonclustered indexes specific to the queries that run on the subscriber. Script them out, check them into your source control, and have a process to deploy them whenever replication needs to be re-initialized.
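
Something like this sketch can inventory the “extra” indexes on the subscriber; it lists nonclustered, non-unique indexes, which are the ones replication doesn’t require:

    -- Run in the subscriber database: candidate indexes to script out.
    SELECT OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
           OBJECT_NAME(i.object_id) AS table_name,
           i.name AS index_name
    FROM sys.indexes AS i
    WHERE i.type = 2                                      -- nonclustered
      AND i.is_unique = 0                                 -- excludes PKs and unique indexes
      AND OBJECTPROPERTY(i.object_id, 'IsMSShipped') = 0  -- skip system objects
    ORDER BY schema_name, table_name, index_name;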

15) Could Publishing Stored Procedure Execution Work?

If your workload is run entirely by stored procedures, this can be a great option. (FYI, there is a bug/hotfix for this in SQL Server 2012/2014 listed below.)
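
Publishing execution is set per article. A sketch with hypothetical names; ‘serializable proc exec’ is the safer variant because the execution only replicates when it happens inside a serializable transaction:

    -- Run in the publication database.
    EXEC sp_addarticle
        @publication   = N'SalesPub',
        @article       = N'usp_CloseOrders',
        @source_owner  = N'dbo',
        @source_object = N'usp_CloseOrders',
        @type          = N'serializable proc exec';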

16) Are You Using Read Committed Snapshot Isolation on Replication Subscribers?

It’s common for the distribution agent to have to fight with other processes while it tries to insert, update, and delete rows in the subscriber database. One DBA that I worked with removed a huge amount of blocking and sped up processing by using RCSI on a subscriber database. Learn more about RCSI here.
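
Enabling RCSI is a single database-level change on the subscriber (hypothetical database name; the ALTER needs exclusive access for a moment, so schedule it carefully):

    ALTER DATABASE SalesSub
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;  -- kicks out in-flight sessions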

17) Have You Ruled Out Virtual Log File Fragmentation on the Publisher Database?

If you’ve got more than 10K virtual log files on your publication database, it could slow down replication. I’ve particularly seen this cause replication to get behind when a large operation like an index rebuild was run. Our free sp_Blitz® script will diagnose high numbers of VLFs for you.
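
You can also count VLFs yourself with DBCC LOGINFO, which returns one row per VLF. A sketch (the column list matches SQL Server 2012 and later; drop RecoveryUnitId on older builds):

    -- Run in the publication database.
    CREATE TABLE #vlf (
        RecoveryUnitId INT, FileId INT, FileSize BIGINT, StartOffset BIGINT,
        FSeqNo BIGINT, [Status] INT, Parity INT, CreateLSN NUMERIC(38,0));
    INSERT #vlf EXEC ('DBCC LOGINFO');
    SELECT COUNT(*) AS vlf_count FROM #vlf;  -- 10K+ is a red flag
    DROP TABLE #vlf;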

18) You Haven’t Been Randomly Fiddling with Settings on the Distribution Agent, Have You?

There are a lot of little settings you can change on the agent profiles in replication. I’m not a huge fan of changing them unless you can prove they helped your performance, though. In my experience, these are typically “fine tuning” settings to adjust after you have the right architecture in place.

19) Have You Looked Out for Replication Bugs / Fixes?

Man, these can get you. Like any other complicated tool, things can go wrong. Here are a few highlights:

  • KB 2674882 – Deadlocking distribution agents OR replication queries with very high memory grants. This can occur on SQL Server 2005, 2008, or 2008 R2.
  • Unexpectedly inactive subscriptions. There are many performance reasons to upgrade from SQL Server 2005, but if you must be on it, you shouldn’t run anything less than SP4.
  • KB 2958429 – Service packs matter. SQL Server 2012 SP2 added some features to replication logging (and even a few improvements to Peer-to-Peer replication, oddly enough). Scroll to “Highlights for SQL Server 2012 SP2” and expand “Replication” to see the list. If you apply this service pack, you may also want to apply KB 2969896.
  • KB 2897221 – Stack dumps / non-yielding schedulers if you’re replicating stored procedures. SQL Server 2012 or 2014.
  • KB 949296 – Replication agents cannot run when you have many agents and the Windows desktop heap is “used up”. (Thanks to Michael Bourgon for suggesting we link to this one.)
  • The Log Reader Agent may fail to start on SQL Server 2012 Service Pack 1.

Public Service Announcement: Replication Needs Special Treatment for Hotfixes and Upgrades

With any hotfixes, it’s always good to review KB 941232, “How to apply a hotfix for SQL Server in a replication topology.” (You’ll need special steps if you’re using Database Mirroring or AlwaysOn Availability Groups.)

Be careful with the steps you take to upgrade and ‘drain’ replicated transactions. If you don’t do this, in some cases you may have big problems during an upgrade.

Got a Transactional Replication Horror Story or Recommendation?

Or do you have a favorite setting I haven’t mentioned here, or something you disagree with? Replication is full of controversy! Share it with us in the comments.

  1. Very useful post. I’ve been using transactional replication for years, but this gave me some more things to consider. Just an FYI, your link for “configured monitoring for transactional replication” is not working. Thanks for the post.

  2. One setting I’ve found invaluable is @immediate_sync – when you use the wizard to set up replication and accept the defaults, it’s set to true. That means any time you add an article to a publication and run the snapshot agent, the snapshot contains all articles. If you set it to false, the snapshot will only contain the newly published article(s).

    This is very, very useful. The following post is the one I keep handy for the syntax:
    http://www.replicationanswers.com/TransactionalOptimisation.asp

    • Oooo, nice link. I’ll update the post in a bit to add something about immediate_sync.

      I have a fear of having immediate_sync set to true because of the SQL Server 2005 bug I hit that’s listed under “Unexpectedly inactive subscriptions.”

    • This is now #7 in the list. It ranks high enough not to be added at the end, I think!

  3. If you have a large shop you definitely want to change some of the settings for the Log Reader/Distribution Agents. For example: CommitBatchSize, CommitBatchThreshold, PacketSize, MaxBcpThreads.

    • Hey Taiob,

      These are the settings that I think people need to be careful with and prove that changing them has a positive impact on performance. I’ve known many people to randomly tweak these and not know if it helps or hurts, or if they set them to a good or bad value. Then they’re afraid to change them back.

      So I don’t think it’s bad to tweak these, I just think these are more “fine tuning” settings and would look at everything in this post first, and also make sure I have some way to prove these help (which can be tricky).

      kl

  4. You might include something about the whole desktop heap issue (TL;DR – too many agents use up a particular memory bucket, which can only be a certain size regardless of the amount of RAM you have).

    We’ve run into it because we have a large number of subscribers and a large number of publications. The desktop heap issue is a silent, deadly one. The first time you hear about it is because it turns out one of your agents hasn’t done anything since it started. If you’re lucky, you have periodic subscribers that fall out of sync, alerting you something’s amiss. I’ve written 3 different automated repl monitors to try and catch them in different ways.

    • Ooo, I’m glad you mentioned this! I was reading KB 949296 (http://support.microsoft.com/kb/949296) the other night and wondered, “does this really happen to a lot of people?”

      Couple questions if you wouldn’t mind answering them. Do you have more than 200 agents? Or has it hit you with less than that? (That’s the number they mention in the KB article.)

      What’s the best approach you’ve found to monitoring this?

      I added a link to this to #11 in this post, and also added a mention over in our post on Monitoring Transactional Replication (http://www.brentozar.com/archive/2014/07/monitoring-sql-server-transactional-replication/)

      • Ha! I suspect we’re an outlier (we have our subs broken up because we have several large tables replicated, and different levels of latency are required for others – just because table X is massively updated once a week, we don’t want slowdowns in table Y).

        (Aside: SQL Server 2000 didn’t have this problem, and it was because the agent worked differently, and (IIRC) all pubs going to the same sub used the same agent. When 2005 came out they changed it, which was good (one bad pub couldn’t invalidate others), but caused issues with number of the distrib.exe running.)

        200 seems very… optimistic. We had to change things around 150. The way we initially caught it was because we had a monitor to make sure tables were in sync across servers, and some just wouldn’t sync, or would (and then others would fail). We wound up writing 3 separate replication monitors for it (they’re on my blog, fwiw).

        The initial fix at that point was to move periodic replications from continuous to every X minutes, and bump up the desktop heap by 64kb (and later 128kb). I don’t remember seeing the error messages that the KB mentions, and I remember the jobs were succeeding. Once we added more subs, though, we had to move to the two-tier architecture, which gave us more heartache (but a ton more flexibility).

        • If you’re an outlier then I’m there with you. I agree with your threshold of 150, ours was a little under that actually. We hit a similar wall with desktop heap on Server 2012/SQL 2012 (same was true for 2005/2008). We’re at 180 agents now (log reader + distribution agents) which run continuously. Our snapshot agents only run once/week and they’re staggered so they don’t overlap. If they all run at the same time then our distribution server will croak. Currently our desktop heap is at 2048KB which I think allows us to go past the 150 mark. I call baloney on the 200 mark in the KB, at least without changing the heap settings.

  5. Kendra,
    Thank you for your post. I read it with much interest. I am new to replication and am trying to take it all in. Can you recommend any of your training, in person or video, which would be useful related to replication?

  6. Also, wanted to add tablediff.exe to the mix. Even though this utility has existed since SQL 2005, we’ve just discovered it. We have it running across our entire environment and it automatically corrects any data discrepancies within our replicated tables. Here’s the MSDN article explaining it for anyone interested.

    http://msdn.microsoft.com/en-us/library/ms162843.aspx

    Thus far the tool has been wonderful and has relieved my team from the manual data synch burden that has plagued us for years. The larger question I have for MSFT is, why can’t this tool just be rolled into the native replication toolkit, why the separate .exe outside of SQL? But I suppose that’s another topic for another day…

  7. Thanks Kendra, great stuff! We encountered the “watermark”-bug which lead to this: http://support.microsoft.com/kb/2922526 – fixed in CU13 (SP2) for SQL 2008 R2. It was nasty to say the least…

    I would strongly recommend that if you do use [@sync_type=] ‘replication support only’ that you only do so on SQL 2008 R2 and higher…

    The -SkipErrors parameter of the Distribution Agent did help (not advised), as did redgate Data Compare!

  8. Hi Kendra

    Your RCSI link above hits a 404. Can you post a new link or, better yet, give us your opinion on the use of RCSI on a transactional replication subscription?

    Thanks!

  9. I think it’s also important to have structure and find a good way to organise this all. Some things you’d written here are very applicable to my own behavior. Like only making snapshots when I need them. That’s very frustrating, because I often need them and then I have to invest time in this. Even though I could rather do a bunch in just one try. But I like the way you think, so I think your checklist is definitely going to help me out.

