It was the best of times, it was the worst of times. I was a SQL Server DBA, and if something went wrong in Transactional Replication I needed to find out about it right away and help keep things healthy, day or night. Here’s what I learned from that experience about monitoring replication.
If you’re just getting started and need an introduction to transactional replication, head over here.
Tracer Tokens Aren’t Really Your Friend
“Tracer Tokens” were introduced in SQL Server 2005. They sound awfully good. Books Online explains that you can automate them using sys.sp_posttracertoken and report on them using sp_helptracertokenhistory.
There’s a big problem: tracer tokens are too patient.
Let’s say my replication is incredibly overwhelmed and I send out a tracer token. I won’t hear back until it reaches its destination or definitively fails. That could be a very, very long time. The fact that it’s potentially unknown means I don’t want to rely heavily on it for monitoring.
Don’t Rely Too Much on Replication Monitor (REPLMON.exe)
When replication is behind, it’s natural to turn to Replication Monitor. The first five links in “Monitoring Replication” in Books Online point to it, after all.
Replication Monitor isn’t all bad. But don’t depend on it too much, either.
- Replication Monitor is a tool to help you answer the question “how are things doing right now?” It doesn’t baseline or give the kind of historical info that your manager wants to see.
- Replication Monitor may run queries to count the number of undistributed commands that may take a while to run and be performance intensive (particularly when things get backed up in the distributor).
I’ve personally seem some cases where running more than one instance of Replication Monitor while a publication snapshot was being taken also caused blocking. Too many people checking to see “how much longer will this take?” actually caused things to take longer. It’s not just me, Microsoft recommends you avoid running multiple instances of Replication Monitor.
Replication Monitor is useful, but you’re better off if people can get information on replication health without everyone having to run Replmon. You can do this fairly easily by using simpler tools to create dashboards to chart replication latency.
Easy Replication Monitoring: Alert on Latency with Canary Tables
It’s easy to build your own system for tracking replication latency for each publication. Here are the ingredients for the simplest version:
- Add a table named dbo.Canary_PubName to each publication
- dbo.Canary_PubName has a single row with a datetime column in it
- A SQL Server Agent job on the publisher updates the datetime to the current timestamp every minute
- A SQL Server Agent job on the subscriber checks dbo.Canary_PubName every minute and alerts if the difference between the current time and the timestamp is greater than N minutes
It’s very simple to extend this to a simple dashboard using a third party monitoring tool or SQL Server Reporting Services: you simply poll all the dbo.Canary tables and report on the number of minutes of latency on each server.
This simple process gets around the weaknesses of tracer tokens, and also gives you immediate insight into how much latency you have on each subscriber. Bonus: this exact same technique also works well with logshipping and AlwaysOn Availability Groups. Tastes great, less filling.
Medium Replication Monitoring: Notify when Undistributed Commands Rise in the Distribution Database
The distribution database is a special place for Transactional Replication. The log reader agent pulls information on what’s changed from the transaction log of the publication database and translates it into commands that hang out in the distribution database before the changes go out to subscribers.
If you have a lot of data modification occurring on the publisher, you can get a big backup of commands in the distribution database.
If replication performance is important, set up a SQL Server Agent job on your distribution server to regularly check the amount of undistributed commands with a script like Robert Davis provides here. Have it alert you when the commands go above a given threshold.
Real world example: When I was the DBA for an environment with mission-critical replication, we would warn when undistributed commands rose above 500K and create a severity-1 ticket when they rose above 1 million. We did this after setting up dashboards to baseline replication latency and also baselining the amount of undistributed commands in distribution, so that we knew what our infrastructure could recover from and what might need DBA attention to recover in time.
Difficult Replication Monitoring: Alert When Individual Articles are Unhealthy
Here’s where things get tricky. It’s very difficult to prove that all articles in replication are healthy. The steps up to this point have tracked latency for the entire publication and bottlenecks in the distribution database.Things get pretty custom if you need to prove that individual tables are all up to date.
I once had a situation where a code release removed some articles from replication, modified the tables and data significantly, then re-added the articles to replication.
There was an issue with the scripts and one of the articles didn’t get put back into replication properly at the end of the process. Replication was working just fine. No script had explicitly dropped the table from the subscriber, so it just hung out there with stale data. The problem wasn’t discovered for a few days, and it was a bit difficult to track down. Unfortunately, the next week was kind of a downer because a lot of data had to be re-processed after that article was fixed.
Here’s what’s tricky: typically some articles change much more often than others. Monitoring individual articles typically requires baselining “normal” latency per article, then writing custom code that checks each article against the allowed latency. This is significantly more difficult for any large articles that don’t have a “Last Modified Date” style column.
(Disclaimer: in the case that you don’t have a “Last Modified” date on your subscriber, I do not suggest layering Change Tracking on top of the replication subscriber. If you are tempted to do that, first read my post on Performance Tuning Change Tracking, then go through all the steps that you would do if you needed to re-initialize replication or make schema changes on articles. You’ll change your mind by the end.)
Special Cases: The “Desktop Heap” is Used Up
This is a special case for replication. If you have a large amount of replication agents on a single server (such as 200 or more), you may run into issues where things just silently stop working due to desktop heap exhaustion. This is an issue that can be hard to identify because the agents just stop working!
Canary tables can help monitor for this, but you’ll need a lot of them since this can happen on an agent-by-agent basis. Read more about fixing desktop heap problem in replication in KB 949296. (Thanks to Michael Bourgon for suggesting we include this.)
Test Your Monitoring out in Staging
The #1 mistake I find with transactional replication is ignoring the staging environment. This is critical to supporting replication and creating effective monitoring for it.
The staging environment isn’t the same thing as development or QA. It’s a place where you have the same number of SQL Server instances as production, and the same replication setup as production. You test changes against staging before they go to production. You can also use it to test replication changes.
Staging is also where you confirm that your replication monitoring works. Data probably doesn’t constantly change in your staging environment, but that’s OK. Use canary tables and get creative to simulate load for test purposes.
Do You Have a Technique for Monitoring Replication Not Listed Here?
Tell us about it in the comments!
Writing blog posts on transactional replication is like revisiting childhood trauma.
— Kendra Little (@Kendra_Little) June 23, 2014