Monitoring SQL Server Transactional Replication

“It will catch up eventually. I hope.”

It was the best of times, it was the worst of times. I was a SQL Server DBA, and if something went wrong in Transactional Replication I needed to find out about it right away and help keep things healthy, day or night. Here’s what I learned from that experience about monitoring replication.

If you’re just getting started and need an introduction to transactional replication, head over here.

Tracer Tokens Aren’t Really Your Friend

“Tracer Tokens” were introduced in SQL Server 2005. They sound awfully good. Books Online explains that you can automate them using sys.sp_posttracertoken and report on them using sp_helptracertokenhistory.

There’s a big problem: tracer tokens are too patient.

Let’s say my replication is incredibly overwhelmed and I send out a tracer token. I won’t hear back until it reaches its destination or definitively fails. That could be a very, very long time. The fact that it’s potentially unknown means I don’t want to rely heavily on it for monitoring.
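If you’d like to see that patience problem for yourself, here’s roughly what automating a token looks like (a sketch; the publication name is a placeholder):

-- Run in the publication database on the publisher.
-- N'MyPublication' is a placeholder; use your publication's name.
DECLARE @tokenID int;
EXEC sys.sp_posttracertoken
    @publication = N'MyPublication',
    @tracer_token_id = @tokenID OUTPUT;

-- Later, see how far the token got. Legs the token hasn't reached yet
-- come back NULL, and you're left waiting.
EXEC sys.sp_helptracertokenhistory
    @publication = N'MyPublication',
    @tracer_id = @tokenID;

Note that sp_helptracertokenhistory also runs at the publisher, in the publication database.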

Don’t Rely Too Much on Replication Monitor (REPLMON.exe)

When replication is behind, it’s natural to turn to Replication Monitor. The first five links in “Monitoring Replication” in Books Online point to it, after all.

Replication Monitor isn’t all bad. But don’t depend on it too much, either.

  • Replication Monitor is a tool to help you answer the question “how are things doing right now?” It doesn’t baseline or give the kind of historical info that your manager wants to see.
  • Replication Monitor may run queries to count the number of undistributed commands; those queries can take a while to run and be performance intensive (particularly when things get backed up in the distributor).

I’ve personally seen some cases where running more than one instance of Replication Monitor while a publication snapshot was being taken also caused blocking. Too many people checking to see “how much longer will this take?” actually caused things to take longer. It’s not just me: Microsoft recommends you avoid running multiple instances of Replication Monitor.

ReplMon protip: You can disable automatic refreshing for the Replication Monitor UI, and just refresh the data when you need it. More info in Books Online here. (Thanks to John Samson for this tip.)

Replication Monitor is useful, but you’re better off if people can get information on replication health without everyone having to run Replmon. You can do this fairly easily by using simpler tools to create dashboards to chart replication latency.

Easy Replication Monitoring: Alert on Latency with Canary Tables

It’s easy to build your own system for tracking replication latency for each publication. Here are the ingredients for the simplest version (a T-SQL sketch follows the list):

  • Add a table named dbo.Canary_PubName to each publication
  • dbo.Canary_PubName has a single row with a datetime column in it
  • A SQL Server Agent job on the publisher updates the datetime to the current timestamp every minute
  • A SQL Server Agent job on the subscriber checks dbo.Canary_PubName every minute and alerts if the difference between the current time and the timestamp is greater than N minutes
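Here’s a minimal sketch of those ingredients in T-SQL. The table name, the five-minute threshold, and the RAISERROR alerting are placeholders; wire the check into whatever notification your Agent jobs already use.

-- One-time setup in the published database (names are examples).
-- Transactional replication needs a primary key on published tables.
CREATE TABLE dbo.Canary_PubName
(
    ID tinyint NOT NULL CONSTRAINT PK_Canary_PubName PRIMARY KEY,
    LastUpdated datetime2(0) NOT NULL
);
INSERT dbo.Canary_PubName (ID, LastUpdated) VALUES (1, SYSDATETIME());

-- Agent job step on the publisher, scheduled every minute:
UPDATE dbo.Canary_PubName SET LastUpdated = SYSDATETIME() WHERE ID = 1;

-- Agent job step on the subscriber, scheduled every minute. A severity 16
-- error fails the job step, so your normal Agent notifications fire.
DECLARE @latency_minutes int;
SELECT @latency_minutes = DATEDIFF(MINUTE, LastUpdated, SYSDATETIME())
FROM dbo.Canary_PubName;

IF @latency_minutes > 5 -- your N-minute threshold, from baselining
    RAISERROR('Replication latency is %d minutes.', 16, 1, @latency_minutes);

One caveat: this compares the publisher’s clock to the subscriber’s, so keep the servers’ clocks in sync or your latency numbers will drift with them.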

It’s easy to extend this into a simple dashboard using a third-party monitoring tool or SQL Server Reporting Services: poll all the dbo.Canary tables and report the number of minutes of latency on each server.

This simple process gets around the weaknesses of tracer tokens, and also gives you immediate insight into how much latency you have on each subscriber. Bonus: this exact same technique also works well with log shipping and AlwaysOn Availability Groups. Tastes great, less filling.

Medium Replication Monitoring: Notify when Undistributed Commands Rise in the Distribution Database

The distribution database is a special place for Transactional Replication. The log reader agent pulls information on what’s changed from the transaction log of the publication database and translates it into commands that hang out in the distribution database before the changes go out to subscribers.

If you have a lot of data modification occurring on the publisher, you can get a big backlog of commands in the distribution database.

If replication performance is important, set up a SQL Server Agent job on your distribution server to regularly check the number of undistributed commands. Have it alert you when the commands go above a given threshold.
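For a sketch of what that job step could run, assuming your distribution database uses the default name and that the thresholds come from your own baselines:

-- Run on the distributor. The threshold is an example, not a recommendation.
SELECT da.name AS agent_name,
       SUM(ds.UndelivCmdsInDistDB) AS undistributed_commands
FROM distribution.dbo.MSdistribution_status AS ds
JOIN distribution.dbo.MSdistribution_agents AS da
    ON da.id = ds.agent_id
GROUP BY da.name
HAVING SUM(ds.UndelivCmdsInDistDB) > 500000; -- warning threshold

If the query returns rows, have the job step send mail or raise an error so your alerting picks it up.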

Real world example: When I was the DBA for an environment with mission-critical replication, we would warn when undistributed commands rose above 500K and create a severity-1 ticket when they rose above 1 million. We did this after setting up dashboards to baseline replication latency and also baselining the number of undistributed commands in distribution, so that we knew what our infrastructure could recover from and what might need DBA attention to recover in time.

Difficult Replication Monitoring: Alert When Individual Articles are Unhealthy

Here’s where things get tricky. It’s very difficult to prove that all articles in replication are healthy. The steps up to this point have tracked latency for the entire publication and bottlenecks in the distribution database. Things get pretty custom if you need to prove that individual tables are all up to date.

I once had a situation where a code release removed some articles from replication, modified the tables and data significantly, then re-added the articles to replication.

There was an issue with the scripts and one of the articles didn’t get put back into replication properly at the end of the process. Replication was working just fine. No script had explicitly dropped the table from the subscriber, so it just hung out there with stale data. The problem wasn’t discovered for a few days, and it was a bit difficult to track down. Unfortunately, the next week was kind of a downer because a lot of data had to be re-processed after that article was fixed.

Here’s what’s tricky: typically some articles change much more often than others. Monitoring individual articles typically requires baselining “normal” latency per article, then writing custom code that checks each article against the allowed latency. This is significantly more difficult for any large articles that don’t have a “Last Modified Date” style column.
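For the articles that do have such a column, the check can be as small as the sketch below. dbo.SomeArticle and the lag threshold are hypothetical, and a table that legitimately goes quiet will false-alarm, which is exactly why the per-article baselines matter:

-- Run on the subscriber. dbo.SomeArticle and the lag value are placeholders.
DECLARE @allowed_lag_minutes int = 60; -- from baselining this article
DECLARE @last_change datetime2;

SELECT @last_change = MAX(LastModifiedDate) FROM dbo.SomeArticle;

IF @last_change IS NULL
   OR DATEDIFF(MINUTE, @last_change, SYSDATETIME()) > @allowed_lag_minutes
    RAISERROR('dbo.SomeArticle may be stale on this subscriber.', 16, 1);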

(Disclaimer: in the case that you don’t have a “Last Modified” date on your subscriber, I do not suggest layering Change Tracking on top of the replication subscriber. If you are tempted to do that, first read my post on Performance Tuning Change Tracking, then go through all the steps that you would do if you needed to re-initialize replication or make schema changes on articles. You’ll change your mind by the end.)

Special Cases: The “Desktop Heap” is Used Up

This is a special case for replication. If you have a large number of replication agents on a single server (such as 200 or more), you may run into issues where things just silently stop working due to desktop heap exhaustion. This is an issue that can be hard to identify because the agents just stop working!

Canary tables can help monitor for this, but you’ll need a lot of them since this can happen on an agent-by-agent basis. Read more about fixing the desktop heap problem in replication in KB 949296. (Thanks to Michael Bourgon for suggesting we include this.)

Test Your Monitoring out in Staging

The #1 mistake I find with transactional replication is ignoring the staging environment. This is critical to supporting replication and creating effective monitoring for it.

The staging environment isn’t the same thing as development or QA. It’s a place where you have the same number of SQL Server instances as production, and the same replication setup as production. You test changes against staging before they go to production. You can also use it to test replication changes.

Staging is also where you confirm that your replication monitoring works. Data probably doesn’t constantly change in your staging environment, but that’s OK. Use canary tables and get creative to simulate load for test purposes.
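For example, even a dumb loop that updates a replicated row over and over gives the log reader and distribution agents real work to do while you watch your monitoring react. A hypothetical load generator along those lines:

-- Run in the published database in staging. This reuses the canary table
-- from above; any replicated table will do.
SET NOCOUNT ON;
DECLARE @i int = 0;
WHILE @i < 100000
BEGIN
    UPDATE dbo.Canary_PubName SET LastUpdated = SYSDATETIME() WHERE ID = 1;
    SET @i += 1;
END;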

Do You Have a Technique for Monitoring Replication Not Listed Here?

Tell us about it in the comments!


79 Comments

  • Chris Miller
    July 2, 2014 10:42 am

    Replication is the worst possible solution to any given problem. Sometimes it’s the only solution, but it’s always the worst.

    Replication Monitor is kinda neat, but it’s very limited. I get concerned when I see all the latency times exactly the same or even worse when they’re all at 0. Either of those are probably critical fails somewhere, and indicative of (usually) undistributed commands. Monitoring undistributed commands I’ve found to be super helpful, and the canary table (we call it watchdog) is just critical. Hooking that date/time up to a monitor package (PRTG or whatever) is really nice.

    My favorite thing about Replication Monitor though is how every now and then it’ll show “Uninitialized Subscription” instead of “Running” and then fix itself the next refresh. I really like that little rush of heart skipping a beat there.

    • Kendra Little
      July 2, 2014 10:44 am

      PRTG! I was tempted to actually link to that, I really like it.

      I stopped for a moment and wondered, “Is replication always the worst? I could probably do something even more terrible with triggers.”

      And then I realized: Triggers PLUS replication, that would be the worst.

      So I think maybe you’ve got something there.

  • I like replication. (No, I’m not crazy. My mother had me tested.)

    • Kendra Little
      July 2, 2014 10:45 am

      Haha! I’m really glad you like replication, otherwise I wouldn’t have had a query for undistributed commands to link to! (I wrote one waaaay long ago, but haven’t looked at it in forever.)

  • It’s interesting that you don’t like tracer tokens but suggest using a canary table. Tracer tokens are the canary that’s built into the system. If you’re going to automate an update to a dbo.Canary table, you can certainly automate a call to sp_posttracertoken. The advantages that I see are that the tracer token gives you an indication of what point in the chain is slow (that is publisher → distributor or distributor → subscriber) and that you get history. I just wrote a quick post about it (http://www.spartansql.com/2014/07/looking-at-tracer-tokens.html).

    • Kendra Little
      July 2, 2014 11:05 am

      I think the canary system is simpler and more reliable. What a manager wants to know is how latent the data is in the subscribers– not what the distributor thinks about latency.

      You can also run into pitfalls with the tracer token where you have to account for a token having not reached a destination. Yeah, you can code around it, but it’s much easier to just check LastModifiedDate in a canary table and return the difference. There’s also not a lot that can break in my monitoring if I’m applying upgrades to SQL Server for a simple Canary table solution, but they could change the tracer token procs.

      I have nothing against using tracer tokens for diagnostic purposes for finding out where the latency is. I just think there’s a much simpler, more elegant way to do monitoring.

      [Edit: wording issues]

      • I think we’ll have to agree to disagree, but if you’re posting tokens regularly (i.e. through a schedule), the answer to “how far behind is the subscriber?” is knowable. Moreover, I think what the proverbial manager really wants to know is “when will it be current again?”. In my experience, I can at least start looking in the right place if I have an idea of which leg is slow. Most times it’s the d → s leg, but on the off chance that it isn’t I don’t want to be looking there and saying “everything’s fine here, boss”.

        • Kendra Little
          July 2, 2014 11:22 am

          Oh, yeah– I’m not saying that there’s no way to make a tracer token solution work. I’m just saying you have to code around some of its inherent issues, and if someone had to debug your monitoring code when you’re out of the office it would be a LOT harder for them.

          One of the biggest pain points I find people have with replication is its complexity. So if there are any opportunities to simplify, I’m all about that.

  • I just ran into an issue yesterday where my undistributed transaction count was very high, but looking at my data, checking timestamps, and running the replication validation check all came back fine. Replication Monitor itself was green and showed no latency, so there was a disconnect.

    We had a network outage for over an hour; it seemed as though we hit a timeout point where the Replication Agents Checkup job was no longer cleaning up. Fortunately a restart of the agents on both sides cleared it up, but I was able to use some of this as a reference for troubleshooting.
    http://blogs.msdn.com/b/sqlserverfaq/archive/2010/09/17/sometimes-replication-monitor-shows-number-of-undistributed-transactions-as-a-very-high-number-but-replication-itself-works-properly.aspx

    We do have a monitoring solution in place, but I think it was timing out on getting the true number because of the high number of transactions listed in the undistributed queue that had not been cleared out.

    I can’t wait to get away from Peer-to-Peer replication; we’ve been fortunate to recover from the peer-to-peer conflicts, but taking downtime to add new articles is a massive pain to coordinate. We’re moving to Multi-subnet Availability Groups, which I know have some other considerations, but they look to address a number of our requirements better now than replication.

    • Kendra Little
      July 2, 2014 1:08 pm

      Oh, interesting! What were you using to count the number of undistributed commands?

      • I meant to add in my original post that in certain situations, you still want to manually verify beyond your monitor. We had just had an incident, so I was manually checking the system via Replication Monitor as well as my monitoring tool. I found the number by chance by looking in Replication Monitor; this situation was a bit unusual as we had to fail over to our DR site due to the network issues, which also caused an update-update conflict from an inflight transaction before the network outage (fortunately recoverable without a restore).

        In my case, I don’t believe the tracer or canary items would have helped, as replication was still running successfully – it was the history cleanup that was failing.

        With the above link, I pulled out the data to try to confirm. My updated script:
        ###Part 1:
        Declare
        @PublisherDB sysname,
        @Publication sysname,
        @agentid sysname;

        -- Set Publisher database name
        Set @PublisherDB = N'';
        -- Set Publication Name
        Set @Publication = N'';
        -- Get agent id
        Set @agentid = (select id from MSdistribution_agents where publisher_db = @PublisherDB and subscriber_db = @PublisherDB and Publication = @Publication)

        -- Get xact_seqno & delivery_rate
        select top 1 xact_seqno, delivery_rate from dbo.MSdistribution_history
        where agent_id = @agentid
        and runstatus in (2,3,4) order by time desc;

        -- Returned:
        -- xact_seqno                          delivery_rate
        -- 0x0003B65E000474200001000000000000  13.05

        ###PART 2:
        DECLARE @avg_rate int
        DECLARE @retcode int
        DECLARE @mstestcountab TABLE ( pendingcmdcount int )

        select @avg_rate = isnull(avg(13.05),0.0)
        from dbo.MSdistribution_history
        where agent_id = 3

        insert into @mstestcountab (pendingcmdcount)
        exec @retcode = sys.sp_MSget_repl_commands
        @agent_id = 3
        ,@last_xact_seqno = 0x0003B65E000474200001000000000000
        ,@get_count = 2
        ,@compatibility_level = 9000000

        select pendingcmdcount from @mstestcountab

        select
        pendingcmdcount
        ,N'estimatedprocesstime' = case when (@avg_rate != 0.0)
        then cast((cast(pendingcmdcount as float) / @avg_rate) as int)
        else pendingcmdcount end
        from @mstestcountab

  • Great post Kendra.

    Similar to your Canary table solution, you can still rely on tracer tokens by creating a temp table, inserting the tracer token ID and post time into it, waiting for a delay, and then executing sp_helptracertokenhistory to see if the token has posted to the Distributor and/or Subscriber(s) yet.

    This logic can be put into a loop and if the token has not posted at the Distributor and/or Subscriber(s) within a predetermined threshold, you can consider the latency to be alert-worthy and notify operators appropriately.

    • Kendra Little
      July 2, 2014 1:07 pm

      Yep, Ben suggested something similar.

      I’m not a huge fan of it because you have to do some very specific things to avoid the problem of “the tracer token waits to finish”.

      And also, you’re really just querying the distributor when you do this. If the distributor is having problems, don’t you lose all the info about how far behind all the subscribers are?

      So while I do see that this is possible and some people seem to like it, the code complexity seems high to me and it just seems like there’s a lot of pitfalls that aren’t there in a simpler canary table scenario.

      • If the distributor is having problems, the result set will return NULL. In that case, you can still look at the post time in your temp table to see if you have reached your threshold and alert if necessary. It’s quite simple actually.

  • Regarding the Canary table and adding it to each publication, what about the situation where you have multiple publications and each could have vastly different latency values? Would that Canary table have a PublicationName, or similar, column? What would you suggest in this case?

    • Tara – I like doing it with a different Canary table per publication, but I’m crazy like that.

      • Crazy indeed. I imagine a multi-purpose table instead: MonitorName, UpdateTime. Then each thing you want to monitor has a row in the table.

        Either way works, I just prefer less clutter. I’m a control freak type DBA. ;D

        Thanks for the idea of a Canary table! I am going to make good use of it.

        • Kendra Little
          July 2, 2014 2:22 pm

          This is a great question. I updated the post to refer to dbo.Canary_PubName so it’s a little clearer.

          Since there can be performance issues with having non-filtered articles in multiple publications, I don’t like to violate that rule even for monitoring. Makes it easier to run a quick query checking for that and have it come back totally clean.

          • Heather Wright
            December 2, 2014 6:32 pm

            Would a single canary table work if the articles were added as filtered (on their own row) articles?

            Great article (no pun intended) Kendra!

          • It may be a lack of coffee at this time of morning, but I’m not sure what you mean?

            (We’re at the end of comment threading, so to reply back, you’ll probably have to add a new comment that’s not a reply.)

  • Great post Kendra. Thanks

  • I use this to count undistributed commands. I know I am hard-coding the agent ids here; that is to avoid a scan on ms_repl and msdistribution_history.
    SELECT
    agent_id
    , SUM(UndelivCmdsInDistDB) AS [pending_transaction]
    FROM
    distribution..MSdistribution_status s WITH ( NOLOCK )
    WHERE
    agent_id IN ( 18, 19, 25, 30, 31, 40, 41, 42, 43, 46 )
    GROUP BY
    agent_id

    • Brandon Williams
      July 3, 2014 11:52 pm

      Very nice.

      You can also report on name (distribution agent name), publication name, subscriber_id, subscriber_db, and other columns if you join on MSdistribution_agents on agent id.

  • IMHO, adding a table to the publication is really reinventing the wheel, especially when you have multiple publications in multiple publishers. The best way is to have a central monitoring server send out a “monitoring script,” based on tracer tokens, to each publisher and then get the result. The beauty of the tracer token method is you will know which part takes the longest time in the path from the publisher to the subscriber. For example, if you see a prolonged latency between publisher and distributor, you may need to check whether the log reader is under pressure (due to huge transaction logs generated at the publisher side). I once used both ways (i.e. tracer tokens and canary tables): tracer tokens for monitoring, and canary tables for verification, i.e. whether the publication really works, because sometimes an article in the publication may get an error while the other articles are fine. I just need my canary table for (auto) tests during the troubling time.

    • Jeff – OK, cool, sounds like you’ve got your solution all under control. Feel free to blog about your solution or post an example online, too – other folks would love to see how you’re doing it. Have a great 4th!

    • Kendra Little
      July 7, 2014 12:51 pm

      I totally agree with Brent — would love to see the details. I think the most interesting thing would be for you to include how you work around the problem of potential delayed notification from tracer tokens due to their “patience”.

      • Kendra, you mentioned “Let’s say my replication is incredibly overwhelmed and I send out a tracer token. I won’t hear back until it reaches its destination or definitively fails. That could be a very, very long time. The fact that it’s potentially unknown means I don’t want to rely heavily on it for monitoring.” But if this is a problem, it is the same problem for a canary table. When using tracer tokens for monitoring, I usually define a threshold for the total latency (based on the average latency history, which can be dynamically auto-adjusted, like average latency for the last 7 days), and if latency goes beyond the threshold by 50%, an alert will be issued.

        Anyway, I am back to this article because I am writing an article for mssqltips.com at this time and I want to do a quick research to ensure I am not writing something that has already been written.

  • John Stafford
    July 4, 2014 3:07 am

    Like Kendra, I don’t find tracer tokens much use – except to confirm there is a problem when they don’t make it to the subscriber!
    Instead I use custom metrics in Red Gate’s SQL Monitor to check how many undistributed commands there are for a specific publication and send out an email with status Medium/High based on thresholds I have set for that particular publication (as some are much busier than others). I use a slightly modified version of the system sproc sp_replmonitorsubscriptionpendingcmds to just return the one value for pendingcmdcount.

  • Thanks for the new ideas!

    We use a tracer token solution that runs every half hour, meant to supplement Replication Monitor alerts, that can dynamically send tokens through all the pubs in a database. The SProc will send an email alert of all the tokens that do not complete in 180 seconds. This has only been useful when we see these alerts for several hours based on our SLAs, but nonetheless, they have been useful.

    We also have 3 scheduled outages for rebuilding indexes, at which time we also run a SProc that compares row counts between the subscriber and publishing databases and emails an alert with any tables that are out of synch.

    HTH,
    Dave

    • Kendra Little
      July 7, 2014 12:41 pm

      Interesting. How did you rig it up so that the sproc doesn’t wait on the tokens? (IE if a token takes 3 hours to complete, how does the sproc not return an error for 3 hours?)

  • I could never get tracer tokens to work years ago in SQL 2008 so I abandoned that strategy, although the canary tables mentioned above seem interesting. Instead I hijacked the system tables for replication monitor and scrape that for latency data. Granted it’s not always accurate, but by adding in a 2nd criteria for # of undelivered commands it helps to filter out the bogus latency alerts from replication monitor. I just insert the data into a remote table and alert off of it. This method isn’t as concrete, but historically it’s been good enough to at least let my team know something is wrong with replication and start an investigation. This only runs on the distribution server for transactional replication without updateable subscribers. Hope it helps:

    SELECT
    md.publisher,
    md.publication,
    md.agent_name,
    md.publisher_db,
    a.article,
    ds.UndelivCmdsInDistDB,
    CASE WHEN md.[status] = 1 THEN 'Started'
    WHEN md.[status] = 2 THEN 'Succeeded'
    WHEN md.[status] = 3 THEN 'In Progress'
    WHEN md.[status] = 4 THEN 'Idle'
    WHEN md.[status] = 5 THEN 'Retrying'
    WHEN md.[status] = 6 THEN 'Failed'
    ELSE 'Other'
    END [Status],
    CASE WHEN md.warning = 0 THEN 'OK'
    WHEN md.warning = 1 THEN 'Expired'
    WHEN md.warning = 2 THEN 'Latency'
    ELSE 'OTHER'
    END [Warnings],
    md.cur_latency / 60.0 AS [Latency(min)],
    @@SERVERNAME
    FROM Distribution.dbo.MSdistribution_status ds WITH(NOLOCK)
    INNER JOIN Distribution.dbo.MSarticles a WITH(NOLOCK)
    ON a.article_id = ds.article_id
    INNER JOIN Distribution.dbo.MSreplication_monitordata md WITH(NOLOCK)
    ON md.agent_id = ds.agent_id
    WHERE
    ds.UndelivCmdsInDistDB > 500
    AND (md.cur_latency / 60.0) > 30
    --ORDER BY md.cur_latency DESC
    GO

    • Forgot to mention, we run these two commands on the distribution server to refresh the data prior to the select above. MS says @refreshpolicy is for internal use only. I think we profiled and caught SQL (2008) running these commands when we hit refresh so we copied/pasted and ran with it =). That said, I can’t vouch for the accuracy of the @refreshpolicy values below, but it seems to work in SQL 2012.

      exec [Distribution].sys.sp_replmonitorhelpsubscription @publisher = NULL, @publication_type = 1, @mode = 0, @exclude_anonymous = 0, @refreshpolicy = N'0'
      GO
      exec [Distribution].sys.sp_replmonitorhelppublication @publisher = NULL, @refreshpolicy = N'1'
      GO

      • Kendra Little
        July 7, 2014 3:59 pm

        Hey Leon,

        Are you checking the data returned from those for some values?

        In my experience people often run sys.sp_replmonitorrefreshjob @iterations=1, like Robert Davis mentions in the link I have above for monitoring undistributed commands. I’m not sure if it calls the procs that you’re referring to or if they’re related.

        Kendra

        • In the first script we check for latency > 30mins and # of undistributed commands > 500 which then drives our alerts.

          For the sp_replmonitorhelp procs above we don’t use the data returned at all. Our understanding was by passing in the values for @refreshpolicy it would update the data in MSreplication_monitordata. We could have been totally wrong in that assumption, if so please let me know.

          • Kendra Little
            July 8, 2014 12:52 pm

            Haha, I have no idea 🙂 I’ve never tested that parameter out with those.

        • btw, thanks for the sys.sp_replmonitorrefreshjob @iterations=1 suggestion. I’ll be sure to check it out. I think it says a lot about the state of SQL replication when undocumented stored procs and ‘internal use only’ parameters are being thrown around as recommended ways of monitoring.

  • I agree with Leon, and I also liked the sys.sp_replmonitorrefreshjob @iterations=1 suggestion. In my environment I have several distribution databases and several distribution servers, so I ended up creating a proc that can send an HTML report against the entire distributor server.

    CREATE procedure [dbo].[usp_ReplicationLatency_All]
    as
    begin
    set nocount on
    set transaction isolation level read uncommitted

    declare @distdbname varchar(100)
    declare @sql varchar(8000)

    set @distdbname = ''

    create table ##FinalResults
    (
    cur_latency numeric(9,2),
    publisher varchar(50),
    publisher_db varchar(50),
    publication varchar(50),
    article varchar(100),
    subscriber varchar(50),
    subscriber_db varchar(50),
    UndelivCmdsInDistDB varchar(50),
    agent_status varchar(15),
    schedule varchar(100)
    )

    while exists( select top 1 name from sys.databases where is_distributor = 1 and name > @distdbname )
    begin

    select top 1 @distdbname = name from sys.databases where is_distributor = 1 and name > @distdbname

    insert into ##FinalResults
    --set @sql =
    exec ('select
    convert(numeric(9,2),cur_latency/60.00),
    s.srvname,
    d.publisher_db,
    d.publication,
    b.article as article,
    c.srvname,
    a.subscriber_db,
    isnull(UndelivCmdsInDistDB,0),
    case mrm.status when 1 then ''Started''
    when 2 then ''Succeeded''
    when 3 then ''In progress''
    when 4 then ''Idle''
    when 5 then ''Retrying''
    when 6 then ''Failed''
    end as agent_status,

    case when msdb.dbo.udf_schedule_description(sc.freq_type,sc.freq_interval,sc.freq_subday_type,sc.freq_subday_interval,sc.freq_relative_interval,sc.freq_recurrence_factor,sc.active_start_date,sc.active_end_date,sc.active_start_time,sc.active_end_time) = ''Automatically starts when SQLServerAgent starts.''
    then ''Continuous'' else msdb.dbo.udf_schedule_description(sc.freq_type,sc.freq_interval,sc.freq_subday_type,sc.freq_subday_interval,sc.freq_relative_interval,sc.freq_recurrence_factor,sc.active_start_date,sc.active_end_date,sc.active_start_time,sc.active_end_time) end as schedule
    FROM ' + @distdbname + '.dbo.MSsubscriptions a with (nolock)
    inner join ' + @distdbname + '.dbo.MSarticles b with (nolock)
    on a.publication_id=b.publication_id and a.article_id =b.article_id
    inner join master..sysservers c with(nolock) on a.subscriber_id=c.srvid
    inner join master..sysservers s with (nolock) on a.publisher_id = s.srvid
    inner join ' + @distdbname + '.dbo.MSpublications d with (nolock) on a.publication_id = d.publication_id
    left outer join ' + @distdbname + '.dbo.MSdistribution_agents mdm
    on d.publication = mdm.publication
    and mdm.subscriber_id = c.srvid
    left outer join ' + @distdbname + '.dbo.MSdistribution_status mds
    on mds.article_id = b.article_id
    and mds.agent_id = mdm.id
    left outer join ' + @distdbname + '.dbo.MSreplication_monitordata mrm
    on mrm.agent_id = a.agent_id
    join msdb.dbo.sysjobs j
    on j.name = mrm.agent_name
    join msdb.dbo.sysjobschedules sjc
    on sjc.job_id = j.job_id
    join msdb.dbo.sysschedules sc
    on sc.schedule_id = sjc.schedule_id
    where mrm.agent_type = 3
    and convert(numeric(9,2),cur_latency/60.00) > 30
    and isnull(UndelivCmdsInDistDB,0) > 500')

    --print @sql

    end

    if (select count(1) from ##FinalResults where Agent_status != 'Succeeded') > 0
    begin

    DECLARE @tableHTML NVARCHAR(MAX) ,
    @subj varchar(100)
    SET @tableHTML =
    N'Replication Latency Report' +
    N'' +
    N'Please investigate on the below Replication for High Latency !!!'+
    N'Distribution Server: ' + @@SERVERNAME + N'' +
    N'' +
    N'' +
    N'Latency Mins'+
    N'PublisherPublisher_DBPublicationArticleSubscriberSubscriber_dbUndelievered' +

    N'DistributorStatus
    DistributionSchedule
    ' +
    CAST ( (select
    td= cast( convert(numeric(9,2),cur_latency) as varchar(10)),'',
    td= cast(publisher as varchar(50)),'',
    td= cast(publisher_db as varchar(50)),'',
    td= cast(publication as varchar(50)),'',
    td= cast(article as varchar(100)),'',
    td= cast(subscriber as varchar(50)),'',
    td= cast(subscriber_db as varchar(50)),'',
    td= cast(isnull(UndelivCmdsInDistDB,0) as varchar(50)),'',
    td = cast(agent_status as varchar(50)), '',

    td = cast(schedule as varchar(50))

    FROM ##FinalResults
    where Agent_status != 'Succeeded'
    order by publisher, publisher_db, cur_latency desc, publication,subscriber

    FOR XML PATH('tr'), TYPE
    ) AS NVARCHAR(MAX) ) +
    N'' --+

    SET @subj = 'Replication Latency Report ' + @@servername

    EXEC msdb.dbo.sp_send_dbmail
    @profile_name = '',
    @recipients = '',
    -- @query = N'select @@servername as DISTRIBUTION_SERVER',
    @subject = @subj,
    @body = @tableHTML,
    @body_format = 'HTML'

    end

    drop table ##FinalResults

    end

    Based on my experience, I have not been able to detect log reader latencies. The above mechanism only helps to detect a distributor to subscriber latency. I would like to know if you guys have any ways to track log reader latencies. I have not yet tested the Canary tables.

    • Kendra Little
      July 8, 2014 12:58 pm

      So, I didn’t mention this in the post, but I also did the HTML report thing. I was so torn on it, because it’s so tricky to edit and support (any null value causes a blank email), but it was also really fast for me to set up and it got me out of a pinch. Funny that you did the same thing!

      For the overall latency table, I do find the canary table to be the simplest, easiest option. You’ll see that others in the comments defend the honor of the tracer token, though.

  • John Halunen
    July 18, 2014 12:42 pm

    Rich and I ran into the desktop heap issue with our Search replications in SQL 2000. IIRC the limit for that version was ~165 and we went well over that on the distributor when we added a couple of new servers with many pubs. Was pulling my hair out for a few days…

    • Kendra Little
      July 18, 2014 12:43 pm

      Oh my goodness. I forgot that environment used replication in SQL Server 2000, and I didn’t recall hearing about the desktop heap issue at all. That sounds… awful. Congrats on surviving.

  • Anilkumar Jantikar
    July 23, 2014 4:48 am

    Yes, we get replication status by using the tables below.

    dbo.MSreplication_monitordata
    dbo.MSdistribution_agents
    dbo.MSpublicationthresholds

  • Thanks for this great article Kendra. I am using transactional replication at the moment and I didn’t know the limitations of Replication Monitor.

    Thanks

    Scott

  • Hi,

    we are facing an issue: after the re-indexing activity on the replication-configured DB, replication is not working and the transaction log file of the DB keeps growing.

    Please suggest what may be the reason.

    • Sunder – if you know the replication isn’t working, then you’ll need to troubleshoot why it’s not working. Unfortunately that’s beyond the scope of a blog post comment, but what kind of replication is it?

  • Hi Brent,

    Thanks a lot for your response. To elaborate on this more: we have configured transactional replication for a DB of almost 100 GB. The transactional replication works fine for six days; on Sunday we have a DB re-indexing activity scheduled. After the re-indexing activity, the replication stops working and the log file of the DB keeps growing.

    Now we cannot truncate the log file, and replication is also not working. We have checked the SQL error logs but are not able to find the actual reason.

    • Kendra Little
      August 1, 2014 10:04 am

      OK– I hope nobody’s *tried* truncating the log files. (I’m not sure what you mean by “now” we can not.) Doing that will definitely mess up replication.

      Have you looked through our post on perf tuning replication? (https://www.brentozar.com/archive/2014/07/performance-tuning-sql-server-transactional-replication-checklist/) This sounds to me like you could be suffering from issue #17 (excessive virtual log files), but that’s a big guess and it could also be a few other things in that list.

      • Hi Kendra ,

        Thanks much for your reply and support.

        Maybe I am not able to describe the issue to you properly. Let me try to explain again.

        After table re-indexing, the transaction log file grows to 100 GB and the replication stops working. So my question is: is the table re-indexing activity creating the issue for replication by growing the transaction log file to 100 GB?

        If we don’t perform the table re-indexing, the replication works fine, so I am not sure whether the re-indexing or the big transaction log file is creating the issue for transactional replication.

        • Kendra Little
          August 7, 2014 8:50 am

          I do understand what you’re saying. Index rebuilds put pressure on the transaction log, and can cause it to grow. Replication reads from the log. Therefore putting pressure on the transaction log can cause replication to slow down. This can be particularly bad if you have high fragmentation in virtual log files. It can also be due to a general lack of resources at that time, and you’d need to look at your wait stats to sort that out.

  • Based on your blog, I coded up a complete solution that’s in use with SSRS reporting.
    http://sqlworldwide.com/2014/09/11/monitoring-transaction-replication-latency-in-real-time/

  • Heather Wright
    December 8, 2014 4:34 pm

    Regarding adding a single canary table: would having a single table work if it had a row for each publication and each publication had the canary table (same table) as an article, but filtered on only its own row?

    You mentioned performance issues with having non-filtered articles in multiple publications, but if the articles were filtered (so that they each owned a single row in the table), could we get away with a single table then?

    Sorry, I hope that explains it a bit better.

    • Kendra Little
      December 8, 2014 6:19 pm

      Oh! I see what you mean now. That sounds like it would work. You’re testing something slightly different if it’s the only places you’re using a row-filtered subscription, but if that’s OK then I can’t think of a reason it wouldn’t work.

  • Have you considered perfmon counters to monitor latency? I’d like to know if there’s any disadvantage with this approach.

    SELECT OBJECT_NAME
    , counter_name
    , RTRIM(instance_name) + '-LogReader' AS instance_name
    , ROUND(cntr_value / 1000,0) AS latency_sec
    FROM master.sys.dm_os_performance_counters
    WHERE OBJECT_NAME LIKE '%Replica%'
    AND counter_name LIKE '%Logreader:%latency%'
    UNION
    SELECT OBJECT_NAME
    , counter_name
    , RTRIM(instance_name) + '-Distributor' AS instance_name
    , ROUND(cntr_value / 1000,0) AS latency_sec
    FROM master.sys.dm_os_performance_counters
    WHERE OBJECT_NAME LIKE '%Replica%'
    AND counter_name LIKE '%Dist%latency%'

    • Kendra Little
      June 16, 2015 12:51 pm

      I haven’t tested this counter for accuracy. Do you know specifically what it’s measuring and have you found it to be accurate? (I think it’s just measuring transfer time / between instances, not overall latency, but I haven’t looked into it much or tested it.)

    • Nalini Devarakonda
      August 15, 2016 3:57 pm

      Hello Yong,

      Can you confirm if this query gives accurate results?

      • Hi Nalini

        I wasn’t able to validate what Kendra asked. So I’m using “sys.sp_replmonitorrefreshjob” as suggested in Medium Replication Monitoring: Notify when Undistributed Commands Rise in the Distribution Database section.

  • Hi Kendra, I do pretty much as you described for a multi-node P2P topology. I like that you call it a canary table – no need to explain what it does with a name like that. P2P replication faults/errors generally mean the system is down until it’s fixed.

    To make it easy to trace from any single node to another, I added a hostname column, as well as the datetime column in my canary table. I insert the output of “select @@servername, getdate()” via a stored proc.

    Nothing spectacular but it’s simple and allows me to check replication from each node in turn. The hostname tells me where it’s from and if I do have a breakdown, I can test the insert on each node.
    It’s also confirmation that I’ve fixed the problem before allowing users back on the system.

  • Very useful information. Thanks!

  • rajeev kumar tiwari
    April 4, 2017 12:08 am

    Basically I liked the output you showed me before; I would like it captured in a table, say hourly, and then at the end of the stored proc we can add quick logic to say: if 1) the # of undistributed cmds > 50000, or 2) the latency is > 1 min, then send an email with the condition, and have this SP running in a job, say every hour. An alert works too, but it is not capturing the result into a table for historical/trending purposes.

  • rajeev kumar tiwari
    April 4, 2017 12:09 am

    Hi Kendra, can you help me?

  • […] of putting together a manual transaction replication health check similar to what Kendra Little outlines here. But instead of just taking the 2 hour billable only route… I took the scripts I fleshed out […]

  • I realize this is an old article, but I’m particularly interested in the link you have to check the number of undistributed commands. The link is broken, and this is exactly what I’m looking for. Do you happen to have the script, or know of something similar? Thanks.

  • Late to the party…I’m new to SQL replication. We implemented replication about a month ago in a test environment. I was looking for different methods to monitor the environment when I came upon this article. I like the concept of the “Canary” table(s).
    Question: To create a canary table for each publication, would I be correct in stating that you would need to add the table to the database that contains the tables you are replicating? In my SQL environment, I’m not sure if I would get access to create new tables in the production database. Can these canary tables exist in a different database on the same server?

