If SQL Server’s Transactional Replication was an animal, it would be a cockroach. It’s not pretty, but it’s resourceful and strong. Once you get even a small amount of replication in your environment it starts spreading everywhere. And if it gets out of control– well, in that case you’ve got no choice but to pretty much bomb everything, clear the fumes as quickly as possible, and reinitialize.
But unlike a cockroach, you can’t always just get rid of Transactional Replication. The bad news is, it’s here to stay.
In this post you’ll get a whirlwind introduction to Transactional Replication. I’ll explain why newer features like Change Tracking, Change Data Capture, and Availability Groups have failed at exterminating Transactional Replication. You’ll also learn what it takes to support replication in a mission critical environment.
SQL Server Transactional Replication basics
Transactional Replication lets you publish and filter individual database objects.
1) Publications define replicated articles. Transactional Replication lets you select individual tables to publish (called “articles”), allowing you to just publish a fraction of a very large database. It also lets you:
- Filter published data:
  - Use column filtering to include only critical columns for published tables, further targeting and limiting exactly what is published.
  - Apply row filtering to put a “where clause” on a publication. This is a great way to keep a subscribing server from seeing rows that don’t belong to it.
- Publish stored procedures. This essentially takes procedure executions on the publisher and replays them on the subscriber.
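As a rough sketch, setting up a filtered publication looks something like this. (Database, publication, and filter names here are made up; the distributor must already be configured and the database enabled for publication.)

```sql
-- Hypothetical example: publish one table from a Sales database,
-- filtering rows so a subscriber only sees its own region.
USE Sales;
GO
EXEC sp_addpublication
    @publication = N'SalesPub',
    @repl_freq   = N'continuous',   -- transactional, not snapshot-only
    @status      = N'active';

EXEC sp_addarticle
    @publication   = N'SalesPub',
    @article       = N'Orders',
    @source_object = N'Orders',
    @type          = N'logbased';   -- replay data changes from the log
                                    -- (@type = N'proc exec' publishes
                                    --  procedure executions instead)

-- Row filter: subscribers only receive rows matching this WHERE clause.
-- (sp_articleview is also needed afterward to build the sync view.)
EXEC sp_articlefilter
    @publication   = N'SalesPub',
    @article       = N'Orders',
    @filter_name   = N'flt_Orders_West',
    @filter_clause = N'Region = ''West''';
```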
2) A fleet of SQL Server Agent jobs does magic. I just learned this: a group of cockroaches is called “an intrusion of cockroaches.” That seems like it probably applies to a whole lot of SQL Server Agent jobs, too.
Exactly how many jobs? So many that you’ll lose count. Replication is complicated. There’s a SQL Server Agent job that reads the transaction log of each database where you’ve defined publications (the Log Reader Agent). Another job generates snapshots to help you initialize or re-initialize replication (the Snapshot Agent). Yet another applies the snapshot and moves transactions along (the Distribution Agent).
There are also jobs for cleaning things up, refreshing monitoring, and all the general scuttling around that replication does.
3) The distribution database collects commands. Replication doesn’t send transactions directly to your subscribers. The Log Reader Agent snoops in your transaction log and figures out what changes have been made to the articles you’ve defined in your publications. When it finds changes, it sends them over to the distribution database.
You can configure distribution on the publishing server, or you can scale it out to an independent server. The important thing to know is that each change you make in your publisher gets translated into independent commands (insert, update, and delete), and these commands get inserted into the distribution database before anything moves onward to subscribers. No matter what you do, that process takes time: replication isn’t instantaneous.
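If you’re curious, you can peek at the backlog yourself on the distributor. This is a rough sketch against the real distribution system tables (the join is simplified; topologies with multiple publishers need more care, and sp_browsereplcmds will decode the actual commands):

```sql
-- Rough look at how many commands are sitting in the distribution
-- database, grouped by published article. Run on the distributor.
USE distribution;
GO
SELECT a.publisher_db,
       a.article,
       COUNT(c.command_id) AS pending_commands
FROM dbo.MSrepl_commands AS c
JOIN dbo.MSarticles AS a
  ON a.article_id = c.article_id   -- simplified: assumes one publisher
GROUP BY a.publisher_db, a.article
ORDER BY pending_commands DESC;
```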
4) Commands get applied to subscribers. You may set up many subscriptions to a single publication. This is part of what’s cool about replication– a single publication can publish a narrow group of tables to be read by many subscribers (perhaps in different datacenters around the world).
You can elect to configure pull subscriptions or push subscriptions. In the “pull” model, each subscriber runs jobs that periodically pull in data. In the “push” model, changes are pushed out to the subscribers from the distributor on a more constant basis.
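A sketch of the difference, with made-up server and database names:

```sql
-- Hypothetical push subscription: the distribution agent runs at the
-- distributor and pushes changes out. Run these at the publisher.
EXEC sp_addsubscription
    @publication       = N'SalesPub',
    @subscriber        = N'REPORTSRV01',
    @destination_db    = N'SalesReporting',
    @subscription_type = N'push';

-- Push also needs a distribution agent job for this subscriber:
EXEC sp_addpushsubscription_agent
    @publication   = N'SalesPub',
    @subscriber    = N'REPORTSRV01',
    @subscriber_db = N'SalesReporting';

-- For pull, you'd instead use @subscription_type = N'pull' at the
-- publisher, then run sp_addpullsubscription and
-- sp_addpullsubscription_agent at the subscriber, so the agent job
-- (and its load) lives on the subscribing server.
```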
5) Old commands get cleaned up. Hopefully. Lots of commands may collect up in the distribution database. That can be a good thing– maybe a subscriber is offline for a few hours, wouldn’t it be sad if it couldn’t catch up?
But you don’t want to hoard too many commands, or your distribution database will bloat up like crazy. There are jobs that delete out old commands. In an active environment, balancing out cleanup, new activity, and pushing data to subscribers can be… challenging.
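Retention is configurable on the distributor with sp_changedistributiondb; the values below are illustrative, not recommendations:

```sql
-- Distribution retention is a balancing act: keep commands long enough
-- for a lagging subscriber to catch up, but not so long the
-- distribution database bloats. Run on the distributor.
EXEC sp_changedistributiondb
    @database = N'distribution',
    @property = N'max_distretention',
    @value    = 72;     -- hours; subscribers further behind than this
                        -- get deactivated and must be reinitialized

EXEC sp_changedistributiondb
    @database = N'distribution',
    @property = N'history_retention',
    @value    = 48;     -- hours of agent history to keep around
```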
6) Then you learn a million factoids. We’ve barely scratched the surface. I haven’t shown you Replication Monitor, a tool whose GUI is so confusing that you don’t dare take your eyes off it. (It’s hard to avoid it, but do use it with care– I’ve had it cause blocking when run by multiple people concurrently in a couple versions of SQL Server.) I haven’t talked about tweaking Replication Agent Profile Settings or the limitations of tracer tokens. We haven’t debated when you might want to initialize from backup.
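Tracer tokens, for example, are posted with a system procedure at the publisher (the publication name here is invented). The limitation is that each token only measures latency at one moment in time:

```sql
-- Post a tracer token to measure latency from publisher to
-- distributor to subscriber. Run in the published database.
DECLARE @token_id int;
EXEC sys.sp_posttracertoken
    @publication     = N'SalesPub',
    @tracer_token_id = @token_id OUTPUT;

-- Wait a bit, then see how long the token took at each hop:
EXEC sys.sp_helptracertokenhistory
    @publication = N'SalesPub',
    @tracer_id   = @token_id;
```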
Transactional Replication is a complicated monster. It’s not something you can book-learn: it’s got far too many configuration options and moving parts. Replication is something you learn by trying it, having problems, fixing them, and then repeating the process.
Transactional Replication vs. Change Tracking
Two new replication-ish features were introduced in SQL Server 2008: Change Tracking and Change Data Capture. These two technologies are both completely separate from replication, and from each other.
Change Tracking, like Transactional Replication, is available in Standard Edition. That’s a good thing, but when people compare the two features, they still tend to pick replication.
Change tracking just answers a very basic question: has a row changed? It doesn’t capture the “old data” or send the “new data” anywhere– it just updates a marker letting you know that a row has changed since your application last polled it.
It’s up to the user to write an application to regularly poll the tables (and know what version it last tracked).
It’s also recommended that you enable snapshot isolation on the database using Change Tracking, and that you explicitly use it when querying Change Tracking tables, to make sure you don’t miss data or cause blocking with Change Tracking cleanup processes. Snapshot isolation is great, but turning it on and managing it are not at all trivial.
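A minimal Change Tracking sketch, with hypothetical database and table names, looks something like this:

```sql
-- Hypothetical setup: track changes to dbo.Orders in a Sales database.
ALTER DATABASE Sales SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE Sales SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);
ALTER TABLE dbo.Orders ENABLE CHANGE_TRACKING;

-- The polling application must remember the version it last synced:
DECLARE @last_sync bigint = 0;   -- persisted somewhere by your app

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
    -- Save this value to use as @last_sync on the next poll:
    SELECT CHANGE_TRACKING_CURRENT_VERSION();

    -- Which rows changed since last time (PK columns + operation only;
    -- assumes OrderID is the table's primary key):
    SELECT ct.OrderID, ct.SYS_CHANGE_OPERATION
    FROM CHANGETABLE(CHANGES dbo.Orders, @last_sync) AS ct;
COMMIT;
```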
Change Tracking also has some additional overhead. For each row modified in a tracked table, a row is added to a change table, and for each transaction that commits, a row is inserted into an internal transaction table as well. (It doesn’t keep the previous values in those tables, so they’re small inserts, but on super active tables that’s extra writes hitting the transaction log of your database– and extra writes when those tables get cleaned up.)
Verdict: Replication is complicated, but Change Tracking ain’t simpler. With much less custom code to write, more freedom with isolation levels, and no extra inserts into internal tables, most people decide that Transactional Replication beats Change Tracking.
Transactional Replication vs. Change Data Capture
How about Change Tracking’s distant cousin, Change Data Capture? This is an Enterprise Edition feature– it must surely perform better than the Standard Edition friendly Transactional Replication, right? Well, sorry, not necessarily.
I love the concept behind Change Data Capture. Imagine that you’re a database administrator managing a transactional replication publication. You have a table that holds advertising campaign keywords.
- Each row in the table contains a CampaignID column, a keyword column, and a BidPrice column
- Some campaigns have hundreds of thousands of keywords
- Tools let advertisers set up campaigns quickly with LOADS of keywords– and the keyword BidPrice
It’s the day before a big holiday. One of the big advertisers goes in and sets up a huge campaign– 200K rows of changes roll into the table. They realize they messed up and deactivate it right away. 200K more changes roll in. Then they set up a new one. Then they tweak the BidPrice.
Suddenly replication is churning away, trying to get all these changes into the distribution database, not to mention handling whatever’s happening in any other replicated tables you have. Replicated commands start backing up.
You’re left saying, “But all I want is the most recent change they made!”
This is the scenario where Change Data Capture sounds like a great idea. The concept with “CDC” is that you write your own polling. If you poll every hour, you just pick up the latest version of each row (or whatever history you want) and take that with you. You don’t have to take every command, so it’s super lightweight.
But CDC isn’t all unicorns and magic, either.
When you enable Change Data Capture for a table, SQL Server starts tracking inserts, updates, and deletes in the transaction log (much like replication). Changes are read from the log and then inserted into “change tables” associated with the tracked tables. The change tables are automatically created in the same database as the tracked tables– and inserts into them are also logged operations. So for every modification you make to a tracked table, you’re also doing an insert (and logging it) into a change table.
Data in the change tables needs to be cleaned up periodically, of course– and that also gets logged. You need to write your own polling mechanism to pull changes, and there are limitations on how you can change the schema of a tracked table.
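Here’s a hedged sketch of what the CDC workflow looks like for our hypothetical campaign keywords table (the table name, and the assumption that it lives in dbo, are invented for the example):

```sql
-- Enable CDC for the database, then for the table.
USE Sales;
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema        = N'dbo',
    @source_name          = N'CampaignKeywords',
    @role_name            = NULL,
    @supports_net_changes = 1;  -- builds the extra index that makes
                                -- "net change" queries possible
                                -- (the performance cost noted above)

-- Your own polling job asks only for the net change per row in a
-- window of the log (default capture instance name is schema_table):
DECLARE @from binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_CampaignKeywords');
DECLARE @to   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_net_changes_dbo_CampaignKeywords(@from, @to, N'all');
```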
Our test scenario seemed like one where CDC might help us out, but after checking out a whitepaper on tuning CDC, things don’t look so hot: the whitepaper recommends avoiding scenarios where rows are updated immediately after insert, and avoiding large batch updates. Also, our dream of pulling only the latest, or “net,” change when a row has been modified multiple times has a cost: it requires an additional index on each change table, which the whitepaper points out can cause a noticeable performance hit. This isn’t seeming like such a great fit anymore– and worse, if it doesn’t go well, it’s going to slow down our whole publishing database.
This comparison highlights a real strength of replication: although transactional replication commands can get backed up in the distributor, distribution can be offloaded to its own server, independent of the publisher.
Verdict: Due to extra logging caused by internal change tables (and cleanup), plus the need to write custom code, plus limitations on schema changes, most people decide that Transactional Replication beats Change Data Capture.
Transactional Replication vs. Availability Groups
Now here’s a real competitor, right? SQL Server 2012 AlwaysOn Availability Groups allow you to scale out reads across multiple SQL Server instances. You can set up readable secondary copies of a database which have their own independent storage, memory, and server resources. The secondaries can be in either a synchronous or asynchronous mode. You can even offload backups to secondaries.
If anything can render Transactional Replication completely outdated and useless, this is the feature.
It’s absolutely true that Availability Groups are better than transactional replication for high availability. But that’s not surprising– Transactional Replication is terrible for high availability! Even in Standard Edition, a simple two node failover cluster beats Transactional Replication when it comes to HA. So, yes, Availability Groups wins here, but replication doesn’t even make second place. (By a long shot.)
But when it comes to scaling out reads, Availability Groups still aren’t perfect. SQL Server 2012 did introduce a cool new feature that creates temporary statistics for read-only databases, which helps, but it doesn’t quite do everything. Availability Group readable secondaries still have some limitations:
- You can’t create indexes on the secondaries– all indexes must be created on the primary
- The more indexes you create on the primary, the heavier your IO load and potential for data churn in maintenance– which makes it harder for secondaries to keep up
- The whole database goes in the availability group. You can’t just include a couple of tables.
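For completeness, here’s roughly what routing reads to a secondary looks like. (The AG, replica, and listener names are invented, and a configured listener is assumed.)

```sql
-- Sketch: allow read-only connections on a secondary replica and
-- route ApplicationIntent=ReadOnly connections to it.
ALTER AVAILABILITY GROUP MyAG
    MODIFY REPLICA ON N'SQL2' WITH (
        SECONDARY_ROLE (
            ALLOW_CONNECTIONS     = READ_ONLY,
            READ_ONLY_ROUTING_URL = N'TCP://sql2.corp.local:1433'));

ALTER AVAILABILITY GROUP MyAG
    MODIFY REPLICA ON N'SQL1' WITH (
        PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'SQL2')));

-- Clients then connect through the listener with read intent:
-- Server=MyAGListener;Database=Sales;ApplicationIntent=ReadOnly;
```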
Verdict: Using Availability Groups for scale out reads works in some scenarios, but it does not replace Transactional Replication.
Transactional Replication doesn’t solve your HA or DR pains
If you remember one thing from this article, make it this fact: Transactional Replication is not a good way to save your bacon if a server fails or a database gets corrupted. This isn’t a feature for high availability and disaster recovery– it’s much more of a programmability feature for distributing data from a few critical tables on one instance out to many subscribers.
Transactional Replication does NOT support any of the following:
- Failover (or fail-back)
- Connection string enhancements
- Load balancing
- Automatic page repair from corruption
- Protection of the replicated data (it can be updated, made incorrect, etc.)
For high availability and disaster recovery, you want to look primarily at other features in SQL Server.
What if you need to write to multiple masters?
If you’ve got a datacenter in China, a datacenter in Europe, and a datacenter in the United States, you probably don’t want to send all of your writes to just one datacenter. You probably want to send writes to the closest local datacenter and have changes synchronize between databases in all of them.
There’s only one feature in SQL Server that helps with this: peer to peer replication. It’s an Enterprise Edition feature which is like two-way transactional replication. Peer to peer replication requires that you handle conflict resolution, and isn’t compatible with all newer features (example: Availability Groups).
Transactional replication has a feature called Updatable Subscriptions that allows changes at a subscriber to be sent back around to the publisher, but the feature was deprecated in SQL Server 2012; the recommendation is to look at Peer to Peer replication instead.
Requirements for supporting replication
I have a confession to make: I like Transactional Replication. There’s something about it which I admire– its flexibility, its tenacity, its ability to endlessly survive and propagate itself through more and more versions of SQL Server while other features continuously try and fail to exterminate it.
I don’t often recommend that clients use Transactional Replication. Frequently, there are other ways to solve a problem without taking on the burden of supporting Transactional Replication. It’s just not that easy to support.
If you do think you need Transactional Replication, make sure you have the right staff in place. For mission critical data, you need to have multiple, dedicated SQL Server DBAs, an established on-call rotation, a good incident response system, root cause analysis, a pre-production environment, good Change Management processes, and the ability and desire to refine and extend production monitoring.
That may sound like a lot, but trust me: the first time someone asks, “Why is the data on the subscriber so old?”, you’ll understand why the list is long. One thing you can count on with replication is that there are sure to be interesting times along the way.