Want to Avoid Deployment Downtime? Replication Probably Isn’t the Answer.


You’re tired of taking outages to deploy changes to your database schema.

Wouldn’t it be cool to have two different SQL Servers so that you could:

  • Point your customer traffic at ServerA
  • Apply your deployment scripts to ServerB (which isn’t taking any customer traffic)
  • Point your customer traffic over to the newly deployed ServerB
  • Apply your deployment scripts to ServerA (which isn’t taking any customer traffic)

Presto! All your servers are up to date and in sync, right? Well, no, not in SQL Server at least.

Illustrating the problem with a simple schema change

We’ll use the Posts table in the Stack Overflow database as an example.

The Posts table holds questions & answers. In the dark days of the original design, the Tags column held up to 5 tags to describe a question. In our new design, we’d like to break Tags out into a separate child table, one tag per row.

If we only have one SQL Server, our deployment script looks like this:

  1. Create a new PostsTags table:
    CREATE TABLE dbo.PostsTags (PostId INT, Tag NVARCHAR(40));
  2. Shut down the app
  3. Populate that table (which is going to require more work than I’m prepared to type out here – see the rough sketch after this list)
  4. Alter the stored procedures that insert/update rows in dbo.Posts so that they now populate the new dbo.PostsTags table
  5. Start the app back up again
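
To make step 3 a little more concrete, here’s a rough, hypothetical sketch of the population query – assuming the Tags column stores values in the Stack Overflow style (like <sql-server><replication>) and that STRING_SPLIT is available (SQL Server 2016 and up). A real script would work in batches, which is exactly why I’m not typing the whole thing out:

  -- Illustrative only: split the denormalized Tags string into one row per tag.
  INSERT INTO dbo.PostsTags (PostId, Tag)
  SELECT p.Id, t.value
  FROM dbo.Posts AS p
  CROSS APPLY STRING_SPLIT(
          REPLACE(REPLACE(p.Tags, N'<', N''), N'>', N'|'), N'|') AS t
  WHERE p.Tags IS NOT NULL
    AND t.value <> N'';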

Why shut down the app? Because people can still be inserting & updating rows in Posts while steps 3 & 4 are happening. If I don’t shut down the application, I’ll need to do some kind of additional coding afterward to detect the dbo.Posts rows that changed while steps 3 & 4 ran, and make the appropriate changes in dbo.PostsTags.

There are absolutely ways around this problem if you have just one SQL Server:

  • Deploy the dbo.PostsTags table early, and modify the code to keep BOTH the dbo.Posts.Tags column and the dbo.PostsTags table in sync with new inserts/updates, or
  • Write triggers to keep both the column and the table in sync, or
  • Write another process (like SSIS, ADF, or your own custom app) to keep them in sync

But none of those are built into SQL Server. They’re all additional development work, not traditionally considered part of database administration.
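
For example, the trigger option from that list might look something like this minimal sketch – illustrative only, using the same assumed <tag1><tag2> format, and skipping the DELETE handling, de-duplication, and error handling a production version would need:

  CREATE OR ALTER TRIGGER dbo.Posts_SyncPostsTags
  ON dbo.Posts
  AFTER INSERT, UPDATE
  AS
  BEGIN
      SET NOCOUNT ON;

      -- Throw away the affected posts' existing child rows...
      DELETE pt
      FROM dbo.PostsTags AS pt
      INNER JOIN inserted AS i ON i.Id = pt.PostId;

      -- ...and rebuild them from the current Tags value.
      INSERT INTO dbo.PostsTags (PostId, Tag)
      SELECT i.Id, t.value
      FROM inserted AS i
      CROSS APPLY STRING_SPLIT(
              REPLACE(REPLACE(i.Tags, N'<', N''), N'>', N'|'), N'|') AS t
      WHERE i.Tags IS NOT NULL
        AND t.value <> N'';
  END
  GO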

Will replication solve this?

The line of thinking is that if we had TWO SQL Servers, then we could just make these changes to a server while it isn’t facing any user traffic:

  1. Create a new PostsTags table:
    CREATE TABLE dbo.PostsTags (PostId INT, Tag NVARCHAR(40));
  2. Populate that table (which is going to require more work than I’m prepared to type out here)
  3. Alter the stored procedures that insert/update rows in dbo.Posts so that they now populate the new dbo.PostsTags table

All the while that these changes are happening, some other SQL Server somewhere is handling all the application traffic – so we don’t have any data consistency problems, right?

Wrong: that other SQL Server is still getting inserts & updates to the Posts table. Replication only keeps the Posts table in sync: it doesn’t translate changes between the Posts table and PostsTags.

Replication COULD be used to translate changes between two different database schemas – but that’s up to you. It’s not as simple as checking a box inside replication. You have to write your own custom replication stored procedures, and that’s left as an exercise for the reader.
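
If you did go down that road, the customization lives in the stored procedures that replication calls on the subscriber for every insert, update, and delete. Here’s a heavily abbreviated, hypothetical sketch of a customized insert proc – the real generated procs take one parameter per published column, so treat the names and the parameter list as illustrative:

  CREATE OR ALTER PROCEDURE dbo.Posts_ReplicationInsert_Custom
      @Id   INT,
      @Tags NVARCHAR(150)   -- match your Posts.Tags definition
      /* ...the rest of the Posts columns would go here... */
  AS
  BEGIN
      SET NOCOUNT ON;

      -- Do what the default proc does: apply the change to Posts.
      INSERT INTO dbo.Posts (Id, Tags /* , ... */)
      VALUES (@Id, @Tags /* , ... */);

      -- The custom part: translate the change into the new schema, too.
      IF @Tags IS NOT NULL
          INSERT INTO dbo.PostsTags (PostId, Tag)
          SELECT @Id, t.value
          FROM STRING_SPLIT(
                  REPLACE(REPLACE(@Tags, N'<', N''), N'>', N'|'), N'|') AS t
          WHERE t.value <> N'';
  END
  GO

You would then point each article at its custom procs (that’s what sp_addarticle’s @ins_cmd, @upd_cmd, and @del_cmd parameters are for) – and repeat the exercise for updates and deletes, on every table whose schema differs between the two servers.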

And if you’re willing to do that…you don’t need replication. Remember back in the last section how I said there were absolutely ways around this problem if you only had one SQL Server? Go back to that section – those options are simpler than introducing replication.

Replication also makes high availability way harder.

That sounds backwards for folks who grew up with other database platforms where replication is a valid solution for high availability and disaster recovery, hahaha.

However, SQL Server’s high availability and disaster recovery mechanisms rely on the transaction log. ServerA and ServerB will both have completely different transaction logs – after all, they’re taking different changes at different times. (One gets live customer data, then gets patched – the other does those things in the opposite order.)

Because they have different transaction logs, you have to protect them each individually. If you’re using Always On Availability Groups for HA/DR protection, that means you would have two separate Availability Groups, likely each with their own set of servers. (You can’t use the same servers if you want to use just one database name: database names have to be unique per server.)

So you’d end up with a convoluted set of replicas for both ServerA and ServerB. You’ve just doubled your server count, patching, and disaster recovery work.

Note that I didn’t say it was impossible.

You could absolutely do all this – and I’ve had one client in the past who actually did. They were even able to reduce their (perceived) outage windows because:

  • Their deployment scripts were extremely disk-intensive: they liked to rewrite entire tables to delete or reorder columns
  • Their storage sucked, so big logged changes took forever
  • They didn’t mind replication getting way out of date (like hours) – as long as users could see any data on the site, they didn’t care if it was accurate or not, so the replication lag during deployments was tolerable
  • They simply didn’t bother with high availability or disaster recovery for the replication systems – they snapshotted entire servers daily, and they were willing to lose a day of data
  • They were willing to massively staff up to write their own custom replication stored procedures – they had a team of 5-6 people who just wrote replication stored procs for deployments (because they were a mature application with hundreds of tables, each of which required their own custom procs when they made schema changes, and Lord knows they loved to make schema changes, but they wanted the replication procs to be extremely well-tested)

You can do all that. Frankly, though, I just wouldn’t bother – use the single-server methods instead, or standardize on only doing additive, not destructive, changes.


15 Comments

  • I have found having replication takes up a large amount of DBAs’ time to manage. Introducing replication is one of life’s big decisions, akin to choosing a partner or a career.

  • Pierre Letter
    January 6, 2021 8:22 am

    Reading this, another post came to mind. I can’t remember who made me look at it and I wonder if it isn’t you, Brent. But Michael J Swart wrote a post a few years back on how to do 0 downtime deployments, talking about database state. Here is the link: https://michaeljswart.com/2018/01/100-online-deployments/
    And yeah, it does involve the dev team doing stuff smartly, not (only) the DBA.

  • Wait, where are the screenshots where you set up replication? 😉 I had to touch replication on Monday. Monday was not a good day.

  • Couple things:
    * Suhail: saying repl takes up large amounts of time may be your experience, but isn’t everyone’s. Simple tran repl, in my experience, is rock solid. I have some rules though…never set it up with the ui, always have it scripted, avoid snapshotting
    * clearly the best option is to not do things in sql that will cause long duration migrations or excessive locking. This requires your devs to do things smartly. What Brent proposes above is sometimes called the “deprecation model” where we keep 2 versions of schema up to date until the older gets deprecated. Works great for APIs, isn’t always a usable solution for databases.
    * I actually built a replication-based “schema evolution” system I called Darwin and it was used in production for HUGE hospital systems, and regardless of the size of the change we could “evolve” a schema with about 7 seconds of downtime.

    Here’s the tldr. Let’s assume you need to change dbo.Tran.BigCol from decimal(7,1) to decimal(20,1). If you’ve ever done this you know the entire table needs to be rewritten.
    – create stage and deprecation schemas
    – apply the new table script to stage schema with the new col width
    – do intra-database tran repl from dbo.Tran.BigCol to stage.Tran.BigCol. make sure the snapshot agent is set to database snapshot to avoid the long term locking issues
    – change whatever things you need to change to support the new col size…views, procs, whatever, but do it in stage schema. You’ll also need to apply GRANT scripts to Stage objects
    – test as long and as much as you like
    – when ready to apply the change to “prod”…with automation I can make this 7 seconds of downtime, regardless of the size or quantity of changes
    – shut down your app, shutdown repl:
      alter schema Deprecated transfer dbo.Tran
      alter schema dbo transfer stage.Tran
    – run the same alter schema transfer for the other dependent objects
    – run grant statements
    – teardown replication.
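
    A rough sketch of that cutover, sticking with the hypothetical dbo.Tran example (bracketed below since TRAN doubles as a T-SQL keyword), and re-granting permissions because explicit permissions get dropped when an object is transferred to a new schema:

      -- One-time setup: the holding schemas.
      CREATE SCHEMA Stage AUTHORIZATION dbo;
      GO
      CREATE SCHEMA Deprecated AUTHORIZATION dbo;
      GO

      -- During the brief outage window, swap the tables by schema transfer.
      ALTER SCHEMA Deprecated TRANSFER dbo.[Tran];   -- park the old table
      ALTER SCHEMA dbo TRANSFER Stage.[Tran];        -- promote the staged one

      -- Re-apply permissions; AppReaderWriter is an illustrative role name.
      GRANT SELECT, INSERT, UPDATE, DELETE ON dbo.[Tran] TO AppReaderWriter;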

    if you don’t like tranrepl due to HA concerns…no problem…use whatever mechanism you want to keep dbo in sync with stage…a sql agent job works well.

  • -create the new and changed table in stage schema
    -write the migration script. this can be run on a schedule, migrating just the latest changes, or with a custom sp_MSIns proc, etc.
    -when ready to migrate, shut everything down, do the alter schema transfer commands. done. should be about 7 seconds. that’s about how long it takes to run the transfer and grant scripts.

    the magic is you are continuously running the migration script. how you do that is up to you. but it needs to process the data incrementally.
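
    Tying that back to the PostsTags example above, an incremental, re-runnable pass might look roughly like this sketch – run it on a schedule until it stops finding work (it doesn’t catch posts whose Tags changed after they were copied; a real version would need to handle those too):

      -- Illustrative: copy the next batch of posts whose tags haven't been
      -- split into dbo.PostsTags yet.
      DECLARE @BatchSize INT = 5000;

      INSERT INTO dbo.PostsTags (PostId, Tag)
      SELECT batch.Id, t.value
      FROM (
          SELECT TOP (@BatchSize) p.Id, p.Tags
          FROM dbo.Posts AS p
          WHERE p.Tags IS NOT NULL
            AND NOT EXISTS (SELECT 1 FROM dbo.PostsTags AS pt WHERE pt.PostId = p.Id)
          ORDER BY p.Id
      ) AS batch
      CROSS APPLY STRING_SPLIT(
              REPLACE(REPLACE(batch.Tags, N'<', N''), N'>', N'|'), N'|') AS t
      WHERE t.value <> N'';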

    • Dave – right, exactly, the “magic” is that you have to *WRITE* a migration script that is idempotent and deals with changes. That isn’t magic. That’s hard work.

      • but, I’m confused. You always have to write a migration script (or SSIS or whatever) right? nothing will do that magically. How you execute that script is up to you. maybe a scheduled task is best. my point is it can be embedded in the repl procs (probably not the best way) and then you get idempotency at a lesser cost (maybe, depending on if you snapshot).

        Looking at the title of your post, you are calling out that tranrepl isn’t a great approach to limit deployment downtime. I don’t see any way to do the migration without the migration script. if you want to limit downtime that script needs to be idempotent and do things in batches. Where I see tranrepl as useful is allowing me to have a near real-time sync of data so I can do these VERY RARE db evolutions where I need to do blue/green.

        • Not necessarily idempotent, though. Most folks write their migration scripts as run-once operations, not idempotent. Idempotent is way, way harder.

          • I should qualify though – I actually wouldn’t be surprised if all your deployment scripts are idempotent because you’re savvy as hell. You’re a good example of the kind of diligence it takes to do this right.

  • I agree with you Dave. I should qualify that my statement was geared toward the UI point-and-click crowd who think implementing replication incurs little management cost. You have a healthy view of replication, and the steps you listed provide actual data points on what it takes to manage replication. You also have the right mindset of automating your work and being diligent in testing.

  • thanks for the kind words, and you are right, all of my scripts are idiot-potent.

    Any time you have a large table that needs “a migration script” that may take some time or take unacceptable locks or might rollback b/c you blew out tempdb or whatever, due diligence has to be taken to ensure mitigation. I saw your recent post about transactional batch updates. That should be part of any good tsql developer’s learning curriculum. And I’d like to see more of those posts. Things like “destructive reads” are not understood by enough database programmers. We need more education around how to write good migration scripts (batch scripts, etc). If I had a nickel for every migration script I’ve seen that tried to update a col in a table, worked in every environment but prod, failed in prod b/c the transaction rolled back due to tran log space, and the response by the developer was “it worked on my test db which has 100 rows in the table”…I’d have enough for a Model X. Gnaw mean?

  • Mike Kierstead
    July 22, 2021 5:13 am

    One of the hardest things some teams have to learn is that your service does not need 100% uptime or even 5 9s. I get that things can cost millions in lost revenue per hour. I would rather accept a predictable amount of revenue loss for a schema patch I can schedule than deal with a catastrophic malfunction in an over-engineered “HA” deployment solution with company-ending data loss.

    I managed to avoid replication until an AzureSQL migration project. I had the “Oh, this is why people hate this feature” moment many times.

