“We want to archive, but we still want to query the data.”


At least once a month, I get this question from a client:

We have big data, and we want to save money. We want to move the older data down to some kind of archives that will cost us less.

I ask exactly one followup question:

Are you willing to modify the app that queries the data, or do you want the users to still be able to query the old data in their existing screens and reports?

If you’re willing to modify the app – like adding an “archive reports” section for when users wanna query the older data, one that uses a different connection string pointing to an archival server – then great! It’s easy. Just put together a cheap 4-core SQL Server Standard Edition VM with crappy storage, and archive the data over to that. I have all kinds of tips to make that easier and cheaper.

But you don’t wanna modify the app.

You want to still be able to use the same screens and same reports. When you wanna run an all-history trending report, or find Acme’s order from 2002, you wanna use the exact same app and same reports.

You just want old data to magically be cheaper.

Table partitioning doesn’t do that. In theory, table partitioning lets you put new data on expensive/fast storage, and old data on cheap/slow storage. In practice:

  • None of your storage connected to production servers is really that cheap.
  • The data’s still inside the same database, so maintenance jobs like backups, high availability, corruption checking, and statistics updates are still time-intensive, and development servers are still huge.
  • As data ages, and you wanna move it to cheap/slow storage, that’s a row-by-row logged & locked operation. The bigger your data is, the uglier the process.
  • The index designs have to be the same across all partitions. (You can do some fancy stuff with filtered indexes, but it’s a giant pain in the keister.)
  • Table partitioning comes with a lot of design & implementation considerations, basically requiring rewriting the whole table and all its indexes.
  • Every single one of your queries has to have a WHERE clause filter on the partitioning column (which is inevitably a date range in these scenarios). If not, query performance gets worse because SQL Server can’t eliminate the old partitions.
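To make those design costs concrete, here’s a minimal partitioning sketch. All object, column, and filegroup names here are hypothetical – this isn’t anyone’s production schema, just the shape of the work involved:

```sql
-- Hypothetical example: partition dbo.Orders by year, mapping older
-- years to a slower filegroup.
CREATE PARTITION FUNCTION pf_OrderYear (datetime2)
    AS RANGE RIGHT FOR VALUES ('2023-01-01', '2024-01-01', '2025-01-01');

CREATE PARTITION SCHEME ps_OrderYear
    AS PARTITION pf_OrderYear
    TO (SlowFG, SlowFG, FastFG, FastFG);  -- old years on cheap storage

-- The clustered index (and every aligned index) must include the
-- partitioning column - part of the "rewrite the whole table" cost.
CREATE CLUSTERED INDEX cx_Orders
    ON dbo.Orders (OrderDate, OrderId)
    ON ps_OrderYear (OrderDate);

-- Partition elimination only kicks in when queries filter on OrderDate:
SELECT OrderId, CustomerId
FROM dbo.Orders
WHERE OrderDate >= '2025-01-01' AND OrderDate < '2026-01-01';
```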

Partitioned views kinda do it – but not really. They suffer from most of the above disadvantages too, with a couple of interesting exceptions:

  • The data can be in different databases, so old stuff can be in an Archive database. You can seal that one as read-only when you’re not transferring data over from the live database. That makes backups, high availability, corruption checking, and statistics updates faster.
  • Indexes can easily be wildly different for archives – like columnstore, or tons of indexes without sacrificing the live data’s write performance.

But the rest of partitioning’s disadvantages still apply, and you’ll wanna review those, because they usually end up as showstoppers.
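For illustration, here’s what a cross-database partitioned view can look like. The database, table, and column names are all hypothetical, and each base table needs a CHECK constraint on the date column so SQL Server can skip the side it doesn’t need:

```sql
-- Sketch: a partitioned view spanning a live and an archive database.
-- Each table carries a CHECK constraint on OrderDate so SQL Server
-- can eliminate the wrong side at query time.
CREATE VIEW dbo.Orders
AS
SELECT OrderId, OrderDate, CustomerId
FROM LiveDB.dbo.OrdersCurrent       -- CHECK (OrderDate >= '2024-01-01')
UNION ALL
SELECT OrderId, OrderDate, CustomerId
FROM ArchiveDB.dbo.OrdersArchive;   -- CHECK (OrderDate <  '2024-01-01')
```

ArchiveDB is the one you’d seal as read-only between data transfers, and where you’re free to pile on columnstore or extra nonclustered indexes without hurting live writes.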

Linked servers kinda do it – but not really either. In theory, you can put the old data on a cheap Standard Edition VM with less frequent backups, less high availability, and maybe even no disaster recovery. You replace the old production table’s name with a view that does a union-all across the current live data, plus a reference to the linked server’s archival data. In practice:

  • If any query doesn’t have the partitioning column (like date) as part of its WHERE clause, performance can be spectacularly bad, because SQL Server doesn’t always push the filtering down to the archival server. Sometimes it brings all of the archive data across the network wire over to the production server, and does the filtering there.
  • Linked server data isn’t cached. When multiple users run multiple queries, even simultaneously, their query data isn’t shared or cached. It’s fetched from the archive server over the network into your production server’s memory, taking up space every time, replacing stuff that your production server used to cache.
  • As data ages, and you wanna move it to cheap/slow storage, this is really slow since we’re moving across servers now, and writing to a really slow destination.

I’ve seen a lot of people try to use the linked server approach. They test a bunch of queries, work to make sure they don’t touch the linked server, and then go live – only to get a horrible surprise about the queries they didn’t test. It only takes a handful of frequently-called queries accidentally fetching data from the linked archive server to bring your production box to its knees.
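To make the approach concrete, here’s a hedged sketch of the view swap. The server, database, table, and column names are all hypothetical – the point is just that the app keeps using the old table name:

```sql
-- Rename the real table, then put a view in its place so the app
-- doesn't need to change. ARCHIVESRV is a hypothetical linked server.
EXEC sp_rename 'dbo.Orders', 'OrdersCurrent';
GO
CREATE VIEW dbo.Orders
AS
SELECT OrderId, OrderDate, CustomerId
FROM dbo.OrdersCurrent
UNION ALL
SELECT OrderId, OrderDate, CustomerId
FROM ARCHIVESRV.ArchiveDB.dbo.OrdersArchive;  -- four-part linked server name
GO
-- Danger zone: without a WHERE on OrderDate, SQL Server may drag the
-- entire remote table across the network before filtering locally.
```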

Try the linked server approach first.

If you still wanna try archival after reading the above limitations, then try the linked server approach. It’s the one that will produce the most business value if you can get it to work. Here’s how to approach it conceptually:

  • Set up a development environment with two servers: “production” and “archive”
  • Do the development work to replace the production tables with views that union-all across your current production data, plus the archival data on the linked server
  • Archive data over to the archival server, taking it out of production – you wanna accurately simulate the data distribution that you’re going to see in the real-life environment
  • Capture a trace of activity in real-life production for a business day
  • Set up a trace of activity on the development “archive” server
  • Replay the production trace on the development “production” server
  • Stop the trace of activity on the development “archive” server, and review its contents. Those are the queries that are going to hit your development “archive” server in real life. If you’ve done your work correctly, few queries will be hitting that server – because if they are, then you’re frequently querying archive data, and what was the point of all this?
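The capture on the development “archive” server can be done with an Extended Events session along these lines. The session and file names are hypothetical, and you may want to add filters for your environment:

```sql
-- Sketch: catch every query that arrives at the dev "archive" server
-- during the replay. Anything this session records is a query that
-- would hit your real archive server in production.
CREATE EVENT SESSION ArchiveHits ON SERVER
ADD EVENT sqlserver.sql_batch_completed (
    ACTION (sqlserver.client_hostname, sqlserver.sql_text)
),
ADD EVENT sqlserver.rpc_completed (
    ACTION (sqlserver.client_hostname, sqlserver.sql_text)
)
ADD TARGET package0.event_file (SET filename = N'ArchiveHits.xel');
GO
ALTER EVENT SESSION ArchiveHits ON SERVER STATE = START;
-- ...run the replay, then STATE = STOP and review the .xel file.
```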

It’s a lot of work – but it sure is better than yolo’ing your way through this and discovering how bad it’s going to be after you’ve already archived in production, and you have to move it all back.


11 Comments

  • Thomas Pullen
    May 7, 2026 2:09 pm

    Brent, I love your blog, but how about a dark mode knob? IYGM 😉

  • Eric Russell
    May 7, 2026 2:28 pm

    This is what I have done for one data archival project where the requirement is exactly as you described:
    “We want to archive, but we still want to query the data.”

    The server in question is more of a data mart for ad-hoc reporting. What I did is create a scheduled job that periodically executes a PowerShell script to export data > X months old from on-prem tables to parquet files (leveraging the open source parquetsharp library), and AzCopy to upload them to an Azure cold storage container. The archived dataset is then deleted (after some queries confirm the data was successfully copied).

    Each Azure folder contains historical data for a specific table, and I created PolyBase external tables linked to each folder. Users can query the base tables if they only need current data, or they can query views that union the on-prem tables with their respective historical external tables.

    It’s a cheap and functional solution if you only need occasional access to the historical data, but it’s admittedly a high latency option. To improve performance, you can stack something like DataLake or Azure Synapse on top of the storage container, but that costs more money. Also, Fabric can do all this for you end to end, but again more money.
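    A minimal sketch of that pattern – all object names and the storage URL are hypothetical, SQL Server 2022 PolyBase syntax is assumed, and credential setup is omitted:

```sql
-- External data source over the Azure cold storage container,
-- plus a Parquet file format for the archived files.
CREATE EXTERNAL DATA SOURCE AzArchive
WITH (LOCATION = 'abs://archive@mystorageacct.blob.core.windows.net');

CREATE EXTERNAL FILE FORMAT ParquetFmt
WITH (FORMAT_TYPE = PARQUET);

-- One external table per archived folder:
CREATE EXTERNAL TABLE dbo.OrdersArchive (
    OrderId int, OrderDate datetime2, CustomerId int
)
WITH (LOCATION = '/Orders/', DATA_SOURCE = AzArchive,
      FILE_FORMAT = ParquetFmt);

-- Users who need history query a view unioning live and archived rows:
CREATE VIEW dbo.OrdersAll AS
SELECT OrderId, OrderDate, CustomerId FROM dbo.Orders
UNION ALL
SELECT OrderId, OrderDate, CustomerId FROM dbo.OrdersArchive;
```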

  • I only had this “archive” problem in one place, but to get filtering to work over the linked server consistently, I’ve had to resort to either dynamic SQL with OPENQUERY or nested stored procedures.

    I’ve worked at other places where the “archive” was just a partitioned view, all in the same database. Its point was to stop the (at-the-time crappy) reporting procedures, which almost without exception escalated to table locks, from escalating into the hot part of the table.

  • justa sqlguy
    May 7, 2026 4:09 pm

    i remember there was a feature in sql server that did just that and the beta tested really well. Make some config changes, grant some permissions and you’re off to the races. Cold data in azure, with compute (supported projections and predicate pushdown iirc) so it’s not always a full download with each query. No app change was killer. Too bad greed got in the way and they priced it stupid. Everybody I talked to about it during beta loved it. Nobody deployed in prod after launch because of the price. Good old days when microsoft was MBA-ed to death.

  • In my opinion, it’s always important to go back to basics. People often confuse what they want with what they need. Most of the optimizations I undertake begin with figuring out what’s really going on, recognizing that clients often arrive with an already-thought-out version of the solution, trying to lull us into thinking we’re in the dark.
    In a sense that’s natural, and it’s the case here, where the discussion moves toward proposing various solutions. Understanding the real needs, the peace of mind that less data provides, which data is truly necessary, and the purpose of the transactional system – its need for balance and its place in the ecosystem – is not an easy task, but it’s worth the effort.

  • Simon Holzman
    May 7, 2026 4:43 pm

    Would using Views work ?

    Have a view just looking at the current year’s data where the data is stored in year based tables so it just queries the current year table, and another one (possibly in a different database) performing a UNION on all the annual tables.

    Or is this just me trying to finagle a less functional version of table partitioning ?

  • TheTrueSolutionIsAlwaysInTheComments
    May 7, 2026 5:03 pm

    Or just write into your terms of service agreement, in tiny print somewhere about 2/3 of the way down where no one will read it, that data will only be retained for a year – and purge that year-old data like it’s infected every weekend.

  • Thomas Franz
    May 7, 2026 5:10 pm

    I can’t fully agree with “As data ages, and you wanna move it to cheap/slow storage, that’s a row-by-row logged & locked operation. ” in the partitioned table part.

    If you end up having to do that, you didn’t plan in advance. I create a new set of filegroups / files every year that are used as the target for the new partitions.

    To move the (c)old data onto a slower storage, I
    * set the filegroup to read_only
    * copied the file(s) to the new location (can take quite a while for larger databases)
    * run something like ALTER DATABASE my_db MODIFY FILE (NAME = my_db_2024, FILENAME = 'd:\slow_mountpoint\my_db\my_db_2024.ndf') to indicate that the file will be at that location later
    * took the database offline / online (to use the new location)
    * deleted the no longer used file at the old location
    * optional: set the filegroup to read_write again

    You will need a very short exclusive lock on the database three times (to set the filegroup read-only, to set the db offline, and to set the filegroup read-write again). The long/slow copying happens while the db is available (just ensure that nothing tries to change any of the old data – it will fail).
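    That sequence, sketched as T-SQL – the database, filegroup, and file names are hypothetical, and the file copy itself happens at the OS level:

```sql
-- Step 1: freeze the old year's filegroup (short exclusive lock).
ALTER DATABASE my_db MODIFY FILEGROUP fg_2024 READ_ONLY;

-- Step 2: copy my_db_2024.ndf to d:\slow_mountpoint\ with the OS,
-- while the database stays online.

-- Step 3: tell SQL Server where the file will live after restart.
ALTER DATABASE my_db MODIFY FILE
    (NAME = my_db_2024,
     FILENAME = 'd:\slow_mountpoint\my_db\my_db_2024.ndf');

-- Step 4: bounce the database so the new path takes effect.
ALTER DATABASE my_db SET OFFLINE WITH ROLLBACK IMMEDIATE;
ALTER DATABASE my_db SET ONLINE;

-- Step 5: delete the old .ndf at the original location, then optionally:
ALTER DATABASE my_db MODIFY FILEGROUP fg_2024 READ_WRITE;
```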

    • Thomas – that’s an interesting technique, but yeah, 3 exclusive locks on the entire database, plus the complexity of doing this with Availability Groups, database mirroring, log shipping, etc – that’s a lot to do there.

  • Robert Sterbal
    May 7, 2026 6:25 pm

    This is one case where I would argue for Brent’s setup, and I’d recommend using replication to transfer the data in near real time. It’s one of the rare tasks where I’d recommend replication at all.

