In the last episode, I talked about knowing when it’s time to scale out: identifying when our data is getting to be so large that we have to split it across multiple servers, and I explained why that is so challenging. But what about table partitioning – SQL Server’s ability to break up a single table into smaller ones on the same server?
Great volumes have been written about table partitioning. It’s a complex feature, and you can read for days just to understand how you might apply it. But will it improve performance for you? Table partitioning produces great benefits for some applications, but causes giant headaches for others.
What Table Partitioning Does
Table partitioning allows tables or indexes to be stored in multiple physical sections: a partitioned index is like one large index made up of multiple little indexes. Each chunk, or partition, has the same columns – just a different range of rows. In theory, SQL Server handles this transparently for you: a partitioned heap, clustered index, or non-clustered index can be referenced as a single structure although it’s stored in independent physical partitions. In other words, in theory you don’t need to change any code in the calling applications. (Is this true in reality? More on this later.)
Let’s say we want to partition the Stack Overflow Posts table – that’s where the questions and answers are stored. Every question and answer is considered a Post. I’m going to use the 50GB 2013 version of the Stack Overflow database because the Posts table is starting to grow: it’s 17,142,169 rows taking up 37GB of space on disk – that’s most of the database, so breaking it up will be a performance boost, right?
Let’s say we want to partition them by CreationDate, and we want to store each year’s data in a separate file. We’ll need to start by creating filegroups for each year:
USE [master]
GO

ALTER DATABASE [StackOverflow2013] ADD FILEGROUP [History_2008]
GO
ALTER DATABASE [StackOverflow2013] ADD FILE (
    NAME = N'History_2008A',
    FILENAME = N'M:\MSSQL\Data\History_2008A.ndf',
    SIZE = 524288KB, FILEGROWTH = 524288KB
) TO FILEGROUP [History_2008]
GO

ALTER DATABASE [StackOverflow2013] ADD FILEGROUP [History_2009]
GO
ALTER DATABASE [StackOverflow2013] ADD FILE (
    NAME = N'History_2009A',
    FILENAME = N'M:\MSSQL\Data\History_2009A.ndf',
    SIZE = 524288KB, FILEGROWTH = 524288KB
) TO FILEGROUP [History_2009]
GO

ALTER DATABASE [StackOverflow2013] ADD FILEGROUP [History_2010]
GO
ALTER DATABASE [StackOverflow2013] ADD FILE (
    NAME = N'History_2010A',
    FILENAME = N'M:\MSSQL\Data\History_2010A.ndf',
    SIZE = 524288KB, FILEGROWTH = 524288KB
) TO FILEGROUP [History_2010]
GO

ALTER DATABASE [StackOverflow2013] ADD FILEGROUP [History_2011]
GO
ALTER DATABASE [StackOverflow2013] ADD FILE (
    NAME = N'History_2011A',
    FILENAME = N'M:\MSSQL\Data\History_2011A.ndf',
    SIZE = 524288KB, FILEGROWTH = 524288KB
) TO FILEGROUP [History_2011]
GO

ALTER DATABASE [StackOverflow2013] ADD FILEGROUP [History_2012]
GO
ALTER DATABASE [StackOverflow2013] ADD FILE (
    NAME = N'History_2012A',
    FILENAME = N'M:\MSSQL\Data\History_2012A.ndf',
    SIZE = 524288KB, FILEGROWTH = 524288KB
) TO FILEGROUP [History_2012]
GO

ALTER DATABASE [StackOverflow2013] ADD FILEGROUP [History_2013]
GO
ALTER DATABASE [StackOverflow2013] ADD FILE (
    NAME = N'History_2013A',
    FILENAME = N'M:\MSSQL\Data\History_2013A.ndf',
    SIZE = 524288KB, FILEGROWTH = 524288KB
) TO FILEGROUP [History_2013]
GO
You’ll notice in this example that I’m using the same drive (M:) for all of the files – in partitioning, though, it’s common to put the different years of data on different drives, and even stripe the data across multiple files. Let’s keep it simple for this demo, though.
Next, I need to create a partition function and a partition scheme to tell SQL Server which data belongs on which drive/file:
USE StackOverflow2013;
GO

CREATE PARTITION FUNCTION pf_CreationDate (DATETIME)
AS RANGE RIGHT FOR VALUES
    ('20090101', '20100101', '20110101', '20120101', '20130101');

CREATE PARTITION SCHEME ps_CreationDate
AS PARTITION pf_CreationDate
TO (History_2008, History_2009, History_2010,
    History_2011, History_2012, History_2013);
Partitioning pros will note that I’m violating all kinds of best practices. Partitioning pros should also note that I’m trying to teach the basics of partitioning in a really short amount of space. Partitioning pros should go write their own blog posts.
Finally, we move the table over onto the new storage structure:
/* Have to drop the existing clustered index before we can add a new one: */
ALTER TABLE dbo.Posts DROP CONSTRAINT PK_Posts_Id;

/* Create a new partitioned one - note for now it's not the primary key: */
CREATE UNIQUE CLUSTERED INDEX CLIX_CreationDate_Id
    ON dbo.Posts(CreationDate, Id)
    ON ps_CreationDate(CreationDate);
To pull that off, I had to drop the primary key.
This table – like most tables – has a clustered primary key, and SQL Server doesn’t let me just move the primary key over to the new storage structure. Instead, I have to:
- Remove any foreign key relationships that reference this table
- Drop the primary key
- Create a new clustered index on the partition scheme (which splits the data across multiple files)
- Create a primary key again
- Recreate the foreign key relationships
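A rough sketch of the last two steps – the nonclustered primary key shape and the Comments foreign key here are illustrative, not a tested migration script:

```sql
/* Sketch: re-add the primary key after the partitioned clustered index exists.
   Note: a nonclustered PK on Id alone is NOT aligned to the partition scheme,
   because it doesn't include the partitioning key (CreationDate). */
ALTER TABLE dbo.Posts
    ADD CONSTRAINT PK_Posts_Id PRIMARY KEY NONCLUSTERED (Id);

/* Then recreate any foreign keys that referenced dbo.Posts, for example: */
ALTER TABLE dbo.Comments WITH CHECK
    ADD CONSTRAINT FK_Comments_Posts
    FOREIGN KEY (PostId) REFERENCES dbo.Posts (Id);
```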
For more details about why I had to do all that, see this excellent Stack Exchange answer by Remus Rusanu, and make sure to read the linked Microsoft article, Special Guidelines for Partitioned Indexes.
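Once the table is on the partition scheme, a quick metadata query shows how the rows landed in each filegroup – a sketch using the catalog views:

```sql
/* Row counts and filegroups per partition of the new clustered index */
SELECT p.partition_number,
       fg.name AS filegroup_name,
       p.rows
FROM sys.partitions AS p
JOIN sys.indexes AS i
    ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.destination_data_spaces AS dds
    ON dds.partition_scheme_id = i.data_space_id
   AND dds.destination_id = p.partition_number
JOIN sys.filegroups AS fg
    ON fg.data_space_id = dds.data_space_id
WHERE p.object_id = OBJECT_ID(N'dbo.Posts')
  AND i.name = N'CLIX_CreationDate_Id'
ORDER BY p.partition_number;
```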
See, partitioned indexes come with a whole lot of gotchas:
- What I just did would have involved a lot of transaction log activity and ugly downtime
- A real-life implementation requires reading the documentation (and it’s big)
- You really need to load data with partition switching
- To do that, your nonclustered indexes need to be aligned (use the partition scheme)
- Fixing non-aligned indexes requires dropping & recreating them
- When handling concurrency, locking on partitioned tables is tricky
- Merging & switching partitions can have surprise effects
- If the WHERE clause of your query doesn’t include the partitioning key, query performance usually gets worse
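To make the partition-switching point concrete: loading via SWITCH means staging new rows in a table with an identical structure on the target filegroup, then swapping it in as a metadata-only operation. A minimal sketch – the staging table name, trimmed column list, and constraint are illustrative only:

```sql
/* Sketch: load 2013 data through a staging table, then switch it in.
   The real staging table must match dbo.Posts exactly: columns, indexes,
   compression settings, and it must live on the target filegroup. */
CREATE TABLE dbo.Posts_Staging_2013 (
    Id INT NOT NULL,
    CreationDate DATETIME NOT NULL
    /* ...remaining Posts columns... */
) ON History_2013;

/* The CHECK constraint proves to SQL Server that every staged row
   belongs in the target partition: */
ALTER TABLE dbo.Posts_Staging_2013
    ADD CONSTRAINT CK_Staging_2013
    CHECK (CreationDate >= '20130101' AND CreationDate IS NOT NULL);

/* ...bulk load the staging table, build matching aligned indexes, then: */
ALTER TABLE dbo.Posts_Staging_2013
    SWITCH TO dbo.Posts PARTITION 6;  /* 2013 = 6th partition (RANGE RIGHT) */
```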
<sigh>
Partitioned tables work, but…they’re a lot of hard work.
So should you partition your tables?
If you’re dealing with a 1TB+ data warehouse that can do sliding window loads, possibly.
If you’re dealing with a smaller one, or you can’t do sliding window loads, partitioned views are usually a better fit.
If you’re doing transactional workloads, and you can’t guarantee that the partitioning key will be in most of your WHERE clauses, then no, partitioning isn’t a good fit. Plain ol’ index and query tuning will be a better match.
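For contrast, a partitioned view is just plain tables glued together with UNION ALL, plus CHECK constraints so the optimizer can skip the tables that can’t match. A minimal sketch with two hypothetical yearly tables:

```sql
/* Sketch: yearly tables with CHECK constraints on the partitioning column */
CREATE TABLE dbo.Posts_2012 (
    Id INT NOT NULL,
    CreationDate DATETIME NOT NULL,
    CONSTRAINT PK_Posts_2012 PRIMARY KEY (Id, CreationDate),
    CONSTRAINT CK_Posts_2012
        CHECK (CreationDate >= '20120101' AND CreationDate < '20130101')
);
CREATE TABLE dbo.Posts_2013 (
    Id INT NOT NULL,
    CreationDate DATETIME NOT NULL,
    CONSTRAINT PK_Posts_2013 PRIMARY KEY (Id, CreationDate),
    CONSTRAINT CK_Posts_2013
        CHECK (CreationDate >= '20130101' AND CreationDate < '20140101')
);
GO

/* The view looks like one table; a query filtering on CreationDate
   only touches the tables whose CHECK constraints can match. */
CREATE VIEW dbo.PostsAll
AS
SELECT Id, CreationDate FROM dbo.Posts_2012
UNION ALL
SELECT Id, CreationDate FROM dbo.Posts_2013;
```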
Comments
I’ve only come across two real-world situations where partitioning actually made sense. It’s a TON of work, but in the end it significantly reduced my query times for that specific table.
Both times involved a data warehouse with data spread over decades. Both tables were over a terabyte of data. All queries within those specific tables utilized the date fields in the WHERE clause. Query times dropped by orders of magnitude and partitioning was ultimately worth the price of admission, but it’s sure a pain when dealing with backups and restores now.
I partition the index in the ReportServer catalog that stores the blob data for Power BI reports and then store the filegroup on a different volume. Wish MS would let you put the blob data for the OLAP data sources in another database…
@Keith: what performance benefit does this provide?
I’ve also done this a few times, e.g. using a smart insert/update/delete trigger. It works pretty well and can easily reduce performance issues. With the amount of big data out there today, this is essential.
Thank you for the excellent post!
A common misconception is that partitioning is mostly intended for performance optimization. Actually, the most useful application of partitioning is for easy archiving of large sets of data (i.e. switch-out & switch-in of partitions).
Thanks for making me crack up with this paragraph: “Partitioning pros will note that I’m violating all kinds of best practices. Partitioning pros should also note that I’m trying to teach the basics of partitioning in a really short amount of space. Partitioning pros should go write their own blog posts.”
🙂
On SQL Server 2017 Std, I have an SSIS package that runs once a day on a 24×7 mission-critical operational database to archive the largest transactional tables. The data is moved to tables on a separate archive database and the process runs smoothly with minimal locking contention on the live db (deletes are spread over a long period of time). Having a separate archiving database gives us many advantages: back up databases separately; refresh testing environments with only current data; improve query performance on current data, among others. The only drawbacks were the code change (to add logic to union in archived data when requested by the user in the UI) and a slower cross-database query when needing to fetch archived data.
We are upgrading to Enterprise edition and considering table partitioning for data archiving, but the more I read about it the more worried I get in relation to the implementation and maintenance overhead. The only real advantage I can see so far is the transparency to the application, not requiring code changes.
Did any of you use table partitioning for data archiving? Is it really worth it on OLTP systems? I got the point mentioned above about doing it for OLAP. Thanks very much in advance.
Marcelo – for general questions, your best bet is a Q&A site like https://dba.stackexchange.com or SQLhelp on Twitter: https://www.brentozar.com/go/sqlhelp
Thanks million for the tip! 😉
Can I suggest a 4th example here?
“So should you partition your tables?
If you’re dealing with a 1TB+ data warehouse that can do sliding window loads, possibly.
If you’re dealing with a smaller one, or you can’t do sliding window loads, partitioned views are usually a better fit.
If you’re doing transactional workloads, and you can’t guarantee that the partitioning key will be in most of your WHERE clauses, then no, partitioning isn’t a good fit. Plain ol’ index and query tuning will be a better match.
”
If you’re dealing with a 1TB+ OLTP database with a transactional workload, and you can no longer maintain your indexes due to sheer size, and you don’t have an archiving solution just yet, then maybe…
I’m in the process of copying billions of rows over to a partitioned version of a production table on a 6TB database, without downtime (because I don’t have that option), in order to have maintainable indexes.
We need uniqueness and the partitioned column isn’t part of the unique columns, so switching in/out won’t be possible.
Burying our head in the sand didn’t make it go away, so we’ve gone for partitioning.
The data warehouse part will be offloaded to an Azure ‘copy’ in the nearish future, so I’m not too concerned about WHERE clauses and partition elimination.
Thank you for your dedication and fun value while sharing this knowledge. It makes our job that much more fun and really, really helps!