When SQL 2005 was in beta, I clapped my hands and jumped for joy the first time I played with partitioning. I’ve done a lot of data warehouse work, and partitioning lets the DBA break down massive tables into smaller, easy-to-manage chunks. We can put heavily accessed parts of the table onto fast storage, and less-accessed data onto slower, cheaper storage. The cool part is that it’s completely transparent to applications: they don’t have to know the table is even partitioned.
We implemented it almost as soon as SQL 2005 came out, and I kept wondering why more people didn’t take advantage of this feature. Sure, it’s not particularly intuitive, and there’s no GUI for it in SQL Server Management Studio, but it does work.
“Can partitioning help us?”
Fast forward to yesterday: I met with Mark Schmulen and David Lyman of NutshellMail.com. It’s a slick solution for those of us who work behind corporate firewalls that prevent us from accessing personal email sites like Gmail and Yahoo, as well as personal networking sites like Facebook and Myspace. When I worked for JPMorgan Chase, I was frustrated that I couldn’t get my personal email because I had a network of DBA buddies who all knew and used my personal email address. We would ask each other for support from time to time, and I liked being able to help out. At JPMC, though, the only way I could do that was contact all my friends and have them switch to using my corporate email – which I couldn’t check when I was outside of the office, so that was a lose-lose situation for me. NutshellMail gets around that by checking your personal email accounts on a schedule you select, and then sending you digest emails to your work address. Makes perfect sense, and I’d use it in a heartbeat – even if they weren’t using SQL Server for storage. (Bonus!)
We talked scalability, and they’d heard partitioning may be a good answer for future growth. I was pleasantly surprised to hear that the word about partitioning is getting out, and that was the first time I said no, partitioning isn’t the right answer for this one particular case. Having talked through that, I wanted to touch base on some of the reasons here.
Partitioned tables require Enterprise Edition.
Standard Edition can’t do partitioning, and Enterprise is a few times more expensive than Standard. On a 2-CPU box, we’re talking about a $30-$50k difference in costs. If we’re only using Enterprise for the partitioning, then there’s a better return on investment if we put that $30-$50k into faster storage instead. This is especially relevant for servers with direct attached storage, even in a raid 10 config. I wouldn’t consider partitioning unless the server had at least 10-20 drives for the data array alone.
Partitioning best practices means using both fast and slow storage.
In the case of a data warehouse, a sales table is loaded every night with the previous day’s sales data. The most recent data is under heavy load because it’s constantly being accessed by both the ETL processes and the daily reports – because users really want to know what happened in the last few days to a month. We have to keep years and years of sales data on hand for curious users who want to check trends, though, and we don’t want to keep all of that data on extremely fast (and extremely expensive) storage. (Yes, partitioning gurus, I know there’s work involved in moving data from one array to another, but sometimes it’s worth it.) If we partitioned by date range, we could keep the new data on fast storage, and shuffle the older data off to slower storage.
Partition functions and schemes are easy to design incorrectly.
Before implementing partitioning for the first time, either get personalized advice from a DBA who’s done it several times, or get an identical development system for repeated testing. I had the luxury of an identical QA environment where I could repeatedly test different partitioning strategies and replay the end user load to see which strategy performed best. Not everybody’s that lucky (come to think of it, I should have played the lotto more) and it’s a mistake to try it on a production system first. When I first started using it, I thought I knew exactly the right strategy for our data, and we’re talking about a schema I knew forwards and backwards. I was wrong – really wrong – but at least I found the right answer before going into production.
In summary, outside of data warehouses, I like to think of partitioning as the nuclear bomb option. When things are going out of control way faster than you can handle with any other strategy, then partitioning works really well. However, it’s expensive to implement (Enterprise Edition plus a SAN) and you don’t want to see it in the hands of people you don’t trust.
If you’re considering partitioning, you might want to check out…
Partitioning is just one way to get performance out of your large database. Here’s a few more:
- SAN Multipathing – this rarely-covered concept is the key to getting fast performance out of a SAN.
- Should SQL Server Use Shared Drives? – a reader wrote in asking how to configure his SAN for SQL Server.
- Steel Cage Blogmatch: How to Configure a SAN – Jason Massie and I debate the shared-vs-dedicated-disks question.
Great post, Brent. I’ve been considering date-range partitioning one extremely large (28% of db) transaction table on an OLTP. We have no performance issues at present, but I was hoping to reduce the weight of the backups.
Thanks, sir! You mentioned something that I happen to be working on: I’m going to do a blog post on filegroup backups with partitioning. You’re exactly right that it’ll make backups faster with date-range partitioning based on dates where rows are inserted.
Awesome! Looking forward to reading the post.
Hello Brent. I'm at the last reason you posted….. and truly hoping not to do it wrong. I agree – partitions are fabulous in terms of performance and usability of VLDB's! But, I haven't touched them hands-on since early v2000 (very problematic, very high maintenance). Right now, I am trying to construct a daily partition strategy for v2008, where the data is massive and more than warrants it. Daily partitions… I'm reading everything i can get my hands on; ie., 'best practices suggest as many filegroups as you have partitions'…. curious, do you know where I may find an example of a daily strategy (function, scheme, etc.)?
Hi! The best one I know isn't out yet – Ron Talmage and crew are writing a whitepaper for Microsoft on it, and it's supposed to be out pretty soon, but keep your eyes peeled for it. I'll blog about it the instant it hits virtual newstands.
Thanks, Brent. I will keep my eyes open…. in the interim, let's hope she doesn't trip up too badly.
Fortunately, i've got two high-end resources side by side… one for prod, and one to play in. Unfortunately, i need to deploy asap. cross your fingers…. and pls let me know if you hear of anything.
Thanks again.
I'm being amazing for useing partitioning for my case or not I must to store about 1000 record per a second this data is about location of mobile nodes, these data make my database too huge do you think i must partitioning my databa
se or not(I have so much reporting in future).
Another reason to make use of partitioning is to take advantage of Microsoft’s cloud (SQL Azure), which has a 10GB per partition limit (as at Dec 2009).
Partitioning seems to be a core requirement actually, to using SQL Azure as I’d bet that most production databases for people reading your blog will be way more than 10GB in size. And in the cloud scenario the licensing cost for SQL Server Enterprise Edition is no longer an issue.
Also, you’ve highlighted the need from a disk performance perspective, but it’s nice as a way of implementing some sort of data archiving over time without burdening the application layer.
Kevin – ouch, sorry, it doesn’t work that way. SQL Azure doesn’t support SQL Server partitioning. You have to design differently – you have to do sharding. Sharding and SQL Server partitioning are similar in theory, but not in execution. You can’t code your app the same way to take advantage of both.
In addition, it’s not really useful for “some sort of data archiving” because as of the current SQL Server version, you can’t back up partitions separately. In addition, unless you specify the partitioned dates in the where clause, SQL still has to hit all of the partitions (current and archive), so you don’t really gain anything there either.
Thanks for your comment though!
Our gigantic table doesn’t have a DATE column, but rather an INTERVAL column that is joined with an INTERVAL table for the date. Any thoughts on an effective partitioning scheme for our DATEless gigantic table? (e.g. Partitioning on ranges of values in and IDX column?)
Thanks in advance!
Sure, absolutely, you can partition on the interval column using ranges. Your best bet is to try several different partitioning schemes/functions and benchmark each of them with your own query loads. That way you’ll find out what works best for your own queries. SQL Server’s engine doesn’t always do a good job of partition elimination based on the query, and your results will definitely vary, so internal testing is the way to go.
Hi Brent,
A wonderful article on partitioning…Continuing in the same vein, I have a few questions that you can throw some light on, or offer suggestions as necessary:
Scenario:
I have partitioned 5 large tables across 10 partitions (with 10 filegroups) and with a set of 5 – 6 users the performance seems to be adequate, with minimal conflicts (like deadlocks etc), but I want to implement this scheme that can scale to say 300+ users. With regard to this I have the following questions:
– What is the maximum number of partitions that works optimally? In other words, even though SQL Server allows 1000 partitions per table, it would become an administrative nightmare were we to deploy such a scheme.
Degradation Concerns:
– What would be the maximum (but practical) number of partitions that one can deploy before one has to worry about degradation in performance and concurrency?
– If the number of users access the same partition, what would be the negative impact on the overall scheme (in terms, again of performance)?
– Does a multi-CPU environment enhance the overall performace? Of course, we can suggest to the client to use RAID based storage with data striped across multiple drives?
– Any other aspects related to partitioning that can help me for implementation?
MORE DETAILS that might help you hone on the answers if any :
1. The largest table holds about 50 million rows as of now (but this is only a test db..) In other words it can grow to a billion.
2. As of now I have about 10 partitions defined out of which only the first 6 are occupied. The number 10 is not a holy number or restrictive in any sense. I have just chosen this number to be at a manageable level.
3. The data density in other 4 tables can vary from 1 – 2 million.
4. Since the tables involved participate in JOIN operations, I have currently used the same partition function for all and the data distribution is based on a ‘derived’ partition_column called partition number. This way the associated tables are in the same partition. Each partition is associated with one filegroup that comprises of many files.
5. I have also aligned the indexes for the tables across the same partitions.
6. As for the available hardware, I need to contact the client as to what they have and whether they can upgrade if needed. I needed a ballpark infrastructure that I can use to recommend. In other words, at this juncture, I cannot state with confidence how powerful their servers are, but given a recommendation about the requirements, I am sure they would comply. I DO know that their servers are multi-CPU.
7. As for the RAID configuration to be suggested, would RAID 1 and RAID 10 serve the purpose? Both configurations conform to redundancy
Does this information suffice or do I need to provide more?
Please come back with your suggestions or questions and I shall try to provide more information if needed. While I can find several articles of partitioning (even from Microsoft MVPs), I would like answers in particular, to address my concerns expressed above.
With best regards,
Venky
Venky – hi! Boy, have you got a lot of good questions there. Here’s the thing – this would require way more work than I can put into a comment. You’re asking for a level of detail that’s usually done by a consultant rather than put onto a public web page. If you’d like to talk to a consultant, I can give you a few good references if you’re interested. Drop me an email at brento@brentozar.com if that’s something you’d like to pursue.
Hey Brent,
All the articles I have read on table partitioning use terms like “huge tables”, or “large tables”, but I have never read a definition of what a huge or large table is. Do you have any suggestions on data size or row count minimums on when to start considering partitioning a fact table? I manage a data warehouse with a handful of fact tables ranging from 5-10 million rows, (data size 5-8 GB each). I don’t consider these to be very large and definately not huge, and have never found very clear answers on whether or not tables of this size would benefit much from partitioning.
Thanks for any advice you may have,
Adam Sottosanti
Adam – good question. The Wikipedia entry on very large databases (VLDBs) suggests that it’s over 1 terabyte or several billion rows. When I’m working with databases and I see a table with over 10 million rows, I want to start paying closer attention to storage strategies, indexing strategies, and recovery strategies. 10 million rows isn’t much at all these days – but the thing is, you want to have your strategies in place before the table grows to 100 million rows. At 100 million rows, it’s still relatively easy to change strategies, but you want to have experience before you get there. At a billion rows, it’s too late to start thinking about these strategies – they need to be solidly in place.