There’s a secret to learning about SQL Server’s table partitioning feature: you need to get your hands on some code and really play with it in a test environment.
In this tutorial, Kendra will walk you through test scripts that set up a sample table partitioning environment. These scripts demonstrate gotchas to look out for and the cool features that can dramatically change your database’s performance.
Want to run the demos yourself? Download Table Partitioning Tutorial: Scripts.
Part 1: Our Mission
How can you use the demo and scripts from today in the best way possible? Find out here. 3.5 minutes
Part 2: How to create your test database and helper objects
It’s demotime! We create a test database with views that will help us quickly see critical information about out partitioned object. 3.5 minutes
Part 3: How to create partitioned objects
Next up we create a partition function. We create filegroups and files, and use a partition scheme to map the filegroups. We then create a partitioned object and indexes. Our helper views (created in Part 2), help us make sure things went as expected. 15 minutes
Part 4: Switch IN! How to use partition switching to add data to a partitioned table
Now for the cool stuff. In this session we explore how partition switching can allow us to snap a pile of data quickly into a partitioned table– and a major gotcha which can derail the whole process. 12 minutes
Part 5: Switch OUT! How to move data out of a partitioned table and merge boundary points
Deleting data is never fun… unless you’re switching data out of a partitioned table. But make sure to mind those best practices. In this section we show how to remove a lot of data from a partitioned table quickly and safely. 8 minutes
Part 6: Table partitioning takeaways
What are the big things we covered that you need to remember? Make sure you got all the important points here. 3 minutes
We’ve got tons more information on table partitioning at http://brentozar.com/go/partitioning
Looking for the scripts? Download Table Partitioning Tutorial: Scripts.
See http://BrentOzar.com/go/eula for our End User Licensing Agreement which governs use of the scripts.
A while back, Jes asked who’s taking your backups. Making sure you have good backups is important. How much thought are you giving to handling historical backups? Right now, there’s a really good chance that you’re using a solution based on tape. While tape backups work, there’s a better way.
How Are You Archiving Backups Right Now?
Sending backups to tape isn’t the easiest process. For SQL Server, the process looks something like this: SQL Server backs up the database, the backup files are copied from a central location to a tape, on a regular schedule an administrator takes tapes out of the backup machine and sends them to an off-site facility. Doesn’t that sound like fun?
In addition to requiring that tapes need to be added and remove from a tape robot, magnetic tape also has the distinct disadvantage of requiring careful storage and handling to prevent damage to the storage media. There has to be a better way.
Offloading Backup Archives to the Cloud
Durable off-site storage is a must for a lot of businesses and when you don’t have requirements for physical media, I can’t think of a better option than using Amazon S3. Many companies are already making use of Amazon S3 to house durable off-site backups of data. S3 has the advantage of being durable and relatively highly available – the S3 SLA guarantees ten 9s of durability and four 9s of availability. For this privilege, we pay a pittance (between $0.05 and $0.13 per GB per month). And, let’s face it, that’s a cheap price to pay for being able to expand your archive capabilities on demand.
Amazon Glacier is a relatively new, low cost, durable storage solution. It looks a lot like S3 but has a distinct price advantage – Glacier costs $0.01 per GB per month. Glacier is built with long term storage in mind – storage is incredibly cheap but retrieval takes longer and costs more. When you need to retrieve data from Glacier you issue a request and Amazon will notify you when the data is available to download. Typically this takes a few hours, but it’s faster than getting tapes returned from off-site storage.
Automating the Archive Lifecycle
Until recently, putting data into Glacier required that administrators or developers create a set of scripts to push data into Glacier from S3 as it aged out. While this works, it’s still a manual step – if something happens to the server driving the data movement data won’t be copied. Earlier this week, Amazon announced support for automatic archiving into Glacier through lifecycle rules.
Lifecycle rules make it easy to automatically move files into Glacier based on a prefix and a relative or absolute timestamp. It’s easy to create groups of groups of backups and archive them on a daily basis. Rules can be even use to expire the files once they’ve been in Glacier for a fixed amount of time. Some businesses are required to keep backups, source data, or even older versions of the code base for a period of time – marking files for expiration makes it easy to comply with internal and external regulations.
Data lifecycle rules sound like they’re going to be painful to create, right? Thankfully, it’s incredibly easy to put one together. There’s only one step. In this example, files with a name beginning in “archive” will be archived to Glacier after 15 days and deleted from Glacier after 180 days.
What Does AWS Glacier Mean For Your Backups?
It probably doesn’t mean anything right now if you aren’t already looking at using AWS. The combination of S3 and Glacier gives DBAs and system administrators another set of options for keeping backups for long periods of time. Automating data motion removes the fallibility of human processes and physical media from the equation. It’s worth considering how you can improve your backup retention, reliability, and recoverability by automating storage of backups using S3 and Glacier.
Scaling up is hard: big hardware gets expensive fast. Scaling out is equally difficult; interesting design problems creep in to scale out solutions. One of the more troublesome issues architects face when scaling out is the issue of distributing identity. It’s often advantageous for object identity to be unique and portable across all database instances in an application – we may need to move a user from server A to server B. Once multiple database servers are involved, a centralized sequence generator can be come a single bottleneck or may even run out of values under heavy load. A solution is needed to generate unique values outside of the database while avoiding the performance problems of random, or semi-random, GUIDs.
Sequential Beginnings: Identity and IDENTITY
In the beginning, there was an empty schema. At this point an architect returned from the coffee pot and made a decision about database identities or GUIDs. There many valid reasons to make a decision in either direction – identities or sequences are controlled in the database and can be used to ensure a physical order to data. Who knew that spinning hard disks work better when data is sequential? (By the way, sequential GUIDs may not be sequential.)
You can make a lot of arguments for the right or wrong way to do things from a logical perspective, but DBAs do have a pretty good point when they say that randomness can cause problems for database performance. Generating sequential identifiers in the database may not be the most elegant solution to identity problems, but it does ensure that data is written in an order that makes sense in a world of spinning disk drives.
Database controlled sequential identity has one problem: sequential identities will need to be generated on a single server. Under sufficient load, that server will become a bottleneck for application scalability. To move past a single server as a bottleneck, a more robust and load tolerant solution is needed.
Distributing Identity: The Case for GUIDs
Architects and developers may be thinking that identities are great, but what happens when everything gets further apart?
As applications grow (or even by design), it becomes common to see teams become worried about distributing workload across many servers, using queues, or even splitting the data out into multiple databases or database servers. It’s at this point in the discussion that things get heated and people start throwing around the idea that GUIDs are the only way to solve this problem. Or that you just can’t rely on identity from the database and application generated identity is the only identity that matters.
This is where the war about numbers vs GUIDs gets nasty and someone gets their feelings hurt. Ignoring the size of GUIDs, I can say that I’ve witnessed several GUID collisions in production systems. GUIDs are only theoretically unique – they may even create a problem that you didn’t know you had.
A Better Solution for Distributed Identity
Combining distributed identity and ordered data seems like it’s a hard problem. Random GUIDs can’t be guaranteed to be unique, sequential GUIDs can’t be guaranteed to be non-overlapping, and database generated identities require a persistent connection to a database (or else they require a looser idea of identity than some folks are comfortable with).
Moving away from the database as the central keeper of all knowledge and identity is difficult and many teams seem to equate moving identity out of the database with moving all logic and functionality out of the database. This doesn’t have to be the case. The database can still be used to provide a considerable amount of declarative functionality and logic but identity generation can be moved outside of the database.
Twitter and Boundary have solved the problems of distributed sequencing by moving the work away from the data tier. Both solutions solve the problem by treating a number as if it were an array of information. The first portion of a sufficiently large number is a timestamp; the timestamp is stored as the number of milliseconds since a previous point in time. The next number is a worker identifier – this can be anything that uniquely identifies the device generating the sequence. Finally there’s a sequence itself. The sequence is typically small (between 8 and 16 bits) and it starts counting again from 0 every time the millisecond counter changes.
The machine identifier, usually the MAC address, doesn’t matter specifically, but we do need to be able to reliably generate separate sequences of IDs. This doesn’t have to be a MAC address, it could be any number of bytes that identify a unique source of sequences. By storing milliseconds since epoch as the first portion of the key, we’re able to produce a sequence that’s mostly ordered with some random jitter at the intermediate levels. On the whole, though, our inserts will be ordered.
If you’re on the .NET side of the house, I solved this problem with a library called Rustflakes. The implementation is lifted wholesale from Boundary’s flake and the generated sequence values are a .NET
decimal which lines up with SQL Server’s
DECIMAL data type – it’s nothing more than an ordered 128-bit number. Which, coincidentally, is the same size as a GUID.
Wrapping it Up
There’s no easy solution to this problem. GUIDs are an easier approach, but they introduce additional load and maintenance overhead on the data storage mechanism. Distributed sequences don’t solve all of the problems of GUIDs, but they provide additional flexibility for database administrators, developers, and application architects alike.
Strong development practices don’t spring up overnight; they take time, effort, and teamwork. Database development practices are doubly hard because they involve many moving pieces – unit testing, integration testing, and deploying changes that could have potential side effects beyond changing logic. In this session, Microsoft SQL Server MVP Jeremiah Peschka will discuss ways users can move toward a healthy cycle of database development using version control, automated testing, and rapid deployment.
Tools Mentioned in the Webcast
- SQL Server Alerts - Having basic SQL Server alerts in place can give you warnings of possible hardware failure.
- tSQLt – A database unit testing framework for SQL Server.
- Fixed database roles and user-defined database roles help define granular security.
- Many libraries exist for performing database migrations – they’ll typically be bundled with an ORM. Even Entity Framework supports database migrations!