SELECT…INTO is one of my favorite SQL Server features.
It’s great for creating table skeletons with false WHERE clauses (1=2), moving a real table to a staged/temp location for testing, etc.
In SQL Server 2014
It acquired the ability to go parallel, which is pretty neat, but that’s not what we’re talking about here.
It has some limitations
Chief among them is this:
Indexes, constraints, and triggers defined in the source table are not transferred to the new table, nor can they be specified in the SELECT…INTO statement. If these objects are required, you can create them after executing the SELECT…INTO statement.
Which is… Sort of true. There’s a trick, and I’ll show you a quick example here with another of my favorite things: a Numbers table.
;WITH E1(N) AS ( SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL ), E2(N) AS (SELECT NULL FROM E1 a, E1 b, E1 c, E1 d, E1 e, E1 f, E1 g, E1 h, E1 i, E1 j), Numbers AS (SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM E2) SELECT [N].[N] INTO #NumbersTest FROM [Numbers] N ALTER TABLE #NumbersTest ADD CONSTRAINT [PK_Numbers] PRIMARY KEY CLUSTERED (N) WITH (FILLFACTOR = 100)
Trying to add the PK constraint here fails, because the column is NULLable
Msg 8111, Level 16, State 1, Line 37
Cannot define PRIMARY KEY constraint on nullable column in table '#NumbersTest'.
Msg 1750, Level 16, State 0, Line 37
Could not create constraint or index. See previous errors.
We can verify this by looking at the table metadata:
SELECT [columns].[name], [columns].[is_nullable] FROM tempdb.sys.columns WHERE [object_id] = OBJECT_ID(N'tempdb..#NumbersTest');
So how do we fix this? We could alter the table, but that won’t leave us with the lasting satisfaction of proving BOL wrong on a technicality. We’ll adjust our code a bit, and try again.
;WITH E1(N) AS ( SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL ), E2(N) AS (SELECT NULL FROM E1 a, E1 b, E1 c, E1 d, E1 e, E1 f, E1 g, E1 h, E1 i, E1 j), Numbers AS (SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM E2) SELECT ISNULL([N].[N], 0) AS N /* <--- The magic is here! */ INTO #NumbersTest_IN FROM [Numbers] N ALTER TABLE #NumbersTest_IN ADD CONSTRAINT [PK_Numbers] PRIMARY KEY CLUSTERED (N) WITH (FILLFACTOR = 100)
This time, with the addition of an ISNULL check on the column, it ‘inherits’ the not NULLable property, and the PK constraint adds successfully. We can verify that by checking the table metadata, if you don’t believe me:
Note that this same behavior does not occur if you replace ISNULL() with COALESCE()
And with that, you can clean up the test tables. Or keep them around. Do some math. Whatever blows your hair back.
Got a favorite use for a numbers table outside of string splitting or doing date math? Let me know in the comments! I may write a follow up.
DROP TABLE [#NumbersTest] DROP TABLE [#NumbersTest_IN]
Brent says: Wow. That is a really slick trick.
Kendra says: I know some people have bias against SELECT INTO because it seems sloppy and you have to go the extra mile to get the right types, but it can be a great tool. Don’t rule it out.
Jeremiah says: I really like this method – it’s a quick way to copy tables without using other cumbersome techniques
Have you ever wished your SQL Server could have an identical twin, holding the same data, in case you ever needed it? SQL Server mirroring provides just that, and you can choose if you want it for high availability or disaster recovery. If you’ve ever been curious about what mirroring really is, and what the pros and cons are, this is the session for you.
From the Dept of Corrections: During the webcast a viewer asked in Q&A if automatic page repair was a one way or two way street. Kendra answered that if the mirror gets an IO error it will go into a suspended state. This is somewhat correct but incomplete– the mirror will also try to correct the issue with a page from the principal afterward and attempt to resume mirroring. More info here.
There’s a secret to learning about SQL Server’s table partitioning feature: you need to get your hands on some code and really play with it in a test environment.
In this tutorial, Kendra will walk you through test scripts that set up a sample table partitioning environment. These scripts demonstrate gotchas to look out for and the cool features that can dramatically change your database’s performance.
Want to run the demos yourself? Download Table Partitioning Tutorial: Scripts.
Part 1: Our Mission
How can you use the demo and scripts from today in the best way possible? Find out here. 3.5 minutes
Part 2: How to create your test database and helper objects
It’s demotime! We create a test database with views that will help us quickly see critical information about out partitioned object. 3.5 minutes
Part 3: How to create partitioned objects
Next up we create a partition function. We create filegroups and files, and use a partition scheme to map the filegroups. We then create a partitioned object and indexes. Our helper views (created in Part 2), help us make sure things went as expected. 15 minutes
Part 4: Switch IN! How to use partition switching to add data to a partitioned table
Now for the cool stuff. In this session we explore how partition switching can allow us to snap a pile of data quickly into a partitioned table– and a major gotcha which can derail the whole process. 12 minutes
Part 5: Switch OUT! How to move data out of a partitioned table and merge boundary points
Deleting data is never fun… unless you’re switching data out of a partitioned table. But make sure to mind those best practices. In this section we show how to remove a lot of data from a partitioned table quickly and safely. 8 minutes
Part 6: Table partitioning takeaways
What are the big things we covered that you need to remember? Make sure you got all the important points here. 3 minutes
The good news: we’ve got tons more information on table partitioning at http://brentozar.com/go/partitioning
The bad news: things get even more complicated than you see in this post. If you’re looking at switching out and merging partitions, before you get started on that read my post here on best practices and merging boundary points.
Looking for the scripts? Download Table Partitioning Tutorial: Scripts.
See http://BrentOzar.com/go/eula for our End User Licensing Agreement which governs use of the scripts.
A while back, Jes asked who’s taking your backups. Making sure you have good backups is important. How much thought are you giving to handling historical backups? Right now, there’s a really good chance that you’re using a solution based on tape. While tape backups work, there’s a better way.
How Are You Archiving Backups Right Now?
Sending backups to tape isn’t the easiest process. For SQL Server, the process looks something like this: SQL Server backs up the database, the backup files are copied from a central location to a tape, on a regular schedule an administrator takes tapes out of the backup machine and sends them to an off-site facility. Doesn’t that sound like fun?
In addition to requiring that tapes need to be added and remove from a tape robot, magnetic tape also has the distinct disadvantage of requiring careful storage and handling to prevent damage to the storage media. There has to be a better way.
Offloading Backup Archives to the Cloud
Durable off-site storage is a must for a lot of businesses and when you don’t have requirements for physical media, I can’t think of a better option than using Amazon S3. Many companies are already making use of Amazon S3 to house durable off-site backups of data. S3 has the advantage of being durable and relatively highly available – the S3 SLA guarantees ten 9s of durability and four 9s of availability. For this privilege, we pay a pittance (between $0.05 and $0.13 per GB per month). And, let’s face it, that’s a cheap price to pay for being able to expand your archive capabilities on demand.
Amazon Glacier is a relatively new, low cost, durable storage solution. It looks a lot like S3 but has a distinct price advantage – Glacier costs $0.01 per GB per month. Glacier is built with long term storage in mind – storage is incredibly cheap but retrieval takes longer and costs more. When you need to retrieve data from Glacier you issue a request and Amazon will notify you when the data is available to download. Typically this takes a few hours, but it’s faster than getting tapes returned from off-site storage.
Automating the Archive Lifecycle
Until recently, putting data into Glacier required that administrators or developers create a set of scripts to push data into Glacier from S3 as it aged out. While this works, it’s still a manual step – if something happens to the server driving the data movement data won’t be copied. Earlier this week, Amazon announced support for automatic archiving into Glacier through lifecycle rules.
Lifecycle rules make it easy to automatically move files into Glacier based on a prefix and a relative or absolute timestamp. It’s easy to create groups of groups of backups and archive them on a daily basis. Rules can be even use to expire the files once they’ve been in Glacier for a fixed amount of time. Some businesses are required to keep backups, source data, or even older versions of the code base for a period of time – marking files for expiration makes it easy to comply with internal and external regulations.
Data lifecycle rules sound like they’re going to be painful to create, right? Thankfully, it’s incredibly easy to put one together. There’s only one step. In this example, files with a name beginning in “archive” will be archived to Glacier after 15 days and deleted from Glacier after 180 days.
What Does AWS Glacier Mean For Your Backups?
It probably doesn’t mean anything right now if you aren’t already looking at using AWS. The combination of S3 and Glacier gives DBAs and system administrators another set of options for keeping backups for long periods of time. Automating data motion removes the fallibility of human processes and physical media from the equation. It’s worth considering how you can improve your backup retention, reliability, and recoverability by automating storage of backups using S3 and Glacier.
Scaling up is hard: big hardware gets expensive fast. Scaling out is equally difficult; interesting design problems creep in to scale out solutions. One of the more troublesome issues architects face when scaling out is the issue of distributing identity. It’s often advantageous for object identity to be unique and portable across all database instances in an application – we may need to move a user from server A to server B. Once multiple database servers are involved, a centralized sequence generator can be come a single bottleneck or may even run out of values under heavy load. A solution is needed to generate unique values outside of the database while avoiding the performance problems of random, or semi-random, GUIDs.
Sequential Beginnings: Identity and IDENTITY
In the beginning, there was an empty schema. At this point an architect returned from the coffee pot and made a decision about database identities or GUIDs. There many valid reasons to make a decision in either direction – identities or sequences are controlled in the database and can be used to ensure a physical order to data. Who knew that spinning hard disks work better when data is sequential? (By the way, sequential GUIDs may not be sequential.)
You can make a lot of arguments for the right or wrong way to do things from a logical perspective, but DBAs do have a pretty good point when they say that randomness can cause problems for database performance. Generating sequential identifiers in the database may not be the most elegant solution to identity problems, but it does ensure that data is written in an order that makes sense in a world of spinning disk drives.
Database controlled sequential identity has one problem: sequential identities will need to be generated on a single server. Under sufficient load, that server will become a bottleneck for application scalability. To move past a single server as a bottleneck, a more robust and load tolerant solution is needed.
Distributing Identity: The Case for GUIDs
Architects and developers may be thinking that identities are great, but what happens when everything gets further apart?
As applications grow (or even by design), it becomes common to see teams become worried about distributing workload across many servers, using queues, or even splitting the data out into multiple databases or database servers. It’s at this point in the discussion that things get heated and people start throwing around the idea that GUIDs are the only way to solve this problem. Or that you just can’t rely on identity from the database and application generated identity is the only identity that matters.
This is where the war about numbers vs GUIDs gets nasty and someone gets their feelings hurt. Ignoring the size of GUIDs, I can say that I’ve witnessed several GUID collisions in production systems. GUIDs are only theoretically unique – they may even create a problem that you didn’t know you had.
A Better Solution for Distributed Identity
Combining distributed identity and ordered data seems like it’s a hard problem. Random GUIDs can’t be guaranteed to be unique, sequential GUIDs can’t be guaranteed to be non-overlapping, and database generated identities require a persistent connection to a database (or else they require a looser idea of identity than some folks are comfortable with).
Moving away from the database as the central keeper of all knowledge and identity is difficult and many teams seem to equate moving identity out of the database with moving all logic and functionality out of the database. This doesn’t have to be the case. The database can still be used to provide a considerable amount of declarative functionality and logic but identity generation can be moved outside of the database.
Twitter and Boundary have solved the problems of distributed sequencing by moving the work away from the data tier. Both solutions solve the problem by treating a number as if it were an array of information. The first portion of a sufficiently large number is a timestamp; the timestamp is stored as the number of milliseconds since a previous point in time. The next number is a worker identifier – this can be anything that uniquely identifies the device generating the sequence. Finally there’s a sequence itself. The sequence is typically small (between 8 and 16 bits) and it starts counting again from 0 every time the millisecond counter changes.
The machine identifier, usually the MAC address, doesn’t matter specifically, but we do need to be able to reliably generate separate sequences of IDs. This doesn’t have to be a MAC address, it could be any number of bytes that identify a unique source of sequences. By storing milliseconds since epoch as the first portion of the key, we’re able to produce a sequence that’s mostly ordered with some random jitter at the intermediate levels. On the whole, though, our inserts will be ordered.
If you’re on the .NET side of the house, I solved this problem with a library called Rustflakes. The implementation is lifted wholesale from Boundary’s flake and the generated sequence values are a .NET
decimal which lines up with SQL Server’s
DECIMAL data type – it’s nothing more than an ordered 128-bit number. Which, coincidentally, is the same size as a GUID.
Wrapping it Up
There’s no easy solution to this problem. GUIDs are an easier approach, but they introduce additional load and maintenance overhead on the data storage mechanism. Distributed sequences don’t solve all of the problems of GUIDs, but they provide additional flexibility for database administrators, developers, and application architects alike.
Strong development practices don’t spring up overnight; they take time, effort, and teamwork. Database development practices are doubly hard because they involve many moving pieces – unit testing, integration testing, and deploying changes that could have potential side effects beyond changing logic. In this session, Microsoft SQL Server MVP Jeremiah Peschka will discuss ways users can move toward a healthy cycle of database development using version control, automated testing, and rapid deployment.
Tools Mentioned in the Webcast
- SQL Server Alerts – Having basic SQL Server alerts in place can give you warnings of possible hardware failure.
- tSQLt – A database unit testing framework for SQL Server.
- Fixed database roles and user-defined database roles help define granular security.
- Many libraries exist for performing database migrations – they’ll typically be bundled with an ORM. Even Entity Framework supports database migrations!