A Guide to Contributing Code

So you’ve got a great idea for a new feature to add to sp_BlitzSomethingOrOther, what’s the best way to get started?

The Documentation Says…

Our code contribution guidelines say that you should write new code, write a test, sign an agreement, and then send us your code. That’s technically correct, but it’s a daunting task. After all – five people are going to be looking at your code, and then thousands more might be looking at it.

The documentation is technically correct (like most documentation), but it assumes a lot.

If you go by what the documentation suggests, we’ll definitely see your code, but there’s a decent chance that we’re not going to accept your code contribution.

Start Small

The best way to get started on any existing project is to start small. It’s rare to write a major feature on your first contribution to any new project or product – there’s a significant barrier to entry around code formatting and style, future features, and work in progress.

The best way to help out is to fix a bug in the code.

Right now you’re saying, “Fix someone else’s bugs? No way!”

Hear me out.

When you find and fix a bug in the code, you’re signaling a few things. The first thing you signal is that you have a better eye for detail than the moron who wrote the code. The second thing you signal is that you want to help that moron make their software a little bit better.

Build Trust and Understanding

Contributing small fixes to an existing code base goes a long way to establishing trust. It’s one of the ways that we all work together and bring new employees up to speed with our tools. We don’t throw each other into the deep end (much). Instead we get familiar with our own software tooling by looking for issues and fixing them. We build up trust in each other as we’re building up knowledge.

By fixing bugs, you’re building trust and establishing a working knowledge around a particular code base.

Beyond building trust, you’re also getting an understanding of how a particular piece of code is put together. As an example, sp_BlitzCache is riddled with dynamic SQL, XQuery, XPath, and strange DMV ordering. It’s all done for a reason, and that reason is performance. A few changes would take sp_BlitzCache from finishing in 30-90 seconds to finishing in 30-90 minutes – I should know, I’ve introduced those changes before.

As you’re in the code fixing bugs, you’ll spot places to add more features and functionality. This is a great place to strike up a conversation with the authors about adding those new features, or at least getting them on a roadmap.

Sometimes, we’re already working on a feature but we haven’t made anything about it public yet. You don’t want to spend hours writing a new feature only to see it come out in a completely different format. Building up that relationship of trust means we’ll be chatting with you about our ideas and you’ll be aware of our crazy ideas as they happen.

Code review is hard!

…But Test First

The best reason to start out by fixing bugs is that we have a very strange test set up. By testing your changes the same way we test our changes, you can rest assured that your changes will be accepted on their merit, and not rejected on a technicality.

We test our code changes on multiple versions of SQL Server and we use case-sensitive instances. A simple mistake in the capitalization of a column name can stop a query from running for some users, and we’d rather be safe than sorry.
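To see why a case-sensitive instance catches mistakes that a case-insensitive one hides, here is a minimal Python sketch. It only mimics the effect of collation on name lookups; it is not how SQL Server resolves names internally, and the `resolve_column` helper and the column names are hypothetical, made up for illustration.

```python
# Hypothetical illustration of name lookup under case-insensitive and
# case-sensitive rules. The column names below are invented.
COLUMNS = ["QueryHash", "ExecutionCount", "TotalCPU"]


def resolve_column(name, case_sensitive):
    """Return the matching column name, or None if the lookup fails."""
    for column in COLUMNS:
        if case_sensitive and column == name:
            return column
        if not case_sensitive and column.lower() == name.lower():
            return column
    return None


# Sloppy capitalization resolves on a case-insensitive instance...
print(resolve_column("queryhash", case_sensitive=False))
# ...and fails to resolve on a case-sensitive one.
print(resolve_column("queryhash", case_sensitive=True))
```

The same query text passes on one instance and fails on the other, which is why testing on the stricter setting is the safer default.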

Too Long; Didn’t Read

In short, the best way to get started contributing to sp_BlitzWhatever is:

  1. Find a bug.
  2. Fix the bug.
  3. Submit your fixes.
  4. Rinse. Repeat.
  5. Work up to implementing bigger fixes & features.

Get started today, head over to and pick out a bug that someone has found. Submit your ideas at

Introduction to the Oracle Data Dictionary

If you’re going to be working with Oracle, you need to be able to get a better handle on what’s going on with the Oracle database. Just like other database platforms, Oracle provides a data dictionary to help users interrogate the database system.

Looking at System Objects

Database administrators can view all of the objects in an Oracle system through the DBA_% prefixed objects.

You can get a list of all available views through the dba_objects system view:

/* There's a gotcha here:
   if you installed Oracle as suggested, you'll be using a
   case sensitive collation. That's not a big deal, just
   don't forget that while you don't need to capitalize object
   names in SQL*Plus, you do need to capitalize the names while
   you're searching.
*/
SELECT object_name
FROM dba_objects
WHERE object_name LIKE 'DBA_%';

And the results:


Just over 1000 views, eh? That’s a lot of system views. If you just want to examine a list of tables stored in your Oracle database you can use the dba_tables view to take a look. Here we’ll look at the EXAMPLE database schema:

SELECT owner,
       table_name
FROM   dba_tables
WHERE  tablespace_name = 'EXAMPLE'
ORDER BY owner,
       table_name ;

The curious can use the desc command to get a list of all columns available, either in the dba_tables view, or any of the tables returned by querying dba_tables.

User Objects

A user shouldn’t have access to the DBA_ views. Those are system level views and are best left to people with administrative access to a system. If a user shouldn’t have that level of access, what should they have? Certainly they should have access to their own objects.

Users can view their own data with the USER_ views. There’s a user_objects view that will show information about all objects owned by the current user. If you just want to see your own tables, you can use the user_tables view instead:

SELECT table_name
FROM   user_tables ;

Of course, users may have access to more than just the database objects that they own. In these cases, users can use the ALL_ views to see everything that they have access to:

SELECT COUNT(DISTINCT object_name) FROM all_objects ;
SELECT COUNT(DISTINCT object_name) FROM dba_objects ;

Running these queries returns counts of 52,414 objects in all_objects and 54,325 in dba_objects. Clearly there are a few things that I don’t have direct access to, and that’s a good thing.

System Status with V$ Views

Oracle’s V$ views record current database activity. They provide insight into current activity and, in some cases, they also provide insight into historical activity. There are a number of dynamic performance views (Oracle’s term for the V$ views) covering everything from waits to sessions to data access patterns and beyond.

As an example, you can view all sessions on an Oracle database using the v$session view:

SELECT sid, username, machine
FROM v$session
WHERE username IS NOT NULL ;

Oracle has a wait interface, just like SQL Server. Waits are available at either the system or session level. The v$system_event view shows wait information for the life of the Oracle process. The v$session_event view shows total wait time at a session level (what has this session waited on since it started?). You can look at what each session is currently waiting on (or has just finished waiting on) using v$session_wait.

Using this, we can look into my session on the system with:

SELECT  wait_class,
        event,
        total_waits,
        time_waited
FROM    v$session_event
WHERE   wait_class <> 'Idle'
        AND SID = 255 ;


Sample output from the Oracle v$session_event table.

I’m waiting on me

Don’t be afraid to explore on your local installation. There’s no harm in playing around with different Oracle features to determine how they work and what kind of information you can glean from them.

You can also use the GV$ views (thanks to Jeff Smith for pointing out my omission). These views are designed for Oracle RAC so you can see the health of every node in the RAC cluster. The upside of this is that you can get a big picture of an entire cluster and then dive into individual nodes using the V$ views on each node. You can execute queries that use the GV$ views even if you don’t have RAC, and you’ll be just fine.

A Word of Warning

Be careful with both the data dictionary and the V$ views – querying certain views may trigger license usage to show up in the dba_feature_usage_statistics view. Before using features like Active Session History or the Automatic Workload Repository, make sure that you have the proper features licensed for your Oracle database. Using these optional features for your own education is fine.

Choosing a Cloud Deployment Model [Video]

Tune in here to watch our webcast video for this week! To join our weekly webcast for live Q&A, make sure to watch the video by 12:00 PM EST on Tuesday, October 7! Not only do we answer your questions, we also give away a prize at 12:25 PM EST – don’t miss it!

Have questions? Feel free to leave a comment so we can discuss it on Tuesday!

Five Oracle Myths

It’s Hard to Configure

Historically speaking, Oracle was a bit painful to configure. A DBA needed to be able to size internal components like the rollback segment, buffer cache, large object cache, sort area, and a number of other memory structures. This gave Oracle a reputation for being difficult to configure. Rightfully so – compared to SQL Server at the time, Oracle was difficult to configure.

Starting with Oracle 9i, the database included limited automatic memory management features. Instead of having to size many aspects of memory, Oracle DBAs just had to size two. And with the introduction of Oracle 11g, Oracle memory management became a matter of configuring a max memory target.


A database is a series of tubes, right?

Tuning is Complicated

Database tuning is hard. Thankfully databases just come with GUI wizards that work every time, right?

Database tuning is difficult in both SQL Server and Oracle. Oracle DBAs have a wealth of system views to choose from when designing performance reports. There are the usual tools to get information about instance-level CPU, disk, and other waits.

On top of the system views, Oracle users who have licensed the Diagnostics Pack have access to the Automatic Workload Repository (AWR). AWR constantly collects information about Oracle performance and allows DBAs to get a fine-grained view of performance at a number of levels. On top of the system views provided by AWR, it’s also possible to generate AWR reports that analyze database performance over a period of time.

The User Interface is Bad

SQL Server DBAs and developers who are used to SQL Server Management Studio are initially horrified when they’re exposed to Oracle’s command line user interface through SQL*Plus or RMAN. Although the command line is a rough introduction to a product, it’s also a rich environment where users can run scripts, prompt for input mid-script, and create full featured applications with little more than PL/SQL. Although the command line tools appear unforgiving, they offer a wealth of information, built-in help, and query editing capabilities that tie into the user’s primary tools.

Users who refuse to get on the command line aren’t left out in the cold. Oracle has a pair of tools, Enterprise Manager and SQL Developer, that provide additional tooling for DBAs and developers. Enterprise Manager provides a dashboard for DBAs and system administrators to review server health at many different levels – from the enterprise through to the datacenter and all the way down to a single server. SQL Developer is a development tool with built-in reports; SQL Server professionals will find SQL Developer to be very familiar.

It Doesn’t Run Well on Windows

“Oracle just doesn’t run well on Windows.” I’ve heard this phrase a lot. Oracle runs on Windows, and Windows is officially supported by Oracle for production deployments. Anecdotally, there are very few Windows-only bugs for the Oracle database proper; most bugs are cross-platform.

However, you will find that almost all Oracle examples assume you’re running Oracle on a Linux or UNIX system. A quick scan of various forums, blogs, and other online resources indicates that maybe 20% of Oracle deployments are on Windows. Don’t let that stop you from learning about Oracle – most functionality can be accessed with only minimal knowledge of the operating system. For everything else, there’s always your favorite search engine.

You Need a Team of DBAs

Everyone knows that a SQL Server DBA can manage far more SQL Servers than an Oracle DBA, right? After all, with all that manual memory management, lack of tuning, and no Windows support, you need a team of talented UNIX system administrators to keep Oracle running well.

While it may have required a village to run an Oracle database in the past, it hasn’t been that way for some time. Recent versions of Oracle have automated many of the involved processes. Other features like RMAN and AWR reports provide time-saving features that make it easier for DBAs to do more work.

Your Turn

What other misconceptions have you heard about Oracle’s place in the world of databases?

Is Azure Really 60% Faster?

Microsoft just announced a new round of D-grade VMs that have 60% faster CPU and local SSD that can go up to 7,000 IOPS in a canned Iometer test. Before jumping to conclusions or, even worse, picking a cloud provider, it’s best to look at these numbers critically.

CPU Speeds

The new CPU is being advertised as 60% faster than the previous generation of processors. Clearly this has got to be some next generation hardware, right? Maybe we’ll get access to the new Xeon v3 – it’s not that outlandish of an idea; Amazon Web Services (AWS) had Xeon v2s in their datacenters before the chips were generally available.

Glenn Berry, a consultant who digs into computers for fun, did some initial testing with these new Azure instance types. In his investigations, he saw 2.2GHz E5-2660 chips. These aren’t even the slower end of the new generation of Intel Xeon v2 chips – they’re the previous generation of CPU… from 2012. Azure trades raw power for power efficiency.

If these not-so-fast CPUs are 60% faster, what are your current Azure VMs and SQL Database instances running on? Anecdotal evidence indicates that the current generation of A and P series VMs are running on older AMD Opteron hardware. Older AWS hardware is in the same boat, but it’s slowly being phased out.

When 7000 IOPS really means 437.5 64KB IOPS

SSD Speeds

Microsoft are reporting performance of up to 7,000 IOPS per local Azure SSD, but persistent storage is still rotational. During the D Series SSD VMs interview, a screenshot of Iometer at 7,000 IOPS is shown, but no additional information is provided. Iometer tests typically use a 4KB read/write block size, which is a great size for random file access. It’s not awesome for SQL Server, which does much of its I/O in 64KB extents, but we can divide that by 16 (64KB / 4KB) to get a representative SQL Server number…

437.5 64KB IOPS.
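The arithmetic behind that number can be sketched in a few lines of Python. This assumes throughput stays constant as the block size changes, which is a simplification (real drives rarely scale that neatly); the `convert_iops` helper is a hypothetical name used only for this illustration.

```python
def convert_iops(iops, from_block_kb, to_block_kb):
    """Convert IOPS at one block size to the equivalent at another,
    assuming the bytes moved per second stay the same (a simplification)."""
    return iops * from_block_kb / to_block_kb


# The advertised figure, assumed to be measured with 4KB blocks.
advertised_4kb_iops = 7000

# Express the same throughput as 64KB operations.
print(convert_iops(advertised_4kb_iops, from_block_kb=4, to_block_kb=64))  # 437.5

# The same 7,000 4KB IOPS as raw throughput, in MB/s (1 MB = 1,000,000 bytes).
throughput_mb_per_sec = advertised_4kb_iops * 4 * 1024 / 1_000_000
print(round(throughput_mb_per_sec, 1))
```

The same helper works for any other block size, so it is easy to restate a vendor's headline number in terms that match your own workload.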

Or so the Azure Product Manager says in the original interview. I don’t believe what I hear, and you shouldn’t either, so I fired up an Azure D14 VM to see for myself. What I saw was pleasantly surprising:

All the MBps

If we dig into the IOPS provided by Crystal Disk Mark, we see a decent looking picture unfold:

CrystalDiskMark 3.0.3 x64 (C) 2007-2013 hiyohiyo
 Crystal Dew World :
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]
Sequential Read : 705.103 MB/s
 Sequential Write : 394.053 MB/s
 Random Read 512KB : 528.562 MB/s
 Random Write 512KB : 398.193 MB/s
 Random Read 4KB (QD=1) : 16.156 MB/s [ 3944.4 IOPS]
 Random Write 4KB (QD=1) : 26.506 MB/s [ 6471.1 IOPS]
 Random Read 4KB (QD=32) : 151.645 MB/s [ 37022.8 IOPS]
 Random Write 4KB (QD=32) : 167.086 MB/s [ 40792.5 IOPS]

 Test : 4000 MB [D: 2.0% (16.2/800.0 GB)] (x5)
 Date : 2014/09/23 0:24:10
 OS : Windows Server 2012 R2 Datacenter (Full installation) [6.3 Build 9600] (x64)

What’s it really mean? It means that the 7,000 IOPS number reported was probably for 4KB random writes. It’s hardly representative of SQL Server workloads, but we also can see what kind of numbers the drives will pull under significant load.

Comparing AWS and Azure Performance

AWS offers an instance called the r3.4xlarge. It comes with 16 cores and 122GB of memory. The AWS instance type is about the same as the D14 (16 cores and 112GB of memory). The D14 is $2.611 / hour. The AWS instance is $1.944 / hour.

All prices include Windows licensing.

So far, the Azure D-grade instance costs 70 cents more per hour for 4.8GHz fewer clock cycles and 10GB less memory. Not to mention the computational differences between the current generation of CPU and what Azure is running.

Surely the SSD must be amazing…

Not so fast. Literally.

Some AWS local SSD benchmarks have reported numbers as high as 20,000 16KB IOPS for random write and 30,000 16KB IOPS for sequential read. Sure, the AWS instance only has a 320GB disk, but it’s capable of performing 5,000 64KB IOPS compared to the 440 IOPS (I rounded up to be generous) that Azure supplies.

In my testing, the AWS local SSD beat out the Azure SSD on random I/O by a reasonable margin:

A reasonable margin (or 100MB/s faster)

How about those IOPS?

CrystalDiskMark 3.0.3 x64 (C) 2007-2013 hiyohiyo
Crystal Dew World :
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

Sequential Read : 404.856 MB/s
Sequential Write : 350.255 MB/s
Random Read 512KB : 348.770 MB/s
Random Write 512KB : 349.176 MB/s
Random Read 4KB (QD=1) : 21.337 MB/s [ 5209.3 IOPS]
Random Write 4KB (QD=1) : 38.448 MB/s [ 9386.7 IOPS]
Random Read 4KB (QD=32) : 261.320 MB/s [ 63798.8 IOPS]
Random Write 4KB (QD=32) : 237.201 MB/s [ 57910.4 IOPS]

Test : 4000 MB [Z: 0.0% (0.1/300.0 GB)] (x5)
Date : 2014/09/23 1:05:22
OS : Windows Server 2012 R2 Server Standard (full installation) [6.3 Build 9600] (x64)

So… First – Azure offers really good local SSD performance if you decide to purchase the entire instance. Using a D14 instance type is a reasonable expectation for customers deploying SQL Server – SQL Server is a power hungry monster and it deserves to be fed.

Despite their truth, the Azure numbers aren’t all they’re cracked up to be. Here’s how it breaks down:

Cost: 34% more expensive
Sequential Reads: 74% faster
Sequential Writes: 12.5% faster
Random Reads: 42% slower/fewer IOPS
Random Writes: 30% slower/fewer IOPS
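
If you want to check the breakdown yourself, here is a small Python sketch that recomputes it from the prices and CrystalDiskMark figures quoted in this post. I'm assuming the random read and write comparisons use the 4KB QD=32 IOPS results, since those are the rows that reproduce the percentages; the dictionary keys are just labels picked for this example.

```python
# Figures copied from the pricing and CrystalDiskMark output in this post.
# Random I/O rows are assumed to be the 4KB QD=32 results.
azure = {"price": 2.611, "seq_read": 705.103, "seq_write": 394.053,
         "rand_read_iops": 37022.8, "rand_write_iops": 40792.5}
aws = {"price": 1.944, "seq_read": 404.856, "seq_write": 350.255,
       "rand_read_iops": 63798.8, "rand_write_iops": 57910.4}


def percent_difference(azure_value, aws_value):
    """Percent by which the Azure figure is above (+) or below (-) AWS."""
    return (azure_value / aws_value - 1) * 100


for metric in azure:
    print(f"{metric}: {percent_difference(azure[metric], aws[metric]):+.1f}%")
```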

Azure has a history of mediocre performance, but it’s well-documented mediocre performance. Azure persistent storage currently maxes out at 500 no-unit-given IOPS per disk (compared to AWS’s 4,000 256KB IOPS for EBS volumes), but these limits are well-documented.

The Bottom Line

Not all clouds are created equal and 60% more doesn’t mean that it’s any better than it was before. It’s up to you, dear reader, to determine what 60% faster means and how that applies to your environment. For companies dipping their toes in the cloud waters, be very wary of the new, improved Azure performance. You may find that you’re deploying far more VMs than you thought, just to handle the same workload.

Getting Started with Oracle

Let’s assume you want to get started with Oracle. Maybe your employer is switching to Oracle, maybe you just want a career change. Where do you go to get started?


There’s no need to feel lost.

Getting the Database

You can get a hold of the Oracle database in two main ways – a VM or installing it yourself. Using a VM is definitely the easiest way to get started. Oracle have provided an Oracle VM VirtualBox image that you can install. If you’re not familiar with VirtualBox, that’s okay; Oracle has set up instructions that will get you up and running quickly.

What if you want to install Oracle yourself?

You can get started with Oracle Express Edition. Hit that link and scroll all the way to the bottom. You can download Oracle Express Edition 11g Release 2. 11gR2 is the previous release of Oracle, but it’s good for learning basic Oracle concepts and you’ll find a lot of people are happily running Oracle 11gR2 in production.

If you want to be on the latest and greatest version of Oracle, you’ll need to download a full edition of Oracle. Even though there’s no Developer Edition of Oracle, there are five editions available to choose from. Personal Edition contains most of the features of Oracle Enterprise Edition and can be purchased from the Oracle store. If you want practice with complex DBA tasks, you’ll want to use Enterprise Edition. Otherwise, Personal Edition is the right choice.

You can also download and install the binaries directly from the Oracle database download page and run a full copy of Oracle while you evaluate the software. To the best of my knowledge, it’s only servers that are part of the development-production cycle that need to be fully licensed.

If you’re even lazier, you can spin up an instance of Oracle in one of many different clouds. Both Microsoft Azure and Amazon Web Services have a variety of different Oracle database configurations available for you to choose from.

Finding Exercises

Some people are self-directed, others prefer guided learning. I find that I’m in the second camp until I develop some skills. If you need to get started quickly, guided labs are a great way to ramp up your skills.

Oracle has created a huge amount of content about the Oracle database. The Oracle Documentation Library is the Oracle equivalent of TechNet. In addition to product documentation, ODL contains several courses – the 2 Day DBA is a good place to get started. From there you can head off into various tuning or development courses or even explore on your own.

Wrapping Up

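It’s easy to get started with Oracle. You can:

  • Download Oracle’s VirtualBox image and run the database in a VM.
  • Install Oracle yourself, using Express Edition or a full edition.
  • Spin up an Oracle instance in a cloud such as Microsoft Azure or Amazon Web Services.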

Once you’re set up, training is available through the Two Day DBA course, but there’s a wealth of information in the Oracle Documentation Library. A summary of training options is also available through the Oracle Learning Library.

Oracle Backup Basics for SQL Server DBAs [Video]

To get ready for Tuesday’s webcast, here’s what you have to do:

  1. Watch the video below, but watch it today (or over the long weekend). There will be no live presentation this week and we won’t be rehashing all of the material in the video.
  2. Write down your questions or comments. (You don’t have to do this, but it’ll make it more fun.)
  3. Attend the live webcast on Tuesday at the usual time (11:30AM Central). Register here.
  4. During the first 10 minutes of the webcast, we’ll give away a prize. The catch is that you have to be there to win.

The live discussion of the video and Q&A won’t be recorded and published, and you also need to be present to win the prize. See you on Tuesday!

Monitoring Oracle with Statspack

At some point, you’re going to need to know what’s wrong with your Oracle instance. While there are a lot of monitoring tools around, there’s always some reason why third party monitoring tools can’t be installed. Oracle has shipped with something called Statspack that provides DBAs with some ability to monitor their Oracle instance.

Statspack: It's like an eye exam for Oracle

What Is Oracle Statspack?

Statspack is a set of tools for collecting performance data that Oracle began shipping with Oracle 8i. This isn’t a full monitoring framework, but it helps DBAs isolate poor performance within a time window. Once installed, Statspack can collect snapshots of Oracle performance. This will run on all editions of Oracle – there’s no requirement for Enterprise Edition or any Performance Pack.

Statspack does not set up any kind of regular schedule when it’s first configured. It’s up to you, the DBA, to figure out how often you need to be running Statspack. Since data has to be collected and then written somewhere, make sure you aren’t collecting data too frequently – you will be adding some load to the server.

Do I Need Special Access to Install Statspack?

Depending on how you look at it, either no special permissions are needed to install Statspack or else very high privileges are needed. Basically, you need to be able to connect to Oracle with sysdba privileges. Any DBA responsible for the Oracle instance should be able to install Statspack. The only thing that might cause some issue is if OS level access is needed for scheduling data collection.

Since Statspack was originally designed for Oracle 8i, there are some changes that need to be made if you are deploying on Oracle 12c. Take a look at the comments on Statspack Examples for help getting Statspack installed on Oracle 12c.

What Kind of Data Does Statspack Collect?

Statspack can collect a lot of information about Oracle. Users can define just how much data they want to collect. The documentation goes to great length to remind DBAs that collecting too much data can slow down the database server.

Statspack collects data based on several configurable SQL thresholds. You can see the thresholds in the perfstat.stats$statspack_parameter table. When a query passes at least one of these thresholds, performance data will be collected.

Multiple levels of data can be collected. Oracle defines five levels of performance data collection – 0, 5, 6, 7, 10.

  • Level 0: Basic performance statistics about locks, waits, buffer pool information, and general background information.
  • Level 5: All of Level 0 plus SQL statement level details like number of executions, reads, number of parses (compiles in SQL Server speak), and memory usage.
  • Level 6: Everything from Level 5 plus execution plans.
  • Level 7: Disk metrics for particular segments that cross a threshold.
  • Level 10: COLLECT ALL THE THINGS! Plus collect information about latching. Typically you shouldn’t be doing this unless someone at Oracle has suggested it. Or you really know what you’re doing.

This data gets stored in the Statspack tables whenever a snapshot is collected. Over time, these tables will grow so make sure that there’s enough space allocated for their tablespace or else purge out older data using the statspack.purge() function.

How Do I Use Statspack?

To collect data, either use the DBMS_JOB or Oracle Scheduler interface (depending on Oracle version) or use an operating system native task scheduler.

Once you have at least two snapshots you can report on the collected data by running $ORACLE_HOME/rdbms/admin/spreport.sql and supplying a start and end snapshot. Statspack is going to churn for a while and spit back a bunch of information. Since Statspack reports can be many thousands of lines long, spreport.sql will write to a file.

As you look through the file, you’ll find information about I/O, locking, waits, slowest queries running (but not which users/sessions are slow), and potentially a lot more, depending on how much information you’re collecting.

For the uninitiated, Oracle ships with a bunch of scripts installed in the server’s file system. These scripts can be invoked from inside your favorite SQL tool.

You thought this would be simple?

Limitations of Oracle Statspack

This isn’t a silver bullet, or even a bronze bullet. But it is a bullet for shooting trouble.

Statspack isn’t an automatic process. More sophisticated tools use an agent process to automatically start collecting data once they’re installed. Statspack is not that sophisticated. It requires manual configuration – a DBA needs to set up a schedule for Statspack collection and Statspack purging.

While Statspack reports on an entire server, things get a bit weird when you start bringing Oracle RAC and Oracle 12c Multitenant into the mix. With RAC, Statspack is only reporting on a single node of the cluster – to get full cluster statistics, you should look at other tooling. Statspack can also potentially cause problems on RAC that can lead to cluster instability. With Multitenant functionality, Statspack will report on the server as a whole, but you’ll have to alter the installation scripts to take full advantage of Statspack.

Another limitation of Statspack is the granularity of the data. Performance data is collected at various DBA-specified levels and at a DBA-specified interval – the DBA needs to have good knowledge of how load may vary across a day and schedule Statspack collection appropriately. Statspack metrics can also be skewed – long running events will be reported as occurring in the Statspack interval where the SQL finally finishes. If you are collecting data every 5 minutes and an I/O intensive task runs for thirty minutes, it may look like there’s a significant I/O load in a single 5 minute period.

It may require a practiced eye to correctly interpret the Statspack reports and avoid falsely attributing heavy load to a small time window.
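
To make that skew concrete, here is a rough Python sketch. It is a toy model built from the description above, not Statspack's actual collection logic: one function charges a task's I/O to the interval in which it finishes, the other spreads the same I/O evenly across the minutes the task really ran. The task timings and I/O count are hypothetical.

```python
from collections import defaultdict


def reported_by_completion(tasks, interval_min):
    """Charge each task's total I/O to the interval in which it finishes."""
    buckets = defaultdict(float)
    for start, end, io in tasks:
        buckets[(end // interval_min) * interval_min] += io
    return dict(buckets)


def actual_spread(tasks, interval_min):
    """Spread each task's I/O evenly, minute by minute, over its runtime."""
    buckets = defaultdict(float)
    for start, end, io in tasks:
        per_minute = io / (end - start)
        for minute in range(start, end):
            buckets[(minute // interval_min) * interval_min] += per_minute
    return dict(buckets)


# One hypothetical task: runs from minute 2 to minute 32 and does 3,000 reads.
tasks = [(2, 32, 3000)]

print(reported_by_completion(tasks, interval_min=5))  # everything in one bucket
print(actual_spread(tasks, interval_min=5))           # spread over seven buckets
```

In this model the report shows all 3,000 reads in the interval starting at minute 30, while the work was really spread across half an hour.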

Finally, these metrics can’t be tied back to a single session. It’s possible to see which piece of SQL is causing problems. Frequently that can be enough, but it may still be difficult to determine if it’s a problem on the whole or a problem with a single user’s session. Other tools, such as ASH and AWR, can be used to provide finer grained monitoring, depending on the licensing level of Oracle.

Summarizing Statspack

Oracle Statspack can provide good enough performance metrics for many common DBA tasks. By interpreting Statspack reports, a DBA can discover any number of things about the Oracle system they’re in charge of without having to use third party tooling or purchase additional features and options. This can be especially important for those with Oracle Standard Edition systems.

For more information, check out the ORA FAQ article about Statspack and Jonathan Lewis’s collection of Statspack examples.

Generating Identities

The only thing you ever need to use for database identity is an IDENTITY, right? Well, maybe. There are a lot of different options and they all have different pros and cons.

IDENTITY columns

The default way to identify objects in SQL Server is to use an INT or BIGINT column marked as an IDENTITY. This guarantees relatively sequential numbers, barring restarts and failed inserts. Using identity columns puts the responsibility for creating and maintaining object identity in the database.

SQL Server will cache IDENTITY values and generate a new batch of identity values whenever it runs out. Because identity values are cached in memory, using identity values can lead to jumps in the sequence after SQL Server is restarted. Since identities are cached in memory in large batches, they make it possible to rapidly insert data – as long as disks are fast enough.

Sequences


Sometimes the application needs more control over identity. SQL Server 2012 added sequences. A sequence, unlike an identity value, is a separate object in the database. Both application and database code can read from the sequence – multiple tables can share a sequence for an identity column or separate sequences can be created for each table.

Developers using a sequence can use the CACHE option to cache a specific number of sequence values in memory. Or, if the application should have minimal gaps in the sequence, the NO CACHE option should be used.
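
Here is a toy Python model of why cached values lead to gaps after a restart. It is only a sketch of the general idea; SQL Server's real cache sizes and persistence mechanics are internal, and the `CachedNumberGenerator` class and its cache size of 10 are hypothetical.

```python
class CachedNumberGenerator:
    """Toy model of a cached identity or sequence allocator.

    A block of `cache_size` values is reserved by persisting a high-water
    mark, then values are handed out from memory. A restart loses whatever
    was left in memory, so numbering resumes after the persisted mark.
    """

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.persisted_high_water = 0  # survives a restart
        self.next_value = 1            # lives in memory only

    def next_id(self):
        if self.next_value > self.persisted_high_water:
            # Cache is empty: reserve another block of values.
            self.persisted_high_water += self.cache_size
        value = self.next_value
        self.next_value += 1
        return value

    def restart(self):
        # Unused cached values are lost; resume after the persisted mark.
        self.next_value = self.persisted_high_water + 1


cached = CachedNumberGenerator(cache_size=10)
before = [cached.next_id() for _ in range(3)]
cached.restart()
print(before, cached.next_id())  # [1, 2, 3] 11
```

Setting `cache_size=1` in this model behaves like an uncached sequence: every value is persisted as it is handed out, so a restart leaves no gap, at the cost of more bookkeeping per value.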

The Problem with Sequential Identities

Both IDENTITY and SEQUENCE values keep identity generation squarely in the database and, by using integral values, they keep the value narrow.

You can run into problems with sequential inserts on very busy systems – this can lead to latch contention on the trailing pages of the clustered index. This issue can be resolved by spreading inserts across the table by using a GUID or some other semi-random clustering key. Admittedly, most systems are never going to run into this problem.

GUIDs for Object Identity

Some developers use GUIDs as a way of managing object identity. Although database administrators balk at this, there are good reasons to use GUIDs for object identity.

GUIDs let the application generate object identity. By moving object identity out to the application layer, users can do work in memory and avoid multiple round trips to the database until they’re ready to save the entire set of data. This technique gives tremendous flexibility to application developers and users.

There’s one other thing that a well designed application gets from this technique – independence from the database. An application that generates its own identity values doesn’t need the database to be online 24/7; as long as some other system is available to accept writes in lieu of the database, the application can still function.

Using GUIDs for object identity does have some issues. For starters, GUIDs are much wider than integral data types – 16 bytes vs 4 bytes (INT) or 8 bytes (BIGINT). This is a non-issue for a single row or even for a small database, but at significant scale this can add a lot of data to the database. The other issue is that many techniques for generating sequential GUIDs in the application (see NHibernate’s GuidCombGenerator) can still run into GUID collisions.
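To put rough numbers on the width problem, here's a back-of-the-envelope calculation. It assumes the identity column is also the clustering key, so every nonclustered index carries its own copy of it. The row count and index count are invented, and the result ignores page overhead and compression, so treat it as a lower bound rather than a sizing tool.

```python
INT_BYTES, BIGINT_BYTES, GUID_BYTES = 4, 8, 16


def key_bytes(row_count, key_width, nonclustered_indexes=0):
    """Bytes spent storing the key alone: once in the clustered index,
    plus one copy per nonclustered index (assumes the key is the clustering key)."""
    copies = 1 + nonclustered_indexes
    return row_count * key_width * copies


if __name__ == "__main__":
    rows = 500_000_000  # hypothetical table size
    indexes = 4         # hypothetical number of nonclustered indexes
    for name, width in (("INT", INT_BYTES), ("BIGINT", BIGINT_BYTES), ("GUID", GUID_BYTES)):
        gb = key_bytes(rows, width, indexes) / 1_000_000_000
        print(f"{name}: {gb:.1f} GB of key data")
```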

Integral Generators

What if you could get the best of both worlds? Applications generating unique identities that are also sequential?

The point of identity generation is to abstract away some portion of identity from data attributes and provide an independent surrogate value. GUIDs can provide this, but they aren’t the perfect solution. Identity generators like flake or rustflakes promise roughly sequential identity values that are generated in the application layer and are unique across multiple processes or servers.
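To make the idea concrete, here's a sketch of a generator in this style. It is not the actual format used by flake or rustflakes; it borrows the 64-bit layout popularized by Twitter's Snowflake (41 bits of milliseconds, 10 bits of worker id, 12 bits of sequence) so the result fits in a BIGINT. The class name, the custom epoch, and the injectable clock are assumptions made for the example.

```python
import threading
import time


class FlakeLikeGenerator:
    """Roughly sequential 64-bit ids: timestamp | worker id | sequence."""

    EPOCH_MS = 1_400_000_000_000  # hypothetical custom epoch, in milliseconds
    WORKER_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, worker_id, clock=None):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        # The clock is injectable so the generator can be tested deterministically.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            # Never go backwards, even if the wall clock does.
            now = max(self._clock(), self._last_ms)
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & ((1 << self.SEQUENCE_BITS) - 1)
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: borrow the next one.
                    # (A production generator would more likely wait for the clock.)
                    now += 1
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - self.EPOCH_MS
            return (
                (timestamp << (self.WORKER_BITS + self.SEQUENCE_BITS))
                | (self.worker_id << self.SEQUENCE_BITS)
                | self._sequence
            )


if __name__ == "__main__":
    gen = FlakeLikeGenerator(worker_id=7)
    print([gen.next_id() for _ in range(3)])
```

Each process or server needs its own worker id. Uniqueness across machines depends on that assignment, which is exactly the extra piece developers have to manage.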

The problem with an external identity generator is that it is an extra piece of code that developers need to manage. External dependencies carry some risk, but these are relatively safe items that require very little effort to implement and maintain.

The Solution

There’s no right solution; there’s only a solution that works for you. You may even use each solution at different points in the lifecycle of the same product. It’s important, though, for developers and DBAs to be aware of how identity is currently being handled, the issues that can arise from the current solution, and ideas of how to handle it going forward.

Why Archive Data?

The data story so far

Meet Margot. Margot is an application developer who works for a small company. Margot’s application collects and generates a lot of data from users including their interactions with the site, emails and texts that they send, and user submitted forms. Data is never deleted from the database, but only a few administrative users need to query historical data.

The database has grown considerably because of this historical data – the production database is around 90GB but only 12GB or so is actively queried. The remaining data is a record of user activity, emails, text messages, and previous versions of user data.

Margot is faced with an important decision – how should she deal with this increase in data? Data can’t be deleted, there isn’t budget to upgrade to SQL Server Enterprise Edition and use table partitioning, and there’s a push to move to a cloud service to eliminate some operational difficulties.

Using Partitioned Views to Archive Data

A Partitioned View

One option that Margot has read about is “partitioned views” – this is a method where data is split into two or more tables with a view over the top. The view is used to provide easy access to all of the information in the database. Storing data across many tables means DBAs can store data in many different ways – e.g. compressed tables or filegroups and tiered storage.
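The shape of a partitioned view can be sketched in a few lines. This example uses Python's built-in sqlite3 module as a stand-in so it runs anywhere; the table names, columns, and cutoff date are invented. In SQL Server the CHECK constraints on the partitioning column are what allow the optimizer to skip tables that can't contain matching rows. SQLite only enforces the constraints, so this shows the structure, not the performance behavior.

```python
import sqlite3


def build_partitioned_view(conn):
    """Create two member tables split by date, with one view over both."""
    conn.executescript(
        """
        CREATE TABLE messages_current (
            id      INTEGER PRIMARY KEY,
            sent_on TEXT NOT NULL CHECK (sent_on >= '2014-01-01'),
            body    TEXT NOT NULL
        );
        CREATE TABLE messages_archive (
            id      INTEGER PRIMARY KEY,
            sent_on TEXT NOT NULL CHECK (sent_on < '2014-01-01'),
            body    TEXT NOT NULL
        );
        -- Queries go through the view and see every row, wherever it lives.
        CREATE VIEW messages_all AS
            SELECT id, sent_on, body FROM messages_current
            UNION ALL
            SELECT id, sent_on, body FROM messages_archive;
        """
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_partitioned_view(conn)
    conn.execute("INSERT INTO messages_current VALUES (2, '2014-03-01', 'new')")
    conn.execute("INSERT INTO messages_archive VALUES (1, '2012-07-15', 'old')")
    print(conn.execute("SELECT * FROM messages_all ORDER BY sent_on").fetchall())
```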

There’s a downside to this approach – all of the data is still in one database. Any HA solutions applied to the live portion of the data set will have to be applied to the entire data set. This could lead to a significant cost increase in a hosted/cloud scenario.

Archiving Data with a Historical Database

Archive this!

The second thing that sprang to mind was creating a separate archival database. Old data is copied into the archival database by scheduled jobs. When users need to run historical reports, the queries hit the archival database. When users need to run current reports, queries are directed to the current application database.
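A scheduled archival job along these lines might look like the sketch below. It again uses sqlite3 with two separate connections to stand in for the live and archival databases. The messages table, its columns, and the decision to delete rows from the live database after copying them are assumptions; the copy happens first and is safe to repeat, so a job that fails halfway can simply be run again.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical table, identical in both databases.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY, sent_on TEXT NOT NULL, body TEXT NOT NULL)"
    )


def archive_old_rows(live, archive, cutoff):
    """Copy rows older than cutoff into the archive, then remove them from live."""
    rows = live.execute(
        "SELECT id, sent_on, body FROM messages WHERE sent_on < ?", (cutoff,)
    ).fetchall()
    if not rows:
        return 0
    with archive:  # commits on success, rolls back on error
        # OR IGNORE skips rows already copied by an earlier, interrupted run.
        archive.executemany(
            "INSERT OR IGNORE INTO messages (id, sent_on, body) VALUES (?, ?, ?)", rows
        )
    with live:  # only delete once the archive copy has been committed
        live.executemany("DELETE FROM messages WHERE id = ?", [(r[0],) for r in rows])
    return len(rows)


if __name__ == "__main__":
    live, archive = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    create_schema(live)
    create_schema(archive)
    live.executemany(
        "INSERT INTO messages VALUES (?, ?, ?)",
        [(1, "2012-05-01", "old"), (2, "2014-06-01", "new")],
    )
    print(archive_old_rows(live, archive, "2014-01-01"), "row(s) archived")
```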

Margot immediately noticed one problem – what happens when a user needs to query a combination of historical and current data? She’s not sure if the users are willing to accept limited reporting functionality.

Archiving Data with Separate Data Stores

One active database. One archival octopus.

A third option that Margot considered was creating a separate database for the data that needed to be kept forever. Current data would be written to both the live database and the historical database. Any data that didn’t ever need to be in the current database (email or SMS history) would only be written to the historical database.
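The write path for this option boils down to a routing decision. In this sketch the two stores are plain Python lists standing in for the two databases, and the set of archive-only record kinds is invented. A real implementation would also need to decide what happens when one of the two writes fails.

```python
# Hypothetical record kinds that never need to live in the current database.
ARCHIVE_ONLY_KINDS = {"email", "sms"}


def route_write(kind, record, live_store, history_store):
    """Write everything to history; write to live only if it is current data."""
    history_store.append((kind, record))
    if kind not in ARCHIVE_ONLY_KINDS:
        live_store.append((kind, record))


if __name__ == "__main__":
    live, history = [], []
    route_write("profile", {"user": 42, "name": "Margot"}, live, history)
    route_write("email", {"user": 42, "subject": "Welcome"}, live, history)
    print(len(live), "live row(s),", len(history), "history row(s)")
```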

Although this makes some aspects of querying more complex (how could row-level security from the primary database be applied to the historical database?), Margot is confident that it solves the majority of the problems her team is facing.

This solution would require application changes to make querying work, but Margot and her team thought it was the most flexible solution for their current efforts: both databases can be managed and tuned separately, plus the primary database remains small.

Other Ideas?

Not every database needs to scale in the same way. What ideas do you have to solve this problem?