Do foreign keys hurt the performance of inserts? Okay, we all know that foreign keys do require some work, but the crazy people of the internet have wildly varying guesses as to just how much work is involved. Estimates varied from “barely any” to “as much as 50% slower”.
I figured that nobody else was going to do the work, so I might as well do it.
I created two tables – the parent table has an `int` primary key and a fixed-width `filler` column. I varied the size of the `filler` column across multiple test runs, just to make sure I wasn’t biasing the test by using an artificially large parent table. The testing happened with a parent table that had 2 rows per page, 4 rows per page, and 8 rows per page.
You can find the test code in fk.sql.
The test code inserts 1,000,000 rows in batches of 5,000 rows. Although this isn’t strictly realistic, it’s better than timing a single batch of 1,000,000 inserts.
What happens during the test?
It turns out that no matter what the page size, it takes approximately 40ms to insert 5,000 rows with the foreign key in place. When we remove the foreign key, inserting 5,000 rows takes ~26ms.
Although the difference looks big at 5,000 row batch sizes, each insert is taking, on average, 0.0072ms with the foreign key and 0.0052ms without the foreign key. These differences are hardly worth noting. And, let’s face it, 2 microseconds is a pretty minor price to pay for data integrity.
I ran the test one last time with a batch size of 1 row, just to see how much difference there was in insert performance.
When it comes down to single row performance, the difference is negligible. We’re spending all of our time waiting for other parts of the system.
It’s not terribly real, but it’s real enough to say “Yes, foreign keys have overhead” and follow that up with “No, you shouldn’t care, just add the keys.” The tests were done in VMware, and the hardware in question is a mid-2012 Retina MacBook Pro.
In the real world, we’ve got a lot more than inserts happening, but it’s worth quantifying the cost of a foreign key lookup and realizing that it’s worth having foreign keys.
Kendra says: Wanna know if your foreign keys are ‘trusted’? Check out our script here.
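If you want a quick look before grabbing the script, a minimal check against the `sys.foreign_keys` catalog view shows which constraints SQL Server won’t use for query optimization:

```sql
-- Foreign keys marked not trusted (or disabled) won't be used
-- by the optimizer to eliminate joins or simplify plans.
SELECT name AS foreign_key_name,
       OBJECT_NAME(parent_object_id) AS table_name,
       is_not_trusted,
       is_disabled
FROM   sys.foreign_keys
WHERE  is_not_trusted = 1
   OR  is_disabled = 1;
```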
Developers – how much operations work are you doing?
DBAs – how much development do you do?
For most of us, we stick to our assigned job role. Developers write code and then throw it over the wall for the ops team to put in place. If there’s a problem, there is inevitably finger pointing. Eventually a developer gets over to an operations person’s cube and the problem gets solved.
It’s rare that we see any cross disciplinary skill sharing.
In On the Origin of Species, Darwin notes that the finches in the Galapagos had become very specialized, depending on the island where they were found. The finches were so specialized that Darwin originally miscategorized a subspecies of finch as a wren. He goes so far as to say:
Seeing this gradation and diversity of structure in one small, intimately related group of birds, one might really fancy that from an original paucity of birds in this archipelago, one species had been taken and modified for different ends
Darwin’s finches all had very different beak sizes and shapes – each finch’s beak had adapted to a different food source. Even though they’re all finches, they worked in very different environments.
What about you? How specialized are you?
I consider myself a developer – I’ve spent most of my career writing applications. Some of those applications focus largely on SQL Server. But I can also configure HA/DR solutions, set up hardware, and plan storage deployments.
One of the problems with overspecialization is that it becomes difficult to survive if your environment changes.
I’m a big fan of mixing job roles. Developers should provide operational support for their features. Operations staff should take part in developing tools or even features for the application. Having a well-rounded set of skills makes it easier to survive when your environment changes.
Every now and then, you need to figure out which version of SQL Server you’re using without knowing in advance. This might happen in a script (like sp_BlitzCache) or you might be using it in a migration.
The first thing that comes to mind is `@@VERSION`. On my SQL Server 2014 RTM installation, this returns an ugly string. Unless you like parsing multiline strings in T-SQL, you’re going to be left using `LIKE` on a hideous string like this one:
```
Microsoft SQL Server 2014 - 12.0.2000.8 (X64)
    Feb 20 2014 20:04:26
    Copyright (c) Microsoft Corporation
    Developer Edition (64-bit) on Windows NT 6.3 <X64> (Build 9600: ) (Hypervisor)
```
Imagine if you needed to get the build number of SQL Server out of there to validate against a list of builds to see if you had the current version of SQL Server installed. Pulling the build out of the major.minor.build.revision string isn’t easy when you first have to pull that string out of a bigger string.
There’s a better way than mucking around in `@@VERSION`. The `SERVERPROPERTY` built-in function has a lot of great functionality. We’re only going to concern ourselves with the `ProductVersion` portion of `SERVERPROPERTY`. How do we use it?
```sql
SELECT CAST(SERVERPROPERTY('ProductVersion') AS nvarchar(128));
```
That should return nothing but the major.minor.build.revision for the current version of SQL Server. In my case, this returns `12.0.2000.8`.
Rather than parse the values myself, I created a temporary table with computed columns:
```sql
CREATE TABLE #checkversion (
    version nvarchar(128),
    common_version AS SUBSTRING(version, 1, CHARINDEX('.', version) + 1),
    major AS PARSENAME(CONVERT(varchar(32), version), 4),
    minor AS PARSENAME(CONVERT(varchar(32), version), 3),
    build AS PARSENAME(CONVERT(varchar(32), version), 2),
    revision AS PARSENAME(CONVERT(varchar(32), version), 1)
);

INSERT INTO #checkversion (version)
SELECT CAST(SERVERPROPERTY('ProductVersion') AS nvarchar(128));
```
The `common_version` column will display the version of SQL Server as a floating point number – `12.0` for SQL Server 2014 or, in the case of SQL Server 2008R2, `10.5`. When you’re targeting scripts for specific families of SQL Server, knowing the high-level version can make scripting a lot easier.
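Once the computed columns are in place, branching a script on the major version is straightforward. This is just a sketch – the threshold and the messages are placeholders:

```sql
-- Branch script logic on the high-level version number.
DECLARE @common_version numeric(10, 2);

SELECT @common_version = CAST(common_version AS numeric(10, 2))
FROM   #checkversion;

IF @common_version >= 11  -- SQL Server 2012 or later
    PRINT 'Safe to use 2012+ features like LAG and LEAD.';
ELSE
    PRINT 'Older version - stick to 2008-compatible syntax.';
```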
What’s the final output look like for me?
That’s really all there is to finding and parsing the SQL Server version. Just a simple insert into a temporary table and a few computed columns and you’re ready to go.
We’ve all had to move a database. For databases above a small size, this is a painful process. While there are a number of ways that you can move a database without downtime, they all involve a lot of extra work. What if there were something easier?
For most people, moving a database is a one time thing – you’re moving to a new server. That’s pretty much it. I spent 5 minutes staring at my ceiling trying to figure out other reasons that most people might move a database from one SQL Server to another.
For some people, though, databases could move on a regular basis. Think about it like this – you host a software as a service (SaaS) application. You’ve got one database per relative unit of work (customer, forum, cat picture collection). These databases can sit idle for months before, suddenly, there’s tremendous activity.
Normally, you wouldn’t make each SQL Server capable of handling peak load for any section of your SaaS. Instead, you would probably have a few servers ready to handle the worst load and move your busiest database over to those servers.
Typically, this is going to involve a small outage and setting up something like database mirroring or log shipping in order to move the database with minimal down time.
Microsoft has supported housing SQL Server databases on a file share for a while now. If you’re not familiar with the idea, I recommend checking out the MSDN article Install SQL Server with SMB Fileshare as a Storage Option.
Moving a database with SMB is straightforward – detach the database from the old server, attach it to the new server, and bring the database online. For most of us, this quick change is much preferred to other methods of moving the database.
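As a sketch, the move looks something like this – the database name and UNC paths are made up for illustration:

```sql
-- On the old server: detach the database, leaving the files on the SMB share.
EXEC sp_detach_db @dbname = N'SalesDB';

-- On the new server: attach the same files from the share.
CREATE DATABASE SalesDB
ON (FILENAME = N'\\fileserver\sqldata\SalesDB.mdf'),
   (FILENAME = N'\\fileserver\sqldata\SalesDB_log.ldf')
FOR ATTACH;
```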
Although SMB file shares are a relatively new storage option for SQL Server, they’re worth considering for new SQL Servers. SMB file shares make it easy to move databases between SQL Servers. If you want to learn more about using SMB check out the MSDN article on using SMB and SQL Server.
Brent says: If your storage area network (SAN) is using older connectivity methods like 2Gb or 4Gb FC, and your Ethernet is 10Gb, you might even see a performance boost.
We’ve all had this problem before – the production server is significantly more powerful than the development server. You tune a query and you’re pretty sure you’ve made it faster, but there’s that doubt in the back of your mind. There are so many things that can change performance – memory, CPU count, disk performance – how can you tell that you’ve made a query better?
When you know you have a slow query that you need to tune, run it in production one last time. But, before you do that, gather CPU and IO metrics using `STATISTICS IO` and `STATISTICS TIME`:
```sql
SET STATISTICS IO, TIME ON;
EXEC dbo.MyAwfulQuery @slow = 'yes';
```
You’ll get a bunch of jibber jabber in the “Messages” tab that documents how much CPU time was spent compiling and running your query as well as information about logical IO. Save this off to a text file for now.
Go to your dev server. Tune your query. I’ll wait.
As you’re tuning your query, pay attention to the CPU time and logical IO numbers that you’re getting back from `STATISTICS IO` and `STATISTICS TIME`. These are a relative indication that things are getting better. If there are CPU differences between the servers, you’re likely to see different CPU numbers on that dev machine, but that’s OK. We’re looking to compare total logical IO and total CPU time in milliseconds.
If you want an easy route to compare your results, you can paste the output from `STATISTICS IO` and `STATISTICS TIME` into statisticsparser.com. This will go through the results and push the data into a nice table for your perusal.
Using `STATISTICS IO` and `STATISTICS TIME` to tune helps me figure out how well I’m doing compared to where I started from. Some queries move around a lot of data. In these cases, it can help to know how much faster dev is than production.
If you really want a reasonable comparison between dev and production, you can find some guidance in the Fast Track Data Warehouse Reference Guide. There’s a section of this document about measuring the MCR (Maximum Consumption Rate) and BCR (Base Consumption Rate) of your SQL Server. These numbers reflect 100% cache reads and 100% disk reads, respectively.
Once you have MCR and BCR, you can make effective predictions about how query performance could differ between the two. This gives you the power to predict, with some margin of error, how fast newly developed queries can run.
With `STATISTICS IO` and `STATISTICS TIME`, you can determine if a query is going to use fewer resources. The MCR and BCR give you relative numbers for how fast a server can process data. Between these measurements, you can predict how a query will perform between environments and what kind of impact your changes will have.
Brent says: as you get used to doing these measurements, you’ll also be able to tell if a new dev-only query is going to be fast enough in production, too. You’ll know how many logical reads is slow (or fast) in your environment.
Erik says: The hardest thing for me to ever get was commensurate data to test on in dev. If you don’t have it, get it. If you have it, don’t waste it.
Many databases have the ability to create automatically incremented numbers. In SQL Server, we can use an `IDENTITY` property on a column to get autoincrementing numbers. But what can we do in Oracle?
Sequences work in all recent versions and editions of Oracle. The default way to do this is pretty simple:
```sql
CREATE TABLE my_table (
    id NUMBER PRIMARY KEY,
    whatever VARCHAR2(30)
);

CREATE SEQUENCE my_table_seq;

CREATE OR REPLACE TRIGGER my_table_insert
BEFORE INSERT ON my_table
FOR EACH ROW
BEGIN
    SELECT my_table_seq.nextval INTO :new.id FROM dual;
END;

/* Or, if you hate triggers, you can make sure all inserts look like: */
INSERT INTO my_table (id, whatever)
VALUES (my_table_seq.nextval, 'Erik');
```
Now that SQL Server supports sequences, you could use sequences to help with portability between both systems, but there’s a gotcha – SQL Server’s next value syntax is different from Oracle’s. That means you’d need to use a trigger based solution if you didn’t want code to change.
What about identities? With SQL Server, we just mark a column as having an identity value. Magic! It’s done for us. Oracle is a commercial database – surely it has a magic option, right?
Oracle introduced identity columns with Oracle 12c. This lets developers keep using the code they’re used to. Check this out:
```sql
CREATE TABLE my_table (
    id NUMBER GENERATED AS IDENTITY PRIMARY KEY,
    whatever VARCHAR2(30)
);
```
This lets you keep using the same kind of inserts that you would normally use for SQL Server, with one exception. Oracle gives you some flexibility for generating identities – you can generate them `ALWAYS` (the default) or `BY DEFAULT` (when a value isn’t provided).
If you really want to mimic SQL Server’s identity behavior, including `SET IDENTITY_INSERT`, then you can create your table like this:
```sql
CREATE TABLE my_table (
    id NUMBER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    whatever VARCHAR2(30)
);
```
You can even go so far as to protect yourself from trying to insert `NULL` values into the table using the `ON NULL` clause:
```sql
CREATE TABLE my_table (
    id NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY PRIMARY KEY,
    whatever VARCHAR2(30)
);
```
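For comparison, here’s a rough sketch of the SQL Server side of the `BY DEFAULT` behavior – you flip `IDENTITY_INSERT` on when you want to supply your own value:

```sql
CREATE TABLE my_table (
    id int IDENTITY(1, 1) PRIMARY KEY,
    whatever varchar(30)
);

-- Normal insert: SQL Server generates the id for us.
INSERT INTO my_table (whatever) VALUES ('Erik');

-- Supplying our own id requires flipping IDENTITY_INSERT on first.
SET IDENTITY_INSERT my_table ON;
INSERT INTO my_table (id, whatever) VALUES (100, 'Brent');
SET IDENTITY_INSERT my_table OFF;
```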
Sequences and Identities in Oracle and SQL Server
If you want to make sure you can reuse as much code as possible between Oracle and SQL Server, identity columns are a safe bet. The vast majority of code won’t need to change and your application will be none the wiser.
Brent says: when people talk about porting their application from one database back end to another, this is a good example of how it’s so tricky. Sure, both databases have exactly the same functionality available, but even something as simple as an identity column is syntactically different between them.
Erik says: In case anyone’s wondering, it didn’t hurt getting inserted into an Oracle table. The bathrooms in Tablespace were very clean.
Raise your hand if you’re storing BLOBs in the database.
I get to say that a lot during our training. Every time I say it, nearly every hand in the room goes up. Some hands go up faster than others, but eventually nearly every hand is up.
It’s a design that happens far more often than it should, but it does happen.
People put binary data in the database because they need the data to be point in time consistent with the rest of the database. It’s not enough to save space in the database if you can’t recover the file to a moment in time.
Think about this scenario:
- 8:00AM – Brent saves a new contract.
- 9:00AM – Kendra modifies the contract and saves it.
- 9:30AM – Jeremiah modifies the contract and saves it.
If the contract is being stored inside the database, we can recover to any point in time and have the appropriate version of the document. It may not be the most current version of the contract, but it’s the version of the document that’s consistent with the rest of the database.
File systems are great. They do an excellent job of storing files and organizing them into folders. File systems don’t do a great job of being point in time consistent with a relational database. There’s no transaction log to help us roll back writes that are in flight.
It’s a lot of work to get a full database backup and a file system backup to be remotely close to the same point in time. Restoring a database to a point in time is easy. Restoring a file system to a point in time is close to impossible.
There’s a third option available – some kind of appliance that sits between database and the file system. The appliance should manage file metadata and provide all access to the files in the file system.
Commercial databases ship with features that sound similar. SQL Server has a `FILESTREAM` data type and Oracle has an `ORD` data type. Both of these types let the database interact with files in the file system. But they still have a problem – you’re stuck managing data through the database. Let’s be clear: this is not an appliance.
Content Addressable Storage (CAS) is a mature technology. The idea behind CAS is that a hardware device handles the metadata for a given incarnation of a file. A developer sends a file into the CAS appliance and the CAS appliance returns a pointer to the file. Whenever the file changes, a new copy is created and a new handle is returned to the developer. Files can’t be modified, so anything stored in the database can only point to the right version of the file.
We can combine this with a database pretty easily. Instead of storing a file path in the database, we store the handle that we get back from the CAS system.
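The schema change is small. Instead of a file path column, the table stores the opaque handle the CAS appliance returned – the table and column names here are hypothetical:

```sql
CREATE TABLE dbo.Contracts (
    contract_id int IDENTITY(1, 1) PRIMARY KEY,
    customer_id int NOT NULL,
    saved_at    datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
    -- Immutable pointer returned by the CAS appliance; never a raw file path.
    cas_handle  varchar(128) NOT NULL
);
```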
The main reason people store BLOBs in the database is so they can get blobs that are consistent with a point in time in the database. By using a storage device that cannot be modified (the CAS), we can make sure that the location we’ve stored in the database is always the right location – there’s no way to tamper with the files that we’re storing, so whatever gets stored in the database is correct.
There’s overhead to this approach – old data may never get cleared out. Typically, though, CAS systems store data on large, slow disks. There’s little need for the throughput that we use for a relational database store system. Do those cat pictures really need to be stored on RAID 10 SSDs? Moving BLOB storage outside of the relational database will free up resources for serving queries. Picking the right way to store your BLOB data will make it easier to scale your system.
Kendra says: Finding the right storage for large objects is a huge architectural decision that impacts performance and availability. Choose wisely!
Brent says: Want your SQL Server’s DBCCs and backups to run faster? This can help a lot.
Doug says: “It’s a lot of work to get a full database backup and a file system back up to be remotely close to the same point in time.” -> This is a major drawback that’s easily overlooked. Make sure everyone’s okay with that possibility when choosing the file system for BLOB data.
When we’re making changes to a database, it’s important to know which tables and indexes are in use. If we don’t need an index, maybe we can drop it. If nobody is using a table, do we still need it?
We can find out which tables and indexes are in use through two different mechanisms.
Of course Oracle tracks the tables in use. In this case, we can see this information in the view `V$SEGMENT_STATISTICS`. This view contains a lot of information about all of the different segments in your Oracle database. A segment is the set of extents (data blocks) allocated to a single database object. The `V$SEGMENT_STATISTICS` view needs to be pivoted to get the information we need – it contains one row for reads, one for writes, etc.
This basic query will let us see the volume of activity on different segments in the database:
```sql
SELECT vss.owner,
       vss.object_name,
       vss.subobject_name,
       vss.object_type,
       vss.tablespace_name,
       SUM(CASE statistic_name WHEN 'logical reads' THEN value ELSE 0 END
         + CASE statistic_name WHEN 'physical reads' THEN value ELSE 0 END) AS reads,
       SUM(CASE statistic_name WHEN 'logical reads' THEN value ELSE 0 END) AS logical_reads,
       SUM(CASE statistic_name WHEN 'physical reads' THEN value ELSE 0 END) AS physical_reads,
       SUM(CASE statistic_name WHEN 'segment scans' THEN value ELSE 0 END) AS segment_scans,
       SUM(CASE statistic_name WHEN 'physical writes' THEN value ELSE 0 END) AS writes
FROM   v$segment_statistics vss
WHERE  vss.owner NOT IN ('SYS', 'SYSTEM')
GROUP BY vss.owner, vss.object_name, vss.object_type,
         vss.subobject_name, vss.tablespace_name
ORDER BY reads DESC;
```
This approach will show us information about reads, writes, and segment scans (full table scans) since the dynamic view was last cleared. Typically, this is going to show us information since the database was last started.
The downside of `V$SEGMENT_STATISTICS` is that it’s cumulative across the lifetime of the instance. If you happen to have great database uptime, a table or index could show up in `V$SEGMENT_STATISTICS` even though it hasn’t been used in weeks or months.
In order to determine which queries are hitting which tables, we can start sampling the current waits in the system. `V$SESSION_WAIT` contains a bunch of information, including the cryptic `p1`, `p2`, and `p3` columns. For disk-related waits, we only need `p1` and `p2` – these refer to the file and block number that are a part of the wait. To see what’s going on, we can just do something like:
```sql
SELECT vs.username,
       vsw.wait_class,
       vsw.event AS wait_type,
       vsw.wait_time_micro / 1000 AS wait_time_ms,
       vsw.time_remaining_micro / 1000 AS time_remaining_ms,
       vsw.state,
       de.segment_name,
       de.segment_type,
       de.owner,
       de.tablespace_name
FROM   v$session_wait vsw
JOIN   v$session vs ON vsw.sid = vs.sid
LEFT JOIN dba_extents de
       ON  vsw.p1 = de.file_id
       AND vsw.p2 BETWEEN de.block_id AND (de.block_id + de.blocks)
WHERE  vsw.wait_class <> 'Idle'
AND    vs.username IS NOT NULL
ORDER BY wait_time_ms DESC;
```
This query will show us:
- Who’s running a query
- What their current wait is
- If it’s a disk wait, which tablespace, object, and object type is being waited on.
The problem with this approach, though, is that it will only catch the queries that are running at the moment you sample. With enough samples, this can be effective, but sampling the system is going to put a small load on it.
Active Session History
If you’re licensed for the Oracle Diagnostic Pack, you’ll have access to the Active Session History. If you aren’t licensed for the Diagnostic Pack, keep reading, but don’t run these queries on any of your Oracle systems – your Oracle auditors will love you, but your accountants won’t.
Active Session History samples data over time and lets us know which queries have been running inside a given time window. The upside of this is that we can dig into the data collected and figure out if the tables we’re interested in have been used inside a given time window.
```sql
SELECT du.username,
       s.sql_text,
       MAX(ash.sample_time) AS last_access,
       sp.object_owner,
       sp.object_name,
       sp.object_alias AS aliased_as,
       sp.object_type,
       COUNT(*) AS access_count
FROM   v$active_session_history ash
JOIN   v$sql s ON ash.force_matching_signature = s.force_matching_signature
LEFT JOIN v$sql_plan sp ON s.sql_id = sp.sql_id
JOIN   dba_users du ON ash.user_id = du.user_id
WHERE  ash.session_type = 'FOREGROUND'
AND    ash.sql_id IS NOT NULL
AND    sp.object_name IS NOT NULL
AND    ash.user_id <> 0
GROUP BY du.username, s.sql_text, sp.object_owner,
         sp.object_name, sp.object_alias, sp.object_type
ORDER BY 3 DESC;
```
You can use this query as a basis to help you isolate who is using which tables and how frequently the queries are being run. If you’re looking for one particular table or index, you can start filtering through the query plan operations to find the object in question.
Three Ways to Find the Truth
So, there you have it – three ways to find some version of the truth. Looking at `V$SEGMENT_STATISTICS` will let you understand the way tables have been used historically, but it won’t help you understand if a table has been used recently. `V$SESSION_WAIT` can be sampled to get a better idea of which tables and indexes are being used and who is using them. And if you need an accurate view, and you’re licensed for the Diagnostic Pack, you can use `V$ACTIVE_SESSION_HISTORY` to review how frequently a table or index has been used over time.
Just because it’s plain text, that doesn’t mean it has to be ugly. Join Jeremiah Peschka and learn the basics of formatting Oracle SQL*Plus output.
Kendra says: that ability to eliminate repeating values is pretty rad!
Common table expressions are a feature of SQL that lets a developer create a query that can be referenced multiple times. This feature gives developers another tool to add flexibility or just to simplify code.
Why Common Table Expression?
Why is it called a Common Table Expression (CTE)? Because that’s what the documentation calls it!
No, seriously, that’s it. Different databases have different terms for this feature. That’s what we call it in SQL Server.
What does a CTE do?
A CTE effectively creates a temporary view that a developer can reference multiple times in the underlying query. You can think of the CTE as if it were an inline view.
Here’s a slight re-write of a StackExchange query to find interesting and unanswered questions.
```sql
WITH unanswered AS (
    SELECT p.Id
    FROM   dbo.Posts p
    WHERE  (SELECT COUNT(*) FROM dbo.Posts a
            WHERE a.ParentId = p.Id AND a.Score > 0) = 0
    AND    CommunityOwnedDate IS NULL
    AND    ClosedDate IS NULL
    AND    ParentId IS NULL
    AND    AcceptedAnswerId IS NULL
)
SELECT TOP 2000
       unanswered.Id AS [Post Link],
       (SUM(pt.[Count]) / 10.0 + u.Reputation / 200.0 + p.Score * 100) AS Weight
FROM   unanswered
JOIN   dbo.Posts p ON unanswered.Id = p.Id
JOIN   PostTags pt ON pt.PostId = unanswered.Id
JOIN   Users u ON u.Id = p.OwnerUserId
GROUP BY unanswered.Id, u.Reputation, p.Score
ORDER BY Weight DESC;
```
The CTE – the `unanswered` block at the top of the query – effectively creates a temporary view that we can use throughout the rest of the query. You can also think of it in the same way that you’d think of a derived table (a join to a subquery).
CTEs make it possible to simplify code, express complex ideas more concisely, or just write code quickly without having to worry as much about structure. They’re a great feature of the SQL language that many people overlook.
If you want to learn more about CTEs, check out our upcoming Advanced Querying and Indexing class in both Chicago and Portland. We’ll cover the basics of CTEs, pitfalls, and performance tuning options.