I know. You, dear reader, saw that title and you came in here because you’re furious. You want foreign key relationships configured in all of your tables to prevent bad data from getting in.
But you gotta make sure to index them, too.
Let’s take the Stack Overflow database – I’m using the 50GB Stack Overflow 2013 version for this post. I’m going to start with no nonclustered indexes whatsoever, and then I’m going to add in foreign keys between the Users table and several tables where users add content, like Badges, Comments, Posts, and Votes:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
/* Implement foreign keys: */ ALTER TABLE dbo.Badges WITH NOCHECK ADD CONSTRAINT fk_badges_users_id FOREIGN KEY (UserId) REFERENCES dbo.Users(Id); GO ALTER TABLE dbo.Comments WITH NOCHECK ADD CONSTRAINT fk_comments_users_id FOREIGN KEY (UserId) REFERENCES dbo.Users(Id); GO ALTER TABLE dbo.Posts WITH NOCHECK ADD CONSTRAINT fk_posts_users_id FOREIGN KEY (OwnerUserId) REFERENCES dbo.Users(Id); GO ALTER TABLE dbo.Votes WITH NOCHECK ADD CONSTRAINT fk_votes_users_id FOREIGN KEY (UserId) REFERENCES dbo.Users(Id); GO |
Now, let’s say my application wants to delete a specific user, id # 26837. First, the app starts by deleting all of the rows in the content tables for this user:
1 2 3 4 5 6 |
/* Do our prep work to delete rows in all the child tables: */ DELETE dbo.Badges WHERE UserId = 26837; DELETE dbo.Comments WHERE UserId = 26837; DELETE dbo.Posts WHERE OwnerUserId = 26837; DELETE dbo.Votes WHERE UserId = 26837; GO |
That does table scans, and let’s say we’re okay with that for some reason because our application rarely does deletes. Now that all the content tables are gone, let’s go back and delete the User row:
1 2 |
DELETE dbo.Users WHERE Id = 26837; GO |
It’s just deleting one row, right? How bad can it be?
Terrible, as it turns out. SQL Server wants to double-check all of the content tables to make sure User 26837 doesn’t own any Badges, Comments, Posts, or Views. We do a ton of logical reads, too:
The first solution you probably think of: index all of the foreign key columns:
1 2 3 4 |
CREATE INDEX UserId ON dbo.Badges(UserId); CREATE INDEX UserId ON dbo.Comments(UserId); CREATE INDEX OwnerUserId ON dbo.Posts(OwnerUserId); CREATE INDEX UserId ON dbo.Votes(UserId); |
That makes the deletes go way faster, but it comes at a few costs: slower inserts & updates to maintain those indexes, a 3% larger database size in this case, 3% longer maintenance jobs, etc.
At one recent engagement, we simply couldn’t use that solution. They already had serious performance problems during inserts: their hardware couldn’t handle any additional slowdowns during inserts & updates, and they already had extensive blocking & writing problems. Other solution options included:
- Queueing the deletes for later – in this Stack Overflow example, say the application doesn’t directly delete the Users rows, and instead, adds rows to a queue table to be handled later. During a maintenance window, we could do the deletes in batches, dropping groups of users across the tables with only one pass.
- Drop the foreign keys – at this particular client, the keys were configured with NOCHECK anyway due to past bad data, and there wasn’t a way we could fix that quickly enough. In that case, dropping the foreign keys made the deletes instantly faster, which was part of the overall solution in getting enough pain relief to go through their busy season.
Do I wish every database had perfect data, perfect foreign keys, perfect indexes, and perfect hardware to sustain all of the load? Absolutely. But in real life, constraints aren’t black and white: they’re fifty shades of … oh, you know.