In A Foreign Key, In A Foreign Table
Much of what people want from foreign keys, like referential integrity and join elimination, are only as guaranteed as much as SQL Server can trust your constraints (and even then…). The same goes for check constraints, too.
Thankfully, things like Primary Keys and Unique Constraints are sort of self-trusting, in that there’s really no way to add them without checking the underlying data. Sure, you can ignore dupe keys (if SQLblog is down again, try the Wayback Machine version), but that’s a different sort of trick.
This week, I’ve got 5 posts about foreign keys to look at various aspects of implementing them, and their behavior once they’re in place. If you think I missed something, or didn’t adequately cover it, hang tight. It’ll probably show up eventually.
Pretty much maybe.
Bookworm
When you read books about relational design — Louis Davidson has an excellent one — the importance of things like this are emphasized quite a bit. If you’re starting an app from scratch, you’d be wise to heed the advice in them, especially if it’s an OLTP app. If you take care of it here, you don’t have to worry about enforcing it in your data warehouse.
If you ever want to feel poor design pain, try adding, enforcing, or loading large amounts of data into tables with foreign keys. While you do that, start learning a new language.
But what if you’ve got a database that’s been around for a while. Can you trust the data? Can you trust that no developer ever removed or disabled a constraint to “fix a problem” or “add some custom data” and then forgot to turn them back on for a week or two?
Didn’t think so.
Keying Time Had Come
Continuing to use Stack Overflow as a demo data set, let’s look at if the newest dump is foreign key-able.
Before we get started, I just want to mention that I don’t expect it to be. As far as I can tell, if a user deletes their account, their actions are mostly preserved.
In order to test our data, we’d be wise to create some indexes on important columns.
1 2 3 4 5 6 |
CREATE INDEX ix_whatever ON dbo.Badges(UserId); CREATE INDEX ix_whatever ON dbo.Comments(UserId); CREATE INDEX ix_apathy ON dbo.Comments(PostId); CREATE INDEX ix_whatever ON dbo.Posts(OwnerUserId); CREATE INDEX ix_whatever ON dbo.Votes(UserId); CREATE INDEX ix_apathy ON dbo.Votes(PostId); |
Next, let’s see what kind of nonsense we’ve got to clean up. We’re going to run some selects to get a count of rows in our “action” tables (Badges, Comments, Posts, Votes) to see which ones have orphaned Users or Posts. Since it’s long and repetitive, I’m going to put these at the end of the post. For now, let’s just look at the results.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
/* Badges With No User ---------- 3 Comments With No User ---------- 337,603 Comments With No Post ---------- 5 Posts With No User ---------- 227,634 Votes With No User ---------- 49,346,043 Votes With No Post ---------- 8,109,424 */ |
Votes with no User sticks out, because that data is cleaned from the dump. If they published how people voted, it would be kind of terrible. Brent would find out what I really think about his Answers.
The rest are… Odd. I’m really not sure what Stack Overflow does to remove User or Post actions when they’ve been deleted.
IRL
You’ll probably know what processes do (or don’t do) this kind of clean up work, which might make deciding what to do with dirty data easier.
You have a few options that I can think of quickly:
- Delete it
- Associate it with dummy Ids
- Move it to separate tables for reference
I’ll give you $100 if you guess which one I picked before I tell you that I deleted it. The delete scripts are at the end, too.
You were so close.
Deciding what to do with it (or if you’re going to do anything with it at all) is a business decision. It’s their data. If they decide to follow you on your crazy quest to implement referential integrity, it’s your job as a data person to work with them on the best way to end up with data that complies with your Utopian vision. If they want to keep it in some form, you have to respect that.
If they’re cool with deleting it, well..
Be Kind, Batchwind
Just read this post from Michael J. Swart about batching modifications.
Excellent Adventures
I’m going to do some more of these posts, and talk about the difficulties with implementing foreign keys, cascading actions, indexing, and join elimination.
Thanks for reading!
- Foreign Keys Part 1: Setting Up Foreign Keys in Stack Overflow
- Foreign Keys Part 2: Common Foreign Key Errors
- Foreign Keys Part 3: Why Cascading Deletes Perform Slowly
- Foreign Keys Part 4: How to Index Foreign Keys
- Foreign Keys Part 5: How Join Elimination Makes Queries Faster
Scripts
Checking for bad data:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 |
/*Badges without a User*/ SELECT FORMAT(COUNT(*), 'N0') AS BNU FROM dbo.Badges AS b WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Users AS u WHERE b.UserId = u.Id ); /*Comments without a User*/ SELECT FORMAT(COUNT(*), 'N0') CNU FROM dbo.Comments AS c WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Users AS u WHERE c.UserId = u.Id ); /*Comments without a Post*/ SELECT FORMAT(COUNT(*), 'N0') CNP FROM dbo.Comments AS c WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Posts AS p WHERE c.PostId = p.Id ); /*Posts without a User*/ SELECT FORMAT(COUNT(*), 'N0') PNU FROM dbo.Posts AS p WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Users AS u WHERE p.OwnerUserId = u.Id ); /*Votes without a User*/ /*This is no good, because it gets cleaned...*/ SELECT FORMAT(COUNT(*), 'N0') VNU FROM dbo.Votes AS v WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Users AS u WHERE v.UserId = u.Id ) --AND v.UserId IS NOT NULL; -- Toggle, big difference /*Votes without a Post*/ SELECT FORMAT(COUNT(*), 'N0') VNP FROM dbo.Votes AS v WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Posts AS p WHERE v.PostId = p.Id ); |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 |
/*Badges without a User*/ DELETE b FROM dbo.Badges AS b WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Users AS u WHERE b.UserId = u.Id ); /*Comments without a User*/ DELETE c FROM dbo.Comments AS c WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Users AS u WHERE c.UserId = u.Id ); /*Comments without a Post*/ DELETE c FROM dbo.Comments AS c WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Posts AS p WHERE c.PostId = p.Id ); /*Posts without a User*/ DELETE p FROM dbo.Posts AS p WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Users AS u WHERE p.OwnerUserId = u.Id ); /*Votes without a User*/ DELETE v FROM dbo.Votes AS v WHERE v.UserId IS NOT NULL AND NOT EXISTS ( SELECT 1/0 FROM dbo.Users AS u WHERE v.UserId = u.Id ); GO /*Votes without a Post*/ DELETE v FROM dbo.Votes AS v WHERE NOT EXISTS ( SELECT 1/0 FROM dbo.Posts AS p WHERE v.PostId = p.Id ); |
10 Comments. Leave new
SELECT 1/0 looks bizarre. Is it a reason behind it or it’s just for fun?
Hi klepras,
I’m also intrigued by the use of SELECT 1/0. I did a little Googling and found some answers:
1.) This posting, https://www.brentozar.com/archive/2018/08/a-common-query-error/, includes the following comments:
Kevin Hill
August 14, 2018 8:28 am
This works…why does Select 1/0 work in the Exists, but not outside of it?
Erik Darling
August 14, 2018 8:34 am
You can put anything you want in the select list of exists or not exists. It doesn’t get selected.
2.) Do Variables Exist? (More of Erik and Brent’s humor)
https://www.brentozar.com/archive/2017/10/do-variables-exist/
Chatterbox
When I asked this stuff in the company chat, Brent chimed in with some queries that he thinks are funny, too.
IF EXISTS ( SELECT 1 / 0 FROM sys.databases )
SELECT ‘I can totally divide by zero’;
GO
IF EXISTS ( SELECT SUM(1 / 0) FROM sys.databases )
SELECT ‘You”re not my real dad’;
GO
IF EXISTS ( SELECT COUNT(1 / 0) FROM sys.databases )
SELECT ‘I know you are but what am I’;
GO
IF EXISTS ( SELECT AVG(1 / 0) FROM sys.databases )
SELECT ‘I know you are but what am I’;
GO
3.) Kenneth Fisher covers this topic as well:
There EXISTS a place where SELECT 1/0 doesn’t return an error.
https://sqlstudies.com/2017/09/07/there-exists-a-place-where-select-10-doesnt-return-an-error/
Hi Erik,
Your link, in the second paragraph to the Paul White paper “ignore dup keys”, results in a 404 error. Do you have a better URL?
Sure, anytime when you see an error from a site, check out the Wayback Machine. It’s an archive of web sites:
https://archive.org/web/
You can put the address in there, and see an archived version of the page from prior dates.
Thanks Brent. I frequently have to use The Wayback Machine, because Microsoft has a bad habit (my opinion) of removing very useful KB articles. That said, for a very recent article like this, I would expect reference links to work. I guess you’re saying there is nothing wrong with the URL, but rather the blog provider’s site is not so reliable?
Thanks
Yep! SQLblog.com has a nasty habit of going up and down lately.
The permanent home for that missing post is https://www.sql.kiwi/2013/02/a-creative-use-of-ignore-dup-key.html
Hey Brent,
Awesome as usual dude.
“The rest are… Odd. I’m really not sure what Stack Overflow does to remove User or Post actions when they’ve been deleted.”
Probably to keep track of scoring?
Thanks, but this post was by Erik.
Ah sorry Erik, great post!