Adventures In Foreign Keys 1: Setting Up Foreign Keys in Stack Overflow

Last Updated December 29, 2021

In A Foreign Key, In A Foreign Table

Much of what people want from foreign keys, like referential integrity and join elimination, are only as guaranteed as much as SQL Server can trust your constraints (and even then…). The same goes for check constraints, too.

Thankfully, things like Primary Keys and Unique Constraints are sort of self-trusting, in that there’s really no way to add them without checking the underlying data. Sure, you can ignore dupe keys (if SQLblog is down again, try the Wayback Machine version), but that’s a different sort of trick.

This week, I’ve got 5 posts about foreign keys to look at various aspects of implementing them, and their behavior once they’re in place. If you think I missed something, or didn’t adequately cover it, hang tight. It’ll probably show up eventually.

Pretty much maybe.

Bookworm

When you read books about relational design — Louis Davidson has an excellent one — the importance of things like this are emphasized quite a bit. If you’re starting an app from scratch, you’d be wise to heed the advice in them, especially if it’s an OLTP app. If you take care of it here, you don’t have to worry about enforcing it in your data warehouse.

If you ever want to feel poor design pain, try adding, enforcing, or loading large amounts of data into tables with foreign keys. While you do that, start learning a new language.

But what if you’ve got a database that’s been around for a while. Can you trust the data? Can you trust that no developer ever removed or disabled a constraint to “fix a problem” or “add some custom data” and then forgot to turn them back on for a week or two?

Didn’t think so.

Keying Time Had Come

Continuing to use Stack Overflow as a demo data set, let’s look at if the newest dump is foreign key-able.

Before we get started, I just want to mention that I don’t expect it to be. As far as I can tell, if a user deletes their account, their actions are mostly preserved.

In order to test our data, we’d be wise to create some indexes on important columns.

CREATE INDEX ix_whatever ON dbo.Badges(UserId);
CREATE INDEX ix_whatever ON dbo.Comments(UserId);
CREATE INDEX ix_apathy ON dbo.Comments(PostId);
CREATE INDEX ix_whatever ON dbo.Posts(OwnerUserId);
CREATE INDEX ix_whatever ON dbo.Votes(UserId);
CREATE INDEX ix_apathy ON dbo.Votes(PostId);

CREATE INDEX ix_whatever ON dbo.Badges(UserId);

CREATE INDEX ix_whatever ON dbo.Comments(UserId);

CREATE INDEX ix_apathy ON dbo.Comments(PostId);

CREATE INDEX ix_whatever ON dbo.Posts(OwnerUserId);

CREATE INDEX ix_whatever ON dbo.Votes(UserId);

CREATE INDEX ix_apathy ON dbo.Votes(PostId);

Next, let’s see what kind of nonsense we’ve got to clean up. We’re going to run some selects to get a count of rows in our “action” tables (Badges, Comments, Posts, Votes) to see which ones have orphaned Users or Posts. Since it’s long and repetitive, I’m going to put these at the end of the post. For now, let’s just look at the results.

/*
Badges With No User
----------
3

Comments With No User
----------
337,603

Comments With No Post
----------
5

Posts With No User
----------
227,634

Votes With No User
----------
49,346,043

Votes With No Post
----------
8,109,424
*/

Badges With No User

----------

Comments With No User

----------

337,603

Comments With No Post

----------

Posts With No User

----------

227,634

Votes With No User

----------

49,346,043

Votes With No Post

----------

8,109,424

Votes with no User sticks out, because that data is cleaned from the dump. If they published how people voted, it would be kind of terrible. Brent would find out what I really think about his Answers.

The rest are… Odd. I’m really not sure what Stack Overflow does to remove User or Post actions when they’ve been deleted.

IRL

You’ll probably know what processes do (or don’t do) this kind of clean up work, which might make deciding what to do with dirty data easier.

You have a few options that I can think of quickly:

Delete it
Associate it with dummy Ids
Move it to separate tables for reference

I’ll give you $100 if you guess which one I picked before I tell you that I deleted it. The delete scripts are at the end, too.

You were so close.

Deciding what to do with it (or if you’re going to do anything with it at all) is a business decision. It’s their data. If they decide to follow you on your crazy quest to implement referential integrity, it’s your job as a data person to work with them on the best way to end up with data that complies with your Utopian vision. If they want to keep it in some form, you have to respect that.

If they’re cool with deleting it, well..

Be Kind, Batchwind

Just read this post from Michael J. Swart about batching modifications.

Excellent Adventures

I’m going to do some more of these posts, and talk about the difficulties with implementing foreign keys, cascading actions, indexing, and join elimination.

Thanks for reading!

Scripts

Checking for bad data:

/*Badges without a User*/
SELECT FORMAT(COUNT(*), 'N0') AS BNU
FROM dbo.Badges AS b
WHERE NOT EXISTS
(
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE b.UserId = u.Id
);

/*Comments without a User*/
SELECT FORMAT(COUNT(*), 'N0') CNU
FROM dbo.Comments AS c
WHERE NOT EXISTS 
(
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE c.UserId = u.Id
);

/*Comments without a Post*/
SELECT FORMAT(COUNT(*), 'N0') CNP
FROM dbo.Comments AS c
WHERE NOT EXISTS 
(
    SELECT 1/0
    FROM dbo.Posts AS p
    WHERE c.PostId = p.Id
);

/*Posts without a User*/
SELECT FORMAT(COUNT(*), 'N0') PNU
FROM dbo.Posts AS p
WHERE NOT EXISTS 
(
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE p.OwnerUserId = u.Id
);

/*Votes without a User*/
/*This is no good, because it gets cleaned...*/
SELECT FORMAT(COUNT(*), 'N0') VNU
FROM dbo.Votes AS v
WHERE NOT EXISTS
(
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE v.UserId = u.Id
)
--AND v.UserId IS NOT NULL; -- Toggle, big difference

/*Votes without a Post*/
SELECT FORMAT(COUNT(*), 'N0') VNP
FROM dbo.Votes AS v
WHERE NOT EXISTS
(
    SELECT 1/0
    FROM dbo.Posts AS p
    WHERE v.PostId = p.Id
);

/*Badges without a User*/

SELECT FORMAT(COUNT(*), 'N0') AS BNU

FROM dbo.Badges AS b

WHERE NOT EXISTS

(

SELECT 1/0

FROM dbo.Users AS u

WHERE b.UserId = u.Id

);

/*Comments without a User*/

SELECT FORMAT(COUNT(*), 'N0') CNU

FROM dbo.Comments AS c

WHERE NOT EXISTS

(

SELECT 1/0

FROM dbo.Users AS u

WHERE c.UserId = u.Id

);

/*Comments without a Post*/

SELECT FORMAT(COUNT(*), 'N0') CNP

FROM dbo.Comments AS c

WHERE NOT EXISTS

(

SELECT 1/0

FROM dbo.Posts AS p

WHERE c.PostId = p.Id

);

/*Posts without a User*/

SELECT FORMAT(COUNT(*), 'N0') PNU

FROM dbo.Posts AS p

WHERE NOT EXISTS

(

SELECT 1/0

FROM dbo.Users AS u

WHERE p.OwnerUserId = u.Id

);

/*Votes without a User*/

/*This is no good, because it gets cleaned...*/

SELECT FORMAT(COUNT(*), 'N0') VNU

FROM dbo.Votes AS v

WHERE NOT EXISTS

(

SELECT 1/0

FROM dbo.Users AS u

WHERE v.UserId = u.Id

)

--AND v.UserId IS NOT NULL; -- Toggle, big difference

/*Votes without a Post*/

SELECT FORMAT(COUNT(*), 'N0') VNP

FROM dbo.Votes AS v

WHERE NOT EXISTS

(

SELECT 1/0

FROM dbo.Posts AS p

WHERE v.PostId = p.Id

);

Deleting bad data:

/*Badges without a User*/
DELETE b
FROM dbo.Badges AS b
WHERE NOT EXISTS
(
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE b.UserId = u.Id
);


/*Comments without a User*/
DELETE c
FROM dbo.Comments AS c
WHERE NOT EXISTS 
(
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE c.UserId = u.Id
);



/*Comments without a Post*/
DELETE c
FROM dbo.Comments AS c
WHERE NOT EXISTS 
(
    SELECT 1/0
    FROM dbo.Posts AS p
    WHERE c.PostId = p.Id
);



/*Posts without a User*/
DELETE p
FROM dbo.Posts AS p
WHERE NOT EXISTS 
(
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE p.OwnerUserId = u.Id
);


/*Votes without a User*/
DELETE v
FROM dbo.Votes AS v
WHERE v.UserId IS NOT NULL
AND NOT EXISTS 
(
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE v.UserId = u.Id
);
GO


/*Votes without a Post*/
DELETE v
FROM dbo.Votes AS v
WHERE NOT EXISTS
(
    SELECT 1/0
    FROM dbo.Posts AS p
    WHERE v.PostId = p.Id
);

/*Badges without a User*/

DELETE b

FROM dbo.Badges AS b

WHERE NOT EXISTS

(

SELECT 1/0

FROM dbo.Users AS u

WHERE b.UserId = u.Id

);

/*Comments without a User*/

DELETE c

FROM dbo.Comments AS c

WHERE NOT EXISTS

(

SELECT 1/0

FROM dbo.Users AS u

WHERE c.UserId = u.Id

);

/*Comments without a Post*/

DELETE c

FROM dbo.Comments AS c

WHERE NOT EXISTS

(

SELECT 1/0

FROM dbo.Posts AS p

WHERE c.PostId = p.Id

);

/*Posts without a User*/

DELETE p

FROM dbo.Posts AS p

WHERE NOT EXISTS

(

SELECT 1/0

FROM dbo.Users AS u

WHERE p.OwnerUserId = u.Id

);

/*Votes without a User*/

DELETE v

FROM dbo.Votes AS v

WHERE v.UserId IS NOT NULL

AND NOT EXISTS

(

SELECT 1/0

FROM dbo.Users AS u

WHERE v.UserId = u.Id

);

/*Votes without a Post*/

DELETE v

FROM dbo.Votes AS v

WHERE NOT EXISTS

(

SELECT 1/0

FROM dbo.Posts AS p

WHERE v.PostId = p.Id

);

[Video] Office Hours 2018/11/7 (With Transcriptions)

Adventures In Foreign Keys 2: Common Foreign Key Errors

11 Comments. Leave new

klepras
November 14, 2018 9:17 am

SELECT 1/0 looks bizarre. Is it a reason behind it or it’s just for fun?

Reply
Tom Wickerath
November 16, 2018 7:33 pm

Hi klepras,
I’m also intrigued by the use of SELECT 1/0. I did a little Googling and found some answers:

1.) This posting, https://www.brentozar.com/archive/2018/08/a-common-query-error/, includes the following comments:
Kevin Hill
August 14, 2018 8:28 am
This works…why does Select 1/0 work in the Exists, but not outside of it?

Erik Darling
August 14, 2018 8:34 am
You can put anything you want in the select list of exists or not exists. It doesn’t get selected.

2.) Do Variables Exist? (More of Erik and Brent’s humor)
https://www.brentozar.com/archive/2017/10/do-variables-exist/
Chatterbox
When I asked this stuff in the company chat, Brent chimed in with some queries that he thinks are funny, too.

IF EXISTS ( SELECT 1 / 0 FROM sys.databases )
SELECT ‘I can totally divide by zero’;
GO
IF EXISTS ( SELECT SUM(1 / 0) FROM sys.databases )
SELECT ‘You”re not my real dad’;
GO
IF EXISTS ( SELECT COUNT(1 / 0) FROM sys.databases )
SELECT ‘I know you are but what am I’;
GO
IF EXISTS ( SELECT AVG(1 / 0) FROM sys.databases )
SELECT ‘I know you are but what am I’;
GO

3.) Kenneth Fisher covers this topic as well:
There EXISTS a place where SELECT 1/0 doesn’t return an error.
https://sqlstudies.com/2017/09/07/there-exists-a-place-where-select-10-doesnt-return-an-error/

Reply
Tom Wickerath
November 16, 2018 8:00 pm

Hi Erik,
Your link, in the second paragraph to the Paul White paper “ignore dup keys”, results in a 404 error. Do you have a better URL?

Reply
- Brent Ozar
  November 17, 2018 8:41 am
  
  Sure, anytime when you see an error from a site, check out the Wayback Machine. It’s an archive of web sites:
  
  https://archive.org/web/
  
  You can put the address in there, and see an archived version of the page from prior dates.
  
  Reply
  - Tom Wickerath
    November 17, 2018 11:37 am
    
    Thanks Brent. I frequently have to use The Wayback Machine, because Microsoft has a bad habit (my opinion) of removing very useful KB articles. That said, for a very recent article like this, I would expect reference links to work. I guess you’re saying there is nothing wrong with the URL, but rather the blog provider’s site is not so reliable?
    
    Thanks
    
    Reply
    - Brent Ozar
      November 17, 2018 11:57 am
      
      Yep! SQLblog.com has a nasty habit of going up and down lately.
      
      Reply
Paul White
May 3, 2021 2:57 pm

The permanent home for that missing post is https://www.sql.kiwi/2013/02/a-creative-use-of-ignore-dup-key.html

Reply
Grant Clark
July 15, 2021 6:51 pm

Hey Brent,
Awesome as usual dude.

“The rest are… Odd. I’m really not sure what Stack Overflow does to remove User or Post actions when they’ve been deleted.”

Probably to keep track of scoring?

Reply
- Brent Ozar
  July 15, 2021 6:54 pm
  
  Thanks, but this post was by Erik.
  
  Reply
  - Grant Clark
    July 16, 2021 7:44 am
    
    Ah sorry Erik, great post!
    
    Reply
Nem toda chave estrangeira deve ser indexada – Porto SQL
February 11, 2024 3:45 pm

[…] Adventures In Foreign Keys (Erik Darling) , 2018 série de 5 artigos sobre chaves estrangeiras […]

Reply