Find Foreign Key Problems: Answers & Discussion

Your Query Exercise was to find foreign key problems: data that might stop us from successfully implementing foreign keys in the Stack Overflow database.

Question 1: Write queries to find the problematic data.

Montro1981’s thorough answer broke the challenge into two parts: first, check to see if there is ANY bad data, and if so, check to see how pervasive it is. What I really like about this query below is that it recognizes that:

  • Each table, and each column, might have different foreign key problems
  • Some of those problems might be big, and some might be small
  • Some might have only happened a long time ago, and some might still be ongoing

Heres what the results look like for the 2018-06 training version of the Stack Overflow database, and of course they’ll be different depending on which version you use:

We can see that the first two issues (Posts & Comments with invalid UserIds) is pretty rampant, whereas comments without valid PostIds is much less frequent. However, all 3 of the foreign key candidates are still having problems right up til the end of the data set (2018-06), which means whatever’s causing this “bad” data is still happening.

Question 2: What might have caused the data problems?

Let’s examine the posts rows that don’t have matching users rows. You can click on this to zoom in if you like:

We’re trying to join over to the users table using the owneruserid column, but the OwnerUserId is 0 for these rows, and apparently there’s no Users.Id 0. That 0 sounds like it might be a hard-coded magic value.

What are some reasons that a post might not have a valid owneruserid?

  • Perhaps users are allowed to delete their accounts, but Stack Overflow has chosen to keep their questions & answers around for posterity.
  • Perhaps there were bugs in transaction handling code. Perhaps there was a single stored procedure that was supposed to add a user and add their first post, but something went wrong, and only the post got inserted.
  • Perhaps there was database corruption.
  • Perhaps someone ran a delete on the users table with an incorrect where clause. (C’mon, we’ve all been there.)
  • Or multiple of the above, especially for an app that’s been around for 15+ years.

I’ve seen all of these in various systems over time, but given what I know from Stack (I used to work with ’em), the first option is the likely cause for most (but not all) of the problematic rows we’re seeing here.

Pull up the site to see how it’s handling the data problem.

Normally, when you’re looking at a question or answer on StackOverflow.com, you see the person who wrote it. Let’s take a look at what the web site shows for this post.

One of the things I love about StackOverflow.com is their URL handling. If you know a row’s id, you can pull it up on the site. We’re in the posts table, which is where questions & answers are stored, and let’s say we wanna see id 34, the one about “How to unload a ByteArray”. To see posts.id = 34, put this into your browser’s address bar:

https://stackoverflow.com/questions/34/

Click that, and Stack magically redirects you to:

https://stackoverflow.com/questions/34/how-to-unload-a-bytearray-using-actionscript-3

Isn’t that cool? It works for both questions & answers, too. For example, to see posts.id 18:

https://stackoverflow.com/questions/18/

Which redirects you to:

https://stackoverflow.com/questions/17/binary-data-in-mysql/18#18

Because posts.id 18 is actually an answer for question 17. More about that join structure some other day – for now, let’s go back to the problem at hand, and we’ll look at posts.id 34 about unloading a byte array.

At the bottom right, Stack shows who asked the question. It says “asked Aug 1, 2008 at 12:30, maclema.”

Wait – how does Stack know the user’s name?

How are they getting the name ‘maclema’ if there’s no working join over to the users table? In the SQL Server versions of the Stack Overflow database I’ve provided in the past, there’s no OwnerDisplayName column, but there’s actually one in the current XML data dumps. To see it, let’s look at another version of the database that includes the OwnerDisplayName in the Posts table:

And look over at the far right column:

Ah-ha! It looks like Stack de-normalized the data a little. Perhaps, when a user deletes their account, Stack goes through all of the posts with their owneruserid, and sets the ownerdisplayname value for historical purposes. Then, they can set the posts.owneruserid to null, and delete the users row.

Just for curiosity’s sake, let’s look at a few posts rows where the owneruserid is NOT null:

It kinda matches our hypothesis: they might not be populating the ownerdisplayname column until a user deletes their row. Some of the rows do have an ownerdisplayname populated though. This lines up with my real-world experience with decade-old production databases: the data is all over the place, in varying levels of explanations and quality. Perhaps Stack’s handling of deleted user accounts changed over time, too.

Question #3: any changes we want to make?

The last part of your query exercise was to think about any changes you might want to make to the app, processes, or database.

I’ll be honest with you, dear reader: I laid that as a trap for professional database administrators. When a DBA sees a database without foreign key referential integrity implemented, their first knee-jerk reaction is usually, “WE HAVE TO PUT IN FOREIGN KEYS RITE NAO TO FIX THIS PROBLEM!!1!!ONE!”

I agree with that in the beginning of an application’s lifecycle, but now that we’ve got a 15-year-old database with a ton of data that violates foreign key constraints, what are we supposed to do?

Could/should we add a hard-coded “Unknown User” row, and link the data to that? You’ll often see this with a “magic number” users row with an id of, say, 0 or -1. If we take that approach, we also have to modify our users-deletion process to assign posts.owneruserid to that magic value, or we have to implement triggers on tables to do that work. There are performance and concurrency issues associated with putting that work on the database tier, though.

In Martin’s excellent answer, another problem with the magic-number approach came up. “Magic numbers are bad because they’re pretty much always undocumented, they make the code hard to read and because new team members aren’t aware of them.” (I feel Martin’s pain, but in my experience, everything in the application is undocumented, so… there’s that.)

Could/should we set columns to null? If a User or Post is deleted, should we set all the referring columns to be null? And what should do that work, the app or a trigger? Foreign keys could be implemented in that case, but… our reports might not produce the data we expect if we’re doing inner joins. (Inner joins would work with the magic number approach above.)

Could we delete the data that doesn’t have valid relationships? If we did that, we would lose questions & answers on the site, stuff that’s valuable and helpful to looking for solutions. The business might be okay with that, though – that’s a business question, not a data professional’s question.

Even if we do either of those, we still can’t just implement foreign keys safely. We would need to do more investigation to find out if there are other causes for the missing users rows.

How’d you do? What’d you think?

I hope you had fun, enjoyed this exercise, and learned a little something. If you’ve got comments or questions, hit the comment section below. (Keep the comments focused on this exercise, though – we’re not covering the details of the Postgres export just yet. Only so many things I can cover per blog post, hahaha, and this thing’s already way long.)

Previous Post
Announcing the 2024 Data Professional Salary Survey Results.
Next Post
Query Exercise: Find Posts with the Wrong CommentCount

5 Comments. Leave new

  • […] Find Foreign Key Problems: Answers & Discussion – Brent Ozar Unlimited® January 10, 2024 1:16 pm […]

    Reply
  • Craig Yellick
    January 10, 2024 5:58 pm

    Is this really a problem? If this has worked successfully (or close to it) for 15 years it probably does not need a fix that could cause performance issues or worse.

    Reply
    • It’s a great question, and I’ll turn it back on you: given this data in the database, would you say it’s working successfully?

      That for me is the answer – if we’re cool with that data, then leave it as is. If we’re *not* cool with that data, then we’ve got work to do. (And I truly don’t care either way – it’s up to the business users.)

      Reply
    • Hehe, yeah. Introducing foreign keys in an existing database (never mind the age) is … a challenge … to say the least.

      In this case, it would require substantial altering of the core code of the application to work. The app needs to start doing things in the correct sequence.

      Reply
  • Thanks, Brent for highlighting my fact-finding solution to your assignment.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.