Query Exercise: Find Foreign Key Problems

Query Exercises
21 Comments

For 2024, I’m trying something new: weekly homework challenges! For this week, let’s say we’ve decided to implement foreign keys, and we need to find data that’s going to violate our desired keys.

Foreign keys, according to AIWe’re going to use the Stack Overflow database, and we’ll focus on these 3 tables:

  • dbo.Users table: with Id column as its primary key
  • dbo.Posts table: with OwnerUserId column noting which Users.Id wrote the post
  • dbo.Comments table: with UserId column noting which Users.Id wrote the comment, and PostId column noting which Posts.Id is being commented on

Before we attempt to implement foreign keys, we need to find data which might violate the foreign key relationships. Are there any:

  • Posts rows whose OwnerUserId does not match up with a valid Users.Id
  • Comments rows whose UserId doesn’t match up with a valid Users.Id
  • Comments rows whose PostId doesn’t match up with a valid Posts.Id
  • And to make your task easier, let’s focus on just the first 100K rows in each table (rows with an Id <= 100000) to see whether or not foreign keys make sense for this database

Your query exercise has a few parts:

  1. Write one or more queries to find these answers as quickly as possible with low load on the database.
  2. Given what you find, hypothesize about what might have caused the foreign key problems.
  3. Given what you learned, are there any changes you want to make to the app, processes, or database?

You can post your answers in this blog post’s comments, and discuss each others’ ideas. We discuss the challenges & techniques in the next post. Have fun!

Previous Post
[Video] Office Hours: Oddball Questions Edition
Next Post
Announcing the 2024 Data Professional Salary Survey Results.

21 Comments. Leave new

  • 52GB database with compat level changed to 2019. I didn’t use Id < 100000 because it gave weird results.

    Scope of the problem:
    SELECT COUNT(*) FROM dbo.Posts P
    WHERE NOT EXISTS (SELECT * FROM dbo.Users U WHERE P.OwnerUserId = U.Id)
    SELECT COUNT(*) FROM dbo.Comments C
    WHERE NOT EXISTS (SELECT * FROM dbo.Posts P WHERE P.Id = C.PostId)
    SELECT COUNT(*) FROM dbo.Comments C
    WHERE NOT EXISTS (SELECT * FROM dbo.Users U WHERE U.Id = C.UserId)

    To find the actual records, just replace COUNT(*) with the Id field name.

    Dig a little deeper to find out what's going on:
    SELECT P.OwnerUserId, COUNT(*) c FROM dbo.Posts P
    WHERE NOT EXISTS(SELECT * FROM dbo.Users U WHERE P.OwnerUserId = U.Id)
    GROUP BY P.OwnerUserId

    SELECT C.UserId, COUNT(*) c FROM dbo.Comments C
    WHERE NOT EXISTS(SELECT * FROM dbo.Users U WHERE U.Id = C.UserId)
    GROUP BY C.UserId

    SELECT C.PostId, COUNT(*) c FROM dbo.Comments C
    WHERE NOT EXISTS(SELECT * FROM dbo.Posts P WHERE P.Id = C.PostId)
    GROUP BY C.PostId

    Problems with Posts:
    – Nearly all of the Posts with an invalid OwnerUserId are using 0. I imagine that's a magic number that's only understood via the devs' unwritten rules.

    Problems with Comments:
    – Similarly, the Comments with a missing UserId are using NULL.
    – Only 5 Comments are missing their PostId. I suspect that's a software bug. Fix the bug, then fix up the data manually.

    App/Database changes:
    – Magic numbers are bad. Don't use them. Putting in 0 seems like inviting parameter sniffing.
    – I would review what GDPR says to do with blog posts when a user wants to be removed. Can you just overwrite columns containing PII or do you need to get rid of the user's comments and posts. If you could just overwrite PII, then you could keep the User record. You would no longer need magic numbers or NULL and you can continue to identify posts as having been made by BMOC users versus newbies.
    – Data cleanup followed by creating the foreign keys.

    Reply
    • Martin – good work! A couple of followup questions:

      * Why are magic numbers bad? (You can’t just say parameter sniffing: the database already has parameter sniffing issues with outlier parameters.)
      * What DOES the GDPR say about that?

      Reply
      • – I withdraw the comment about param sniffing. Just have to remember to think before clicking submit.
        – Magic numbers are bad because they’re pretty much always undocumented, they make the code hard to read and because new team members aren’t aware of them. If you use magic numbers, you should be tasked with optimizing linked server queries as punishment. (If you aren’t sure how, go add a question for Brent’s office hours.)

        Reply
        • “Magic numbers are bad because they’re pretty much always undocumented”

          Isn’t that an easy problem to fix, though?

          I mean… take a deep breath, Martin. It’s just documentation. You’ve put a ton of *good* work into a blog post comment, right? Don’t be scared: turn slowly over towards your work computer, brace yourself, and I think you might – just might – be able to write documentation at *work*, too. 😉

          Reply
    • Since performance was a goal – why do you not first group the posts / pomments by the user_id in a subselect / CTE before you do the WHERE NOT EXISTS in an outer query?

      This way it would not have to do the SEMI LEFT JOIN for millions of rows but only for a few thousands.

      It may become particularly more relevant when you have some more WHERE conditions, maybe even with @parameters and the SQL server may end with using a NESTED LOOKUP when joining the posts table with the users (bad statistics or parameter sniffing or just an unlucky combination of conditions).

      I tested it with your first query on my local PC (SQL 2022 Dev.) and a the 2013 version of the StackOverflow-DB and an additional index on the OwerUserId column in the posts table (you should always index columns that are used in a Foreign Key)

      SELECT SUM(p.cnt)
      FROM (SELECT P.OwnerUserId, COUNT(*) cnt
      FROM dbo.Posts P
      GROUP BY P.OwnerUserId) AS p
      WHERE NOT EXISTS (SELECT * FROM dbo.Users U WHERE P.OwnerUserId = U.Id)

      was usually 0,4 seconds faster and took 2 seconds lesser CPU time than your version without the “pre-grouping”.

      Reply
      • Thomas – so to rephrase your answer, “With different indexes, a different query performed better for me.”

        That’s a great point! That does mean, however, that you should be posting your own complete homework answer with your results & approach, rather than critiquing the results of others, eh? 😉

        Reply
      • Quote “… additional index on the OwerUserId column in the posts table (you should always index columns that are used in a Foreign Key)” Unquote

        While you should index foreign key columns, this should be part of the implementation stage of such FKs, and then adding an index for the sake of a one-time query seems … excessively pricey.

        Reply
    • Quote “Problems with Posts:
      – Nearly all of the Posts with an invalid OwnerUserId are using 0. I imagine that’s a magic number that’s only understood via the devs’ unwritten rules.”

      This is a very big assumption you are making, in my opinion it is better to validate this with the app developers. It could just be that a dev made it 0 instead of NULL because he/she didn’t want to/understood proper NULL handling.

      Reply
  • SELECT COUNT(*)
    FROM Posts
    LEFT OUTER JOIN Users ON Users.Id = Posts.OwnerUserId
    WHERE Posts.Id < 100000
    AND Posts.OwnerUserId IS NOT NULL
    AND Users.Id IS NULL;
    /*
    Sorry Mr Ozar, the dog ate the rest of my homework.
    */

    Reply
    • Alex – okay, good start! So if all of the Posts rows had null OwnerUserIds, your query wouldn’t return any rows. Could we implement foreign keys in that situation?

      Reply
  • Since the question isn’t ‘how many are there’ but ‘are there any’:

    select top(1) p.Id, p.owneruserid, u.id from dbo.posts p left outer join dbo.users u on u.id=p.OwnerUserId
    where u.Id IS NULL

    Reply
  • […] Query Exercise: Find Foreign Key Problems (Brent Ozar) […]

    Reply
  • Database used: StackOverflow2013 in compat level 170 (SQL 2022) with SQL Dev edition. No extra indexes have been added to the database.

    Since we don’t have any foreign key enforcing data integrity, before writing a single query, so we assume that the worst case scenario is true (at least until proven otherwise):
    1. Posts exist without a valid OwnerUserId;
    2. Comments exist without a valid PostId and/or UserId.

    And we assume that these anomalies will continue to happen until we review and update the app code. Since the StackOverflow database that is used for this exercise is “old” we’ll use the highest date from the posts or comments table as present day.
    We’ll also assume that we have full control over the application code and database.

    As we are considering a major change to the database and the tables within we want to be as complete as possible so that we can present a proper bussiness case to the stakeholders.

    —–

    /*** Determine “today” ***/
    SELECT
    Today = CONVERT(nvarchar(10), MAX(Today.Date), 121)
    FROM
    (
    SELECT Date = MAX(CreationDate)
    FROM dbo.Posts
    UNION ALL
    SELECT Date = MAX(CreationDate)
    FROM dbo.Comments
    ) AS Today;

    /*** Today is: 2014-01-01 (2013-12-31 + 1 day) ***/

    /***
    So we start by going through some checks with SQL:
    ***/

    /*** Step one: Determine the scope ***/
    /* Does a post exist without a valid user? */
    SELECT TOP 1
    Issue = ‘PostsWithoutValidUser’
    ,p.Id
    FROM dbo.Posts AS p
    WHERE NOT EXISTS
    (
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE p.OwnerUserId = u.Id
    )
    UNION ALL
    /* Does a comment exist without a valid user? */
    SELECT TOP 1
    Issue = ‘CommentsWithoutValidUser’
    ,c.Id
    FROM dbo.Comments AS c
    WHERE NOT EXISTS
    (
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE c.UserId = u.Id
    )
    UNION ALL
    /* Does a comment exist without a valid post? */
    SELECT TOP 1
    Issue = ‘CommentsWithoutValidPost’
    ,c.Id
    FROM dbo.Comments AS c
    WHERE NOT EXISTS
    (
    SELECT 1/0
    FROM dbo.Posts AS p
    WHERE c.PostId = p.Id
    );

    /*** Note: If any of these queries don’t return a result further checks are not required ***/
    /* All queries have at least 1 result */

    /*** Step 2: Determine the size and “age” of the issues ***/
    /* Let’s retrieve some metadata about the Posts */
    WITH PostsMetadata AS
    (
    SELECT
    Number = COUNT(*)
    ,FirstDate = MIN(p.CreationDate)
    ,LastDate = MAX(p.CreationDate)
    FROM dbo.Posts AS p
    ),
    /* Let’s retrieve data about the Posts without users */
    PostsWithoutValidUser AS
    (
    SELECT
    NumberOfIssues = COUNT(*)
    ,FirstOccurance = MIN(p.CreationDate)
    ,LastOccurance = MAX(p.CreationDate)
    FROM dbo.Posts AS p
    WHERE NOT EXISTS
    (
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE p.OwnerUserId = u.Id
    )
    )
    /* Let’s retrieve some metadata about the Comments */
    , CommentsMetadata AS
    (
    SELECT
    Number = COUNT(*)
    ,FirstDate = MIN(c.CreationDate)
    ,LastDate = MAX(c.CreationDate)
    FROM dbo.Comments AS c
    )
    /* Let’s retrieve data about the Comments without users */
    , CommentsWithoutValidUser AS
    (
    SELECT
    NumberOfIssues = COUNT(*)
    ,FirstOccurance = MIN(c.CreationDate)
    ,LastOccurance = MAX(c.CreationDate)
    FROM dbo.Comments AS c
    WHERE NOT EXISTS
    (
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE c.UserId = u.Id
    )
    )
    /* Let’s retrieve data about the Posts without posts */
    , CommentsWithoutValidPosts AS
    (
    SELECT
    NumberOfIssues = COUNT(*)
    ,FirstOccurance = MIN(c.CreationDate)
    ,LastOccurance = MAX(c.CreationDate)
    FROM dbo.Comments AS c
    WHERE NOT EXISTS
    (
    SELECT 1/0
    FROM dbo.Posts AS p
    WHERE c.PostId = p.Id
    )
    )
    SELECT
    Issue = ‘PostsWithoutUser’
    ,pMeta.Number
    ,pMeta.FirstDate
    ,pMeta.LastDate
    ,pNoUser.NumberOfIssues
    ,pNoUser.FirstOccurance
    ,pNoUser.LastOccurance
    ,PercentageOfPosts = CAST(ROUND(CAST(pNoUser.NumberOfIssues AS decimal(14, 2)) / CAST(pMeta.Number AS decimal(14, 2)) * 100, 2) AS decimal(14, 2))
    FROM PostsMetadata AS pMeta
    CROSS APPLY PostsWithoutValidUser AS pNoUser
    UNION ALL
    SELECT
    Issue = ‘CommentsWithoutUser’
    ,cMeta.Number
    ,cMeta.FirstDate
    ,cMeta.LastDate
    ,cNoUser.NumberOfIssues
    ,cNoUser.FirstOccurance
    ,cNoUser.LastOccurance
    ,PercentageOfComments = CAST(ROUND(CAST(cNoUser.NumberOfIssues AS decimal(14, 2)) / CAST(cMeta.Number AS decimal(14, 2)) * 100, 2) AS decimal(14, 2))
    FROM CommentsMetadata AS cMeta
    CROSS APPLY CommentsWithoutValidUser AS cNoUser
    UNION ALL
    SELECT
    Issue = ‘CommentsWithoutPost’
    ,cMeta.Number
    ,cMeta.FirstDate
    ,cMeta.LastDate
    ,cNoPost.NumberOfIssues
    ,cNoPost.FirstOccurance
    ,cNoPost.LastOccurance
    ,PercentageOfComments = CAST(ROUND(CAST(cNoPost.NumberOfIssues AS decimal(14, 2)) / CAST(cMeta.Number AS decimal(14, 2)) * 100, 2) AS decimal(14, 2))
    FROM CommentsMetadata AS cMeta
    CROSS APPLY CommentsWithoutValidPosts AS cNoPost;

    /*
    Now we have some idea on how big the issues are then when they occured we can plan a way forward.
    All three anomalies have occurances in recent history as all have at least 1 occurance on 2013-12-31 (yesterday)

    The number of Comments without a valid post is very low with only 5 occurences, let’s look at these in some details:
    */

    SELECT
    c.Id
    ,c.CreationDate
    ,c.PostId
    ,c.UserId
    ,c.Text
    FROM dbo.Comments AS c
    WHERE NOT EXISTS
    (
    SELECT 1/0
    FROM dbo.Posts AS p
    WHERE c.PostId = p.Id
    )
    ORDER BY c.CreationDate ASC;

    /*
    Nothing about these records seems to be out of order (PostId and UserId seem to be logical numbers)

    Possible cause, likely deletion anomalies:
    The linked post was deleted but the comment wasn’t.
    Or comments don’t get deleted when a post is deleted.

    Resolution:
    Discuss with development team what the normal happy flow is when a post gets deleted, does the app delete linked comments as well?
    If the developers confirm that comments get deleted when a post gets deleted, then the way to resolve this is to set the postId to NULL or delete the comments without posts from the database.

    After the comments have been fixed, the way is clear to implement the desired foreign key on the database level.
    However, implementing the foreign key can have a breaking effect on the application, the order in which posts deletions are carried out might need to be altered in the app as linked comments will have to be deleted before the post.

    Using a ON DELETE CASCADE or ON DELETE SET NULL on the foreign key can be used, impact of using either of these options on performance need to be tested.
    */

    /*
    The number of Comments and Posts without a valid user is higher. Too high to look into all the details, so let’s have a look at the occurences per userId.
    */

    SELECT
    Issue = ‘PostsWithoutUser’
    ,NumberOfIssues = COUNT(*)
    ,UserID = p.OwnerUserId
    FROM dbo.Posts AS p
    WHERE NOT EXISTS
    (
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE p.OwnerUserId = u.Id
    )
    GROUP BY p.OwnerUserId
    UNION ALL
    SELECT
    Issue = ‘CommentsWithoutUser’
    ,NumberOfIssues = COUNT(*)
    ,UserID = c.UserId
    FROM dbo.Comments AS c
    WHERE NOT EXISTS
    (
    SELECT 1/0
    FROM dbo.Users AS u
    WHERE c.UserId = u.Id
    )
    GROUP BY c.UserId
    ORDER BY Issue ASC, NumberOfIssues DESC;

    /*
    The greatest number of Comments without a valid user are linked to a NULL value. The rest have a UserId that seems logical.
    Possible causes:
    * NULL values, probably proper handling of deleted users. Either by the app or an automated process. These are not a problem when implementing the desired foreign key.
    * NOT NULL values, probably improper handling of deleted users.

    Resolution:
    Discuss with the development team, what happens to the UserId field in the comments table when a user gets deleted?
    If the app developers don’t alter the UserId to a NULL value, then check with the database team if an automated process is running.

    Likely we want to keep the comments in the database, thus setting the UserId to NULL when the user is deleted is the prefered solution.
    So we want to update the userId in comments with an invalid userId to NULL. Then the way is clear to implement the desired foreign key on the database level.

    However, the app development team need to alter the code to update the userId in the comments when deleting a user to NULL.
    Alternatively ON DELETE SET NULL on the foreign key can be used, impact of using of this option on performance need to be tested.

    The greatest number of Posts without a valid user are linked to a zero (0) value. The rest have a OwnerUserId that seems logical.
    Possible causes:
    * Zero (0) value, probably proper handling of deleted users. Either by the app or an automated process. These are a problem when implementing the desired foreign key.
    * Non zero (0) values, probably improper handling of deleted users.

    Resolution:
    Discuss with the development team, what happens to the OwnerUserId field in the Posts table when a user gets deleted?
    If the app developers don’t alter the OwnerUserId to a zero (0) value, then check with the database team if an automated process is running.

    Likely we want to keep the posts in the database, thus setting the OwnerUserId to NULL when the user is deleted is the prefered solution.
    So we want to update the userId in posts with an invalid userId to NULL. Then the way is clear to implement the desired foreign key on the database level.

    However, the app development team need to alter the code to update the OwnerUserId in the posts when deleting a user to NULL.
    Alternatively ON DELETE SET NULL on the foreign key can be used, impact of using of this option on performance need to be tested.
    */

    Reply
    • @Montro1981 – great thought process there! I really like how you looked at the first & last occurrences of the problems in each table. Each table can have different culprits, and some of the problems might have been solved over time.

      I also like how you noted that performance tests would be required if we went down the road of cascading deletes or setting things to null.

      Great work!

      Reply
      • I have been listening/watching to your office hours video’s to have some of it rubbed off.
        And I have been dealing with enough real-world issues with app development to know at least some of the pitfalls.

        Usually, as some of us know, implementing FKs in an existing app database is more trouble than it’s worth, and it doesn’t get you that “Oh WoW factor” as applications usually don’t get any faster, rather the opposite and then it’s the DBA-er who takes the heat.

        Reply
      • And before I forget: Thanks Brent!

        Reply
        • My pleasure, and thanks for the great job on the assignment! It’s so much fun to see folks doing the work on these. I have a really good time writing the challenges.

          Reply
    • I just thought of a something: Insertion anomalies.
      If you comment on a post but while you are writing the comment the post gets deleted.

      Have that desired foreign key in place will break that Insert.
      So the app needs to validate when you post a comment (or post) if the referenced post (or user) still exists, otherwise it needs to raise an error.

      There is another solution, and that is not deleting anything, this can be do by adding a IsDeleted attribute to the tables and set that when a post or user gets deleted and filter in the app on that when displaying the posts/comments to the users.

      Reply
  • […] Your Query Exercise was to find foreign key problems: data that might stop us from successfully implementing foreign keys in the Stack Overflow database. […]

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.