Query Exercise Answer: Finding Email Addresses

For this week’s Query Exercise, your mission had two parts:

  1. Do a first-pass search through the Users table, looking for columns where people had snuck in email addresses where they didn’t belong
  2. Write a higher-quality query to identify exactly which rows we needed to handle

I wrote the challenge that way because in database work, we often write a quick-and-dirty proof of concept query to get a rough idea of what we’re dealing with. Sure, in a perfect world, we’d write a query that could capture any email address, but first let’s just get a rough idea if there’s anything that looks like an email address in the Location column:

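Something like this is enough for the first pass. It’s a minimal sketch, assuming the Stack Overflow Users table and a deliberately crude LIKE pattern:

    /* Quick-and-dirty first pass: anything in Location that smells like an email address */
    SELECT Id, DisplayName, Location
      FROM dbo.Users
      WHERE Location LIKE '%@%.com%';
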
That runs nearly instantly and returns results – some of which aren’t actually email addresses, like “@bloginfoguide.com”, but there are more than enough genuine email addresses in that list to confirm that we’ve got a real problem. (And that’s only .com emails!)

If we need to check all string columns, we could concatenate them together into one big long string, and check to see if that big long string has @ and .com in it. However, that won’t tell us which columns are problematic, and that’s going to be important if we need to do this kind of research at real-life scale – across not just all the string columns in one table, but all the tables in the database, too.
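
For what it’s worth, that concatenation approach might look like the sketch below (the column list is just an example from the Users table), and it shows the weakness: you learn that a row has a hit, but not which column it came from.

    /* Concatenate the string columns and search the whole blob at once.
       CONCAT treats NULLs as empty strings, so missing values don't break the search. */
    SELECT Id, DisplayName
      FROM dbo.Users
      WHERE CONCAT(DisplayName, ' ', Location, ' ', WebsiteUrl, ' ', AboutMe) LIKE '%@%.com%';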

When we hand our analysis over to someone (because it’s usually a dev manager or compliance officer who needs it), they probably want it in a format like this:

PII report proof of concept

That’s where dynamic SQL and metadata come in.

For challenges like this, it’s tough to beat dynamic SQL. (Trust me, I’ve tried to beat it repeatedly. I put on boxing gloves and punch my screen all the time when I’m working with dynamic SQL, but I’m still the one who ends up huddled on the floor, crying. Wait, where were we?)

Let’s start by taking the query we just wrote and modifying it to produce output similar to what we’re looking for, and let’s build it with dynamic SQL – but not make it metadata-driven just yet:

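A sketch of that step, still hardcoded to the Users table and the Location column (the variable name is my own placeholder):

    /* Build the same check as a string, print it to eyeball it, then execute it */
    DECLARE @StringToExecute NVARCHAR(MAX);

    SET @StringToExecute = N'SELECT ''dbo'' AS SchemaName, ''Users'' AS TableName, '
        + N'''Location'' AS ColumnName, Id, Location AS ColumnValue '
        + N'FROM dbo.Users WHERE Location LIKE ''%@%.com%'';';

    PRINT @StringToExecute;
    /* EXEC sys.sp_executesql @StringToExecute; */
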
That produces this query:

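With a sketch like the one above, the PRINT spits out something along these lines:

    SELECT 'dbo' AS SchemaName, 'Users' AS TableName, 'Location' AS ColumnName,
           Id, Location AS ColumnValue
    FROM dbo.Users WHERE Location LIKE '%@%.com%';
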
Now, let’s get a little fancier and build this with metadata. We’ll source our object definitions from the system object INFORMATION_SCHEMA.COLUMNS:

INFORMATION_SCHEMA.COLUMNS

For my quick-and-dirty purposes, I’m going to filter for schema <> ‘sys’, because those are typically system objects. I’m also going to filter for only string columns (varchar and nvarchar), but if you’ve gotten fancy with user-defined data types, you’ll have some additional work. I trust you’ll be smart enough to figure that part out, since you’re the one who got fancy there. Anyhoo, back in my environment, this gives me enough to metadata-drive my dynamic SQL:
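
A sketch of that metadata query, using the filters described above:

    /* Every varchar/nvarchar column outside the sys schema: the raw material for our dynamic SQL */
    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
      FROM INFORMATION_SCHEMA.COLUMNS
      WHERE TABLE_SCHEMA <> 'sys'
        AND DATA_TYPE IN ('varchar', 'nvarchar')
      ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME;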

So now let’s put those two things together – our dynamic SQL peanut butter and our metadata chocolate:

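Here’s a sketch of that combination: one generated query per string column, each returning the schema, table, column, and the offending value. (QUOTENAME guards against odd object names; including a per-table row identifier would take more work, so this sketch leaves it out.)

    SELECT N'SELECT ''' + TABLE_SCHEMA + N''' AS SchemaName, '''
         + TABLE_NAME + N''' AS TableName, '''
         + COLUMN_NAME + N''' AS ColumnName, '
         + QUOTENAME(COLUMN_NAME) + N' AS ColumnValue'
         + N' FROM ' + QUOTENAME(TABLE_SCHEMA) + N'.' + QUOTENAME(TABLE_NAME)
         + N' WHERE ' + QUOTENAME(COLUMN_NAME) + N' LIKE ''%@%.com%'';' AS QueryToRun
      FROM INFORMATION_SCHEMA.COLUMNS
      WHERE TABLE_SCHEMA <> 'sys'
        AND DATA_TYPE IN ('varchar', 'nvarchar');
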
When executed, that returns the queries we need to run against our database:

Metadata-driven queries

I like building my dynamic SQL gradually like this and testing it first manually to see if the queries are even correct. Given my typing skills and my use of alcohol, that’s not always the case, but here, these queries actually compile and run, woohoo! So now let’s turn them into dynamic SQL.

If I just built and executed them one by one, then our SSMS results tab would be chock full of lots of different result sets. I want one result set that I can copy/paste into Excel and hand to the project manager who will be dealing with this. (Ideally, I’d rather just give them the metadata-driven query so they can run it, and I won’t be responsible for copy/pasting PII into Excel, but managers sometimes just wave their hands and say, “Enough already, just put it in Excel and email it to me.”)

So I need it in one result set – to do that, I’m going to stage the data into a table. You could use a temp table for this, but I’d rather use a permanent table so the process is restartable in case my dynamic SQL hits a snag, and so I can check its progress from another session.

So our script gets a little more complicated again:

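A sketch of that version: a staging table (dbo.PII_Findings is my own placeholder name), plus generated INSERT statements instead of bare SELECTs:

    /* Staging table: a real table, not a temp table, so progress survives a snag
       and is visible from other sessions */
    CREATE TABLE dbo.PII_Findings
        (SchemaName sysname, TableName sysname, ColumnName sysname,
         ColumnValue NVARCHAR(MAX));

    /* Generate one INSERT...SELECT per string column */
    SELECT N'INSERT INTO dbo.PII_Findings (SchemaName, TableName, ColumnName, ColumnValue) '
         + N'SELECT ''' + TABLE_SCHEMA + N''', ''' + TABLE_NAME + N''', ''' + COLUMN_NAME + N''', '
         + QUOTENAME(COLUMN_NAME)
         + N' FROM ' + QUOTENAME(TABLE_SCHEMA) + N'.' + QUOTENAME(TABLE_NAME)
         + N' WHERE ' + QUOTENAME(COLUMN_NAME) + N' LIKE ''%@%.com%'';' AS QueryToRun
      FROM INFORMATION_SCHEMA.COLUMNS
      WHERE TABLE_SCHEMA <> 'sys'
        AND DATA_TYPE IN ('varchar', 'nvarchar');
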
Test that, and it runs. Now, it’s time to think about actually executing it.

Executing this sucks in the real world.

In small-world situations like the Users table, I wouldn’t even bother with dynamic SQL. I’d simply take the output of the above query, and run it manually in SSMS. It’d finish in a few seconds, populating the table, and I’d take that output and run with it. However, the larger your data set becomes, the more problematic this is going to be.

You could build a cursor and loop over the contents of INFORMATION_SCHEMA.COLUMNS, executing the dynamic queries one at a time, one per column. However, you don’t wanna do that because I’ve hit situations where read-only metadata queries, even with READ UNCOMMITTED and NOLOCK hints, have actually either been blocked by other queries, or been the cause of blocking outages. (My favorite was when a client had a production outage because someone was running sp_BlitzIndex at the same time someone else was creating a columnstore index.) I wouldn’t want your query to be the cause of the next production outage.

So if you want to take the cursor approach, dump the contents of INFORMATION_SCHEMA.COLUMNS into a user table or temp table first, and then iterate over that, using Aaron Bertrand’s tips on cursor tuning.
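
A sketch of that approach, reusing the staging table from the earlier sketch (again, the names are placeholders):

    /* Snapshot the metadata first so we're not holding metadata queries open
       while the dynamic SQL runs */
    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
      INTO #ColumnsToCheck
      FROM INFORMATION_SCHEMA.COLUMNS
      WHERE TABLE_SCHEMA <> 'sys'
        AND DATA_TYPE IN ('varchar', 'nvarchar');

    DECLARE @SchemaName sysname, @TableName sysname, @ColumnName sysname,
            @StringToExecute NVARCHAR(MAX);

    DECLARE columns_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME FROM #ColumnsToCheck;

    OPEN columns_cursor;
    FETCH NEXT FROM columns_cursor INTO @SchemaName, @TableName, @ColumnName;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        /* Build and run one INSERT per column, writing hits to the staging table */
        SET @StringToExecute = N'INSERT INTO dbo.PII_Findings (SchemaName, TableName, ColumnName, ColumnValue) '
            + N'SELECT ''' + @SchemaName + N''', ''' + @TableName + N''', ''' + @ColumnName + N''', '
            + QUOTENAME(@ColumnName)
            + N' FROM ' + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName)
            + N' WHERE ' + QUOTENAME(@ColumnName) + N' LIKE ''%@%.com%'';';
        EXEC sys.sp_executesql @StringToExecute;
        FETCH NEXT FROM columns_cursor INTO @SchemaName, @TableName, @ColumnName;
    END;
    CLOSE columns_cursor;
    DEALLOCATE columns_cursor;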

Another consideration is query performance. The way we’ve written this, we’re doing multiple passes against the same table over and over. This technique works great if you have a lot of nonclustered indexes on your string columns. However, if you don’t, or if you purposely want to do a table scan instead, AND if you have hardly any PII, this is going to suck because you’re going to do a full table scan for each column you want to check. In that situation, it’d be better to write your PII-checking query differently, more like this pseudocode:

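In sketch form, assuming the Users table’s string columns, that single-pass version looks something like this:

    /* One scan of the table, every string column checked in the same pass */
    SELECT *
      FROM dbo.Users
      WHERE DisplayName LIKE '%@%.com%'
         OR Location    LIKE '%@%.com%'
         OR WebsiteUrl  LIKE '%@%.com%'
         OR AboutMe     LIKE '%@%.com%';
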
That would require a different result-capture method, though, since it doesn’t identify which column contains the PII, and it returns all columns. However, if you’ve got a lot of string columns and no indexes, this query might execute faster (albeit harder to build).

That brings me to my last consideration: the goal of this Query Exercise wasn’t to give you a ready-to-go PII analysis solution, because after all, we’re only checking for email addresses here! We haven’t even gone into things like social security numbers or dates of birth or whatever.

Let’s zoom out.

Your original query challenge had two parts:

  1. Do a first-pass search through the Users table, looking for columns where people had snuck in email addresses where they didn’t belong
  2. Write a higher-quality query to identify exactly which rows we needed to handle

And we’ve already faced a lot of complexity with just working on #1, and only for email addresses! That’s why in the real world, when clients ask me about this kind of thing, I walk them through a quick-and-dirty check like this:

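Meaning a couple of minutes of crude LIKE queries against the most likely free-text columns, along the lines of this sketch:

    /* Quick sanity check: does anything email-shaped show up where it shouldn't? */
    SELECT TOP 100 Id, Location, AboutMe
      FROM dbo.Users
      WHERE Location LIKE '%@%.com%'
         OR AboutMe  LIKE '%@%.com%';
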
And if email addresses show up, our proof-of-concept check is done. We’ve got a problem: PII is where it isn’t supposed to be, and frankly, IT ALWAYS IS. Users are boundlessly creative in how they sneak PII into systems.

At that point, the project scope just exploded. We’re going to have to identify all of the PII types we’re looking for, and write robust queries to detect that PII hidden in various formats. This work is genuinely hard and not particularly performant, especially when we try to do it in T-SQL, as we covered in the recent regex post.

This is where a dedicated team, and perhaps buying a third party app, make sense. You and I aren’t going to write queries that do a good job detecting all the types of PII, especially when they’re cleverly hidden in columns like Location or masses of text in AboutMe. If our work needs to be legally defensible – and in PII cases, it does – then there’s no substitute for a team with the time to really do this topic justice.

Or, you know, just, uh, sweep it under the rug. Time to go build another feature for our app!


4 Comments

  • John Ballentine III
    March 26, 2025 2:29 pm

    I was finally looking at this seriously, and naturally you already have an answer… Without coming up with the conditions I was looking at. Fun item – for decades now, people put in “(name ‘at’ domain)”… So I included (as one example…) WebsiteUrl LIKE ‘% at % dot %’ and came up with 265 results, all of which look like likely emails. On the other hand, I excluded the “EmailHash” because I was being silly and forgot that even though it should be hashed, programmers…

    https://gist.github.com/JohnDBallentineIII/d5bf6bda336d884db717e8b22bb01641

    So, over 3.3k results. And yes, I agree, I’d be asked to toss it into Excel and I’d happily do that and hand off to the manager that requested it.

    • John Ballentine III
      March 26, 2025 2:32 pm

      Ok, responding too early in the morning. The “name at domain” results were all in the AboutMe column…

  • Wayne H. Hamberg
    March 26, 2025 9:49 pm

    Brent… If you want to drain a swamp, the first thing you do is make sure no more water is flowing into the swamp. The very first fix is to prevent any PII from entering the database in the first place, and that needs to be done in the GUI layer code. From the second that GUI code gets rolled into production, you know no more PII is landing in the tables you have to fix. In a large application this takes several rolls into production to ensure everything is fixed. I’ve had to do this in large databases with PII and PHI data being rolled into development tables.

    As data checks tend to grow in number, I always had a table that maintained the maximum row ID I had processed, so I’d grab 5,000 rows at a time past that ID to prevent locking, rather than combining multiple checks. If I add a new check, I only need to check the rows involved. Bring the primary key and the column you’re concerned about down to your local machine, do the checking there, and log the rows you have concerns about. Then the task is simple: have a user review those concerns, mark the rows that turn out to be fine as OK, and resolve the rest however you wish.

    • Wayne – that’s great, but it’s not an answer to the challenge.

      It feels like you’re trying to write a lecture or a sermon. That’s cool, and there’s a place for that – but it’s not here. Hope that’s fair.

