Query Exercise: Looking for Email Addresses

Personally identifiable information (PII) is freakin’ everywhere.
When companies first start looking to identify and lock down their data, they think it’s going to be as easy as identifying common columns like EmailAddress, DateOfBirth, SocialSecurityNumber, and so forth. They think, “We’ll just encrypt those columns and we’ll be fine.”
Ho ho, my sweet summer child. No, that’s not nearly enough.
Your users are completely diabolical. If you give a user a text box, they will find amazingly creative ways to stuff your database chock full of PII that you didn’t ask for, and you certainly don’t want. One of my favorite client examples was a help desk ticketing system where the support team routinely logged things like usernames, passwords, and other PII mixed in with their case notes. (Example: “Person needs help resetting their account, but can’t find their account. Tried searching under Jane Doe, Doe Jane, 123 Main Street, even tried their DOB of 1/2/1999, but I can’t find them, even though their credit card is getting billed under 5432-1234-9876-4567.”)
So when you start the journey of making your database safer by identifying and handling PII, it’s up to you to proactively dig through your database looking for this stuff. This kind of research work is often handed to database administrators because they’re trusted to look everywhere in the database, in every row and column. Plus, doing this kind of search is more efficient inside the database itself rather than on the app side – since it involves looking at every row and column.
This week, you’re that DBA, assigned that project.
Take any Stack Overflow database and examine the Users table. To keep things simple, I’m only going to ask you to look for email addresses.
This week’s challenge is a two-parter:
- Write a one-time query to check the Users table for email addresses. Do we have a problem with users putting email addresses where they don’t belong, and if so, what columns do we need to dig deeper into?
- Assuming we have a problem, write a query to find exactly which rows we need to address. Perhaps we’ll send them emails (ha ha ho ho) to log in and correct their profile, but we’re not actually going to modify their data.
Put your queries in a Github Gist and include those link in your comments. Check out the solutions from other folks, compare and contrast your work, and next week I’ll circle back with my answers & thoughts. Have fun!
Related

Hi! I’m Brent Ozar.
I make Microsoft SQL Server go faster. I love teaching, travel, cars, and laughing. I’m based out of Las Vegas. He/him. I teach SQL Server training classes, or if you haven’t got time for the pain, I’m available for consulting too.
Get Free SQL Stuff
"*" indicates required fields

12 Comments. Leave new
there was that time when I went to talk to the pension advisor at my local Credit Union and they turned their screen around and lo and behold there was my 2005 vintage password in the notes field.
apparently people were forgetting their passwords so often that their IT team decided to make life for the call centre easier by putting the password in the notes field of their CRM. Given they were a customer and all the grief they gave me with their security docs I had a serious sense of humour failure.
My attempt 😉
https://gist.github.com/JohanBrwrts/b0ed76467a15ec06de638ce5f7715d90
Meh, I might have went too hard…
https://gist.github.com/sql-king-za/096f11fa95fccb5fdbe0d10bfcdeea4e
hi all.
Dynamic SQL been use. main parameters (table name and field type) should be specified in template (!!! anchor). Some stuff could be extended (e.g. minimum column length, exclude columns etc)
Result is:
Id (primary key)
checked_column – column where email been found
checked_value – value of the column
https://gist.github.com/DeesyCom/caa1acabaeb39bc3788f55dcc64c3068
[…] Query Exercise: Looking for Email Addresses (Brent Ozar) […]
[…] this week’s Query Exercise, your mission had two […]
Version 2 of my query. Version 1 not published. It was turning into a State Machine built with CASE statements.
This version finds “valid”, if not well-formed, Local-Parts, then checks if the Domain name is mostly valid, according to RFC 5322 and related standards.
Carp. Used embedded link instead of URL.
https://gist.github.com/Baskaa/0dc06c2c66a88b8643d8e3b0c962256b
Personal Note: User 177973 Matthijs deserves more than 1 downvote.
Version 3
New and Improved! Now with Less Filtering, More Formatting, and Added Totals per Column! Still with the sweet, sweet Recursion we all love!
https://gist.github.com/Baskaa/eff96ac110598ee060a579b7964f12ad
[…] find out if any sneaky users have been putting PII in places that it shouldn’t belong, like this week’s Query Exercise challenge. We want to find all Users.Location values where the user put in something resembling an email […]