How Personally Identifiable Information Gets Around

Let’s say someone – not you, dear reader, because you would surely never be this sloppy – is building a personal web site or a diary or a company or whatever:

Yes, they should be encrypting passwords, and not storing ages, and certainly never storing identifying information like social security numbers. Unfortunately, that’s just often not how it goes in the real world, as any recent headline will tell you.

That data above could easily contain personally identifiable information (PII) – not to mention all kinds of cringeworthy data.

Queries can contain PII.

And let’s say this someone has performance problems with a query:

If they post that query on a public site, that’s a problem. I’ve seen plenty of similar examples in the wild for folks querying:

  • Inserts with lists of explicit values, like customers, patients, and one of my favorites, lists of diseases for a specific patient
  • Updates with lists of values, like set someone’s new address, the number of times they’ve been diagnosed with a particular disease, and their risk levels
  • Selects looking for a customer’s name, address, email address, or incriminating evidence phrases

Yep. Welcome to the database world. This is what your queries contain.

Which means PII is everywhere, including plain text on your SQL Server’s drives.

Blocked process reports. The system health session, which includes queries involved in deadlocks. Monitoring tools.

But here’s my favorite: when your SQL Server does a stack dump, the running query gets dumped out in plain text to the log directory like this – this example is from a corruption demo when I was running a DBCC CHECKDB command:

Dump in plain text

That’s plain text, stored on a drive our user probably never even thought to encrypt, bless his heart.

Execution plans can contain PII.

Say our intrepid user posts the execution plan – it looks pretty harmless, especially when it’s auto-parameterized, right? Whether it’s a stored proc or auto-parameterized, those parameters look innocent at first glance:

Looks simple enough

But edit the query plan XML, and you see the exact parameters used:

Parameters Put People in Prison

Yep, busted. And of course, same thing with your plan cache.

Statistics can contain PII.

I’ve seen folks get excited about the ability to clone a database complete with statistics. So say our user creates an index on EmailAddress in order to make queries go faster:

Creating an index automatically creates a stat, so if you script out the database including statistics, voila, people can see your data:

Statistics: hacker 1, privacy 0

You don’t have to get a database to get PII.

PII is like a virus: the worst users send it all over, voluntarily, without even realizing it. They think they’re just getting help with a query, and in reality, they’re infecting you with something you never wanted to begin with.

As data professionals, we’re in a really ugly place: we see data we never wanted to see. Our database systems and monitoring systems simply aren’t capable of hiding data from us. If we’re going to tune queries, that means we’re going to have to see queries. Unless we truly never touch a production server – which is certainly admirable and possible – that means we’re probably going to see PII.

I wish I had a happy ending to this post. I don’t.

Previous Post
Our 25 Most-Read and Most-Commented Blog Posts of 2017
Next Post
What’s So Bad About Shrinking Databases with DBCC SHRINKDATABASE?

23 Comments. Leave new

  • Ha.. I worked on a system once where the primary key was the SSN and this was a FK in many tables. Also password was not encrypted and neither was the hint
    Once the system was rewritten this was all changed.. data was salted and encrypted.. But it was kind of interesting looking at the passwords and hints for the 1st time…. felt soo dirty……

    Reply
  • It can be in your reports, too. Tableau, powerBI and Crystal can all have (some of them *must* have) the data baked into the report, ready to open as a csv.

    Drive your query with something like:

    select * from visits v
    inner join patients p on v.patient_id = p.patient_id

    and you might have all your patient’s info floating off into the ether.

    Reply
  • We had error logs that included plaintext xml column with all of the employees personal data from HR on a cloud application. It is a nightmare ready to explode. When a complaint was registered, HR denied access to the error from complainer. It resurfaced again 8 months later when another error was generated. Both of the errors were visible and both had the xml column.

    Reply
  • Cardinal Richelieu is reputed to have said something to the effect of ‘give me 6 lines by the most honest of men and I will find cause to hang him’

    The PII regulations serve that function. They pretty much guarantee that EVERY company will be guilty in some way when the regulators are looking for a scapegoat. And the cynical side of me feels that’s exactly what they want.

    Reply
    • It also assists in regulatory capture. Siemens, microsoft, etc, they can afford to comply with onerous regulations. Disruptive small competitors never get started because they can’t afford the same compliance.

      Reply
  • The frustrating part for me is when I point out the vulnerable data, and recommend some strategy to fix the problem, I am made out to be the ‘bad guy’ because I keep finding fault with the system, and making extra work for the developers. I think having a thick skin is essential to be a production DBA.

    Reply
    • This. I’m on pretty thin ice because I won’t shut up about problems. My company has dreams of going public, but we don’t meet even the weakest of best practices when it comes to actually protecting pii.

      Reply
  • Last week I had to explain what PII data was to our senior records officer. *Sigh*

    Reply
  • Little Bobby Tables
    December 28, 2017 1:27 pm

    And we now know that Ms. Locks reads XKCD as well.

    Reply
  • Not to be a buzzkill, but between government laptops being lost/stolen, data breeches (*cough Equifax*cough), and lack of knowledge at the government level (https://www.troyhunt.com/im-testifying-in-front-of-congress-in-washington-dc-about-data-breaches-what-should-i-say/) I can’t say I’m surprised to see PII leaked all over the digital world. If you took the PII black light to the proverbial internet bed sheets…..ya, it wouldn’t be good.

    Reply
  • Quis custodiet ipsos custodes has always applied to our profession.

    The responsibility, indeed the ethical _necessity_ to guard data, comes as part of that phrase you used: data professional. I know _all_ kinds of things I shouldn’t, acquired incidentally in the daily process of performing my work over 25+ years. Those little facts don’t go any further than my screen, period.

    This is one essential reason that database administrators are near the top of IT’s pay scale.

    Reply
  • Even if you take all the measures for data in memory and at rest….ultimately if you let someone query it, then sure as night follows day, you’ll find it months later in an Excel spreadsheet on 25 machines scattered throughout your organization 🙁

    Reply
  • Mohamed Ezzat
    January 1, 2018 4:07 am

    Nice Post

    Reply
  • Always good to inform others of the challenges that exist. PII and PHI represent real challenges for many. The fact is PII and PHI is required by business for many reasons. However, those who need this information / data also are responsible for protecting it while at rest or in motion. There are no magic solutions, but there are solutions nevertheless including encryption and GRC practices that help organizations keep a check on important data such as PII and PHI. PCI is a good example of an accepted practice. Although it is sponsored by the Payment Card Industry, it can be modified / adapted for protecting other types of data / information. Lastly there are tools available to scan content (source code included) for key words as well as patterns. Common examples including email scanners that scan messages and attachments for occurrences of key word use. When found the email is held and the designated staff is notified.

    I realize many of you already know this. However, for those that may not, I hope this post proves helpful.

    Reply
  • Correct me if I’m wrong, but I don’t believe that 2nd image (the query) is PII as it doesn’t identify a person, not without the rest of the dataset, as the query is just exposing an internal identifier and a rather vague data point (age). If it was exposing say first and surname that might be different.

    Reply
  • Ah, my mistake for not following closely enough

    Reply
  • Francesco Mantovani
    October 9, 2020 1:04 am

    SQL Server 2019 has a very interesting tool called “Vulnerability Assessment”. One of the assessment shows the sensitive data columns that should be classified.
    Of course the tool is based on the column name… but at least it can help you understand what to mask as soon as possible

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.