Let’s say someone – not you, dear reader, because you would surely never be this sloppy – is building a personal web site or a diary or a company or whatever:
CREATE TABLE dbo.People
INSERT INTO dbo.People
VALUES ('Bear', 'Papa', 45, 'email@example.com', 'honeylover11', '12127230155'),
('Bear', 'Mama', 40, 'firstname.lastname@example.org', 'forgotitagain', '05057719211'),
('Bear', 'Baby', 5, NULL, NULL, '070712355'),
('Locks', 'Goldi', 12, 'email@example.com', 'correcthorsebatterystaple', '020205366');
Yes, they should be encrypting passwords, and not storing ages, and certainly never storing identifying information like social security numbers. Unfortunately, that’s just often not how it goes in the real world, as any recent headline will tell you.
That data above could easily contain personally identifiable information (PII) – not to mention all kinds of cringeworthy data.
Queries can contain PII.
And let’s say this someone has performance problems with a query:
WHERE PersonNumber = '020205366'
AND Age = 12;
If they post that query on a public site, that’s a problem. I’ve seen plenty of similar examples in the wild for folks querying:
- Inserts with lists of explicit values, like customers, patients, and one of my favorites, lists of diseases for a specific patient
- Updates with lists of values, like set someone’s new address, the number of times they’ve been diagnosed with a particular disease, and their risk levels
- Selects looking for a customer’s name, address, email address, or incriminating evidence phrases
Yep. Welcome to the database world. This is what your queries contain.
Which means PII is everywhere, including plain text on your SQL Server’s drives.
Blocked process reports. The system health session, which includes queries involved in deadlocks. Monitoring tools.
But here’s my favorite: when your SQL Server does a stack dump, the running query gets dumped out in plain text to the log directory like this – this example is from a corruption demo when I was running a DBCC CHECKDB command:
That’s plain text, stored on a drive our user probably never even thought to encrypt, bless his heart.
Execution plans can contain PII.
Say our intrepid user posts the execution plan – it looks pretty harmless, especially when it’s auto-parameterized, right? Whether it’s a stored proc or auto-parameterized, those parameters look innocent at first glance:
But edit the query plan XML, and you see the exact parameters used:
Yep, busted. And of course, same thing with your plan cache.
Statistics can contain PII.
I’ve seen folks get excited about the ability to clone a database complete with statistics. So say our user creates an index on EmailAddress in order to make queries go faster:
CREATE INDEX IX_EmailAddress ON dbo.People(EmailAddress);
Creating an index automatically creates a stat, so if you script out the database including statistics, voila, people can see your data:
You don’t have to get a database to get PII.
PII is like a virus: the worst users send it all over, voluntarily, without even realizing it. They think they’re just getting help with a query, and in reality, they’re infecting you with something you never wanted to begin with.
As data professionals, we’re in a really ugly place: we see data we never wanted to see. Our database systems and monitoring systems simply aren’t capable of hiding data from us. If we’re going to tune queries, that means we’re going to have to see queries. Unless we truly never touch a production server – which is certainly admirable and possible – that means we’re probably going to see PII.
I wish I had a happy ending to this post. I don’t.