Data masking, data scrambling, and just plain old obfuscation: these are ways to get developers access to production-quality data for testing purposes without actually giving them real production data. It’s much harder than it looks, though.
The Easy Part: Obfuscating Data
Developers love working with production data. Today’s privacy-concerned companies aren’t quite so keen on giving developers complete access to that data, though, even from a restored backup of production. They’re worried about developers dumping out a list of customers, credit card numbers, birthdays, addresses, you name it, and then selling that data to a competitor.
Companies want to quickly obfuscate production data en route to production. They want to restore production databases in development, run some process, and then let developers test their apps without seeing real production data.
They don’t want to obfuscate every field – for example, financial transactions might want to preserve similar dollar data in order to make it easier to sanity-check reports. We can’t have tax rates being randomized, for example, because we need those to behave predictably when looking at invoices.
Hard Part #1: Maintaining the Storage Profile with Scrambled Data
The easy way to obfuscate data is to encrypt it. However, encrypting data tends to produce a totally different size of the data. To see an example in action, visit this encryption demo page. Click Random to generate a key, type something into the Plain Text, and click Encrypt:
Suddenly, the data’s much longer. That may not be a problem for varchar fields, but it’s a hell of a problem for ints, dates, credit card numbers, etc:
The wider the data becomes, the less fits per page in SQL Server. Encrypting client names suddenly makes each record wider, and that changes how queries perform. Ideally, the obscured data needs to be the same size as the original data.
Hard Part #2: Maintaining Statistics Distribution With Security
In a typical American phone book, there’s a lot of people with the last name of Smith. If we look at a typical histogram of that last name data, some last names are going to have a lot of records, and others won’t have many at all.
SQL Server builds statistics that have histograms showing distribution of data in each column, and then it uses those statistics to build execution plans. When we test SQL Server queries in development, we want to get similar variations in query plans.
It’s easy to obfuscate data by simply randomizing it – if we have a date field, just use a random number generator – but that won’t have the same distribution as our source data.
Ideally, the obscured data needs to have a similar distribution as the original data. If I have a People table with 1,000 records, all of which have a last name of Smith, then my obscured data should all also have the same last name. However, if my People table has 500 people named Smith and then 500 other people with unique last names, my obscured data needs to have 501 unique last names as well, one of which will have 500 records in the table.
(Note to security-savvy readers like K. Brian Kelley: yes, if we keep the same data length and distribution, it’s fairly easy to reverse-engineer data like names. This is the line we have to walk between real encryption and performance tuning production.)
Hard Part #3: Maintaining Referential Integrity While Masking Data
Sometimes the private data is part of the primary key of the table. Putting database design ideas aside, the reality is that some folks have things like SocialSecurityNumber as part of the PK. Even worse, some folks don’t put foreign key relationships in the database, so we end up with two SocialSecurityNumber fields on two different tables that need to join together.
And oh yeah – the fields may not have the same names. Think dbo.Person.SocialSecurityNumber joining to dbo.HealthPlanSubscriber.SubscriberSocialSecurityNumber. We can’t rely on foreign keys, either, because many folks don’t use referential integrity in the database.
Multiple databases might be involved, too. Sometimes customers have client data in one database, sales data in another, and configuration data in yet another, and they all need to link together.
Ideally, the solution would figure out joins where it could and keep the data the same in both tables, plus also let users specify fields that are joined together even if foreign keys aren’t specified in the database. This configuration data should be built once and saved so that users don’t have to repeat themselves every time they refresh production.
Hard Part #4: Fast Data Masking Tools
These situations often involve large amounts of data – say, a terabyte or more – and people want to refresh development with an obfuscated copy of production overnight. This means it usually isn’t practical to export all the data to some kind of app server (SSIS included) and then push it all back in. It also isn’t practical to update the same table repeatedly, once for each column that needs to be obfuscated.
Users will want status updates to know roughly how much of the database has been obfuscated, know how much is left to do, and know that if it breaks mid-process, they can fix a few things and then pick it back up where it left off.
Show Us Your Data Masking Best Practices
Have you run into a challenge like this out in the wild, and if so, how have you solved it? What data masking tools have you used, and have you found a good solution?