How Do You Mask Data for Secure Testing?

Data masking, data scrambling, and just plain old obfuscation: these are ways to get developers access to production-quality data for testing purposes without actually giving them real production data.  It’s much harder than it looks, though.

The Easy Part: Obfuscating Data

Developers love working with production data.  Today’s privacy-concerned companies aren’t quite so keen on giving developers complete access to that data, though, even from a restored backup of production.  They’re worried about developers dumping out a list of customers, credit card numbers, birthdays, addresses, you name it, and then selling that data to a competitor.

Companies want to quickly obfuscate production data en route to production.  They want to restore production databases in development, run some process, and then let developers test their apps without seeing real production data.

They don’t want to obfuscate every field – for example, financial transactions might want to preserve similar dollar data in order to make it easier to sanity-check reports.  We can’t have tax rates being randomized, for example, because we need those to behave predictably when looking at invoices.

Hard Part #1: Maintaining the Storage Profile with Scrambled Data

The easy way to obfuscate data is to encrypt it.  However, encrypting data tends to produce a totally different size of the data.  To see an example in action, visit this encryption demo page.  Click Random to generate a key, type something into the Plain Text, and click Encrypt:

Encryption Demo 1

Encryption Demo 1

Suddenly, the data’s much longer.  That may not be a problem for varchar fields, but it’s a hell of a problem for ints, dates, credit card numbers, etc:

Encryption Demo 2

Encryption Demo 2

The wider the data becomes, the less fits per page in SQL Server.  Encrypting client names suddenly makes each record wider, and that changes how queries perform.  Ideally, the obscured data needs to be the same size as the original data.

Hard Part #2: Maintaining Statistics Distribution With Security

In a typical American phone book, there’s a lot of people with the last name of Smith.  If we look at a typical histogram of that last name data, some last names are going to have a lot of records, and others won’t have many at all.

SQL Server builds statistics that have histograms showing distribution of data in each column, and then it uses those statistics to build execution plans.  When we test SQL Server queries in development, we want to get similar variations in query plans.

It’s easy to obfuscate data by simply randomizing it – if we have a date field, just use a random number generator – but that won’t have the same distribution as our source data.

Ideally, the obscured data needs to have a similar distribution as the original data.  If I have a People table with 1,000 records, all of which have a last name of Smith, then my obscured data should all also have the same last name.  However, if my People table has 500 people named Smith and then 500 other people with unique last names, my obscured data needs to have 501 unique last names as well, one of which will have 500 records in the table.

(Note to security-savvy readers like K. Brian Kelley: yes, if we keep the same data length and distribution, it’s fairly easy to reverse-engineer data like names.  This is the line we have to walk between real encryption and performance tuning production.)

Hard Part #3: Maintaining Referential Integrity While Masking Data

Sometimes the private data is part of the primary key of the table.  Putting database design ideas aside, the reality is that some folks have things like SocialSecurityNumber as part of the PK.  Even worse, some folks don’t put foreign key relationships in the database, so we end up with two SocialSecurityNumber fields on two different tables that need to join together.

And oh yeah – the fields may not have the same names.  Think dbo.Person.SocialSecurityNumber joining to dbo.HealthPlanSubscriber.SubscriberSocialSecurityNumber.  We can’t rely on foreign keys, either, because many folks don’t use referential integrity in the database.

Multiple databases might be involved, too.  Sometimes customers have client data in one database, sales data in another, and configuration data in yet another, and they all need to link together.

Ideally, the solution would figure out joins where it could and keep the data the same in both tables, plus also let users specify fields that are joined together even if foreign keys aren’t specified in the database.  This configuration data should be built once and saved so that users don’t have to repeat themselves every time they refresh production.

Hard Part #4: Fast Data Masking Tools

These situations often involve large amounts of data – say, a terabyte or more – and people want to refresh development with an obfuscated copy of production overnight.  This means it usually isn’t practical to export all the data to some kind of app server (SSIS included) and then push it all back in.  It also isn’t practical to update the same table repeatedly, once for each column that needs to be obfuscated.

Users will want status updates to know roughly how much of the database has been obfuscated, know how much is left to do, and know that if it breaks mid-process, they can fix a few things and then pick it back up where it left off.

Show Us Your Data Masking Best Practices

Have you run into a challenge like this out in the wild, and if so, how have you solved it?  What data masking tools have you used, and have you found a good solution?

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

More Posts - Website

Follow Me:
TwitterFacebookLinkedInGoogle PlusYouTube

13 Responses to How Do You Mask Data for Secure Testing?
  1. Randy
    September 5, 2011 | 11:56 AM

    We’re facing all of this right now. Safenet encryption products were purchased by our parent corporation so putting that into place is mandatory. Our Dev team had identified about six fields they thought we would need to obfuscate for copies of the production database pushed out to development sql servers.
    My analysis found more like 90 fields. Our prod database is at about 40k rows and 1.2 TB with 2,437 fields in all tables.
    I made the assumption that data we’d need to encrypt or obfuscate would only be “text” type fields such as char, varchar since things like dates or numbers coming to us from our clients might contain non-date or non-numeric values that could error upon load. ( so no need to encrypt true date or true numeric fields )
    Safenet won’t encrypt true TEXT data types so the few of those we have like that may have to be converted to varchar(max).
    One manager said he thought notes placed in TEXT ( or varchar(max) ) fields by our customers ( via the GUI ) would not be our responsibility. We do see some personal info in those fields. I’m not so sure we wouldn’t be held responsible if an SSN or even credit card number was typed in there by a client.
    Safenet documentation says “primary keys are dropped during encryption and must be recreated manually.” I believe that means if you are actually encrypting the primary key ( I hope it doesn’t just drop all primary keys ).
    Other fun stuff includes the impact to LIKE statements in where clauses and anything that could be considered a range scan. We do have those.
    Safenet has another product that works with .NET so the where clause parameters can be encrypted before hitting the database — this means you don’t have to have triggers in the database for encryption/decryption. Yup, it all gets very complicated.

  2. Adam Machanic
    September 5, 2011 | 10:06 PM

    Ahh, you and hard parts.

  3. Thomas Resing
    September 6, 2011 | 8:55 AM

    Interesting question and good points I hadn’t thought about before. I’m normally the developer wanting the production data, but sometimes I play the DBA, too. Surely you have some thoughts on how to do this right? Or, is it merely a statement of how far we are from privacy safety as an industry?

  4. [...] How Do You Mask Data for Secure Testing? - Brent Ozar (Blog|Twitter) discusses the practice of Data Masking and asks how you go about it for your projects. [...]

  5. Nigel Ainscoe
    September 10, 2011 | 4:24 AM

    I get around this by simply not allowing any client data from the production system to be put into dev or test environments. I use data generation tools to create all dev and test datasets (I use the RedGate tool personally but I’m sure there are others). This takes a little more time than simply restoring a production backup but it also impresses the clients and the PCI DSS auditors when they come around.

  6. Randy
    September 10, 2011 | 6:49 AM

    Nigel, I hear what you’re saying. The most interesting thing you said, to me anyway,was “simply not allowing.” In 9 years at my present company I’ve seen precious few examples where anyone in IT could make a hard and fast rule like that.

    Things seem to be driven by other portions of the business. However, as we’ve grown and the sophistication of our clients’ own IT departments has increased, audit requirements are ratcheting up so things are changing for the better.

  7. Chuck Rummel
    September 10, 2011 | 9:46 PM

    An excellent topic to consider, and not just from the testing point of view. I second Nigel’s approach that the best way to deal with it is to not allow it to begin with, though I also agree with Randy’s 2nd comment that it’s easier said than done. Much has to do with which regulations you are subject to, PCI DSS, HIPAA, EU privacy laws, recent legislation in MA or CA, etc. In recent times all kinds of PII (personally identifiable information) have come under scrutiny, with some rules going so far to say it all has to be encrypted, period. Depending on how you do that, whether through TDE or by encrypting the data on the app side before it even gets to the db, may make some of these questions disappear (and others show in their place.) For example, if data is encrypted before getting to the db, you likely eliminate the possibility that the db can do any partial match and maybe even any range lookups. Is your app (or even your whole company) ready for that?

  8. Alex
    September 12, 2011 | 12:08 PM

    I just found a cool tool for generating data at http://sourceforge.net/projects/generatedata/

    It can be used to replace the identifiable bits.

    • Brent Ozar
      September 12, 2011 | 12:10 PM

      Alex – thanks. There’s plenty of data *generation* tools out there, but the challenge is replacing things like primary keys that are actually the identifiable data. The data may be present in several tables, and if you read the post a little more closely, you’ll see that you can’t simply replace the data randomly in each table. (sigh) This is tougher than it looks at first glance!

  9. Dennis Suchta
    September 23, 2011 | 11:27 AM

    Is there any way to mask the data and preserve statistics? Have come across the question several times but never a putative solution.

    Thanks – Dennis

    • Brent Ozar
      September 23, 2011 | 11:31 AM

      Dennis – sort of. You can script out the statistics of the database, but those statistics will include some of the data (for example customer names) so no.

  10. Arthur (@ArthurZ)
    February 29, 2012 | 10:06 AM

    For some basic data masking using SSIS you can use my free Data Masker Data Flow Component http://ssisdatamasker.codeplex.com/

Leave a Reply

Wanting to leave an <em>phasis on your comment?

Notify me of followup comments via e-mail. You can also subscribe without commenting.