How Do You Mask Data for Secure Testing?

Last Updated September 1, 2011

Data masking, data scrambling, and just plain old obfuscation: these are ways to get developers access to production-quality data for testing purposes without actually giving them real production data. It’s much harder than it looks, though.

The Easy Part: Obfuscating Data

Developers love working with production data. Today’s privacy-concerned companies aren’t quite so keen on giving developers complete access to that data, though, even from a restored backup of production. They’re worried about developers dumping out a list of customers, credit card numbers, birthdays, addresses, you name it, and then selling that data to a competitor.

Companies want to quickly obfuscate production data en route to production. They want to restore production databases in development, run some process, and then let developers test their apps without seeing real production data.

They don’t want to obfuscate every field – for example, financial transactions might want to preserve similar dollar data in order to make it easier to sanity-check reports. We can’t have tax rates being randomized, for example, because we need those to behave predictably when looking at invoices.

Hard Part #1: Maintaining the Storage Profile with Scrambled Data

The easy way to obfuscate data is to encrypt it. However, encrypting data tends to produce a totally different size of the data. To see an example in action, visit this encryption demo page. Click Random to generate a key, type something into the Plain Text, and click Encrypt:

Suddenly, the data’s much longer. That may not be a problem for varchar fields, but it’s a hell of a problem for ints, dates, credit card numbers, etc:

The wider the data becomes, the less fits per page in SQL Server. Encrypting client names suddenly makes each record wider, and that changes how queries perform. Ideally, the obscured data needs to be the same size as the original data.

Hard Part #2: Maintaining Statistics Distribution With Security

In a typical American phone book, there’s a lot of people with the last name of Smith. If we look at a typical histogram of that last name data, some last names are going to have a lot of records, and others won’t have many at all.

SQL Server builds statistics that have histograms showing distribution of data in each column, and then it uses those statistics to build execution plans. When we test SQL Server queries in development, we want to get similar variations in query plans.

It’s easy to obfuscate data by simply randomizing it – if we have a date field, just use a random number generator – but that won’t have the same distribution as our source data.

Ideally, the obscured data needs to have a similar distribution as the original data. If I have a People table with 1,000 records, all of which have a last name of Smith, then my obscured data should all also have the same last name. However, if my People table has 500 people named Smith and then 500 other people with unique last names, my obscured data needs to have 501 unique last names as well, one of which will have 500 records in the table.

(Note to security-savvy readers like K. Brian Kelley: yes, if we keep the same data length and distribution, it’s fairly easy to reverse-engineer data like names. This is the line we have to walk between real encryption and performance tuning production.)

Hard Part #3: Maintaining Referential Integrity While Masking Data

Sometimes the private data is part of the primary key of the table. Putting database design ideas aside, the reality is that some folks have things like SocialSecurityNumber as part of the PK. Even worse, some folks don’t put foreign key relationships in the database, so we end up with two SocialSecurityNumber fields on two different tables that need to join together.

And oh yeah – the fields may not have the same names. Think dbo.Person.SocialSecurityNumber joining to dbo.HealthPlanSubscriber.SubscriberSocialSecurityNumber. We can’t rely on foreign keys, either, because many folks don’t use referential integrity in the database.

Multiple databases might be involved, too. Sometimes customers have client data in one database, sales data in another, and configuration data in yet another, and they all need to link together.

Ideally, the solution would figure out joins where it could and keep the data the same in both tables, plus also let users specify fields that are joined together even if foreign keys aren’t specified in the database. This configuration data should be built once and saved so that users don’t have to repeat themselves every time they refresh production.

Hard Part #4: Fast Data Masking Tools

These situations often involve large amounts of data – say, a terabyte or more – and people want to refresh development with an obfuscated copy of production overnight. This means it usually isn’t practical to export all the data to some kind of app server (SSIS included) and then push it all back in. It also isn’t practical to update the same table repeatedly, once for each column that needs to be obfuscated.

Users will want status updates to know roughly how much of the database has been obfuscated, know how much is left to do, and know that if it breaks mid-process, they can fix a few things and then pick it back up where it left off.

Show Us Your Data Masking Best Practices

Have you run into a challenge like this out in the wild, and if so, how have you solved it? What data masking tools have you used, and have you found a good solution?

Indexing for Deletes

Automated Tiered Storage for Databases

23 Comments. Leave new

Randy
September 5, 2011 11:56 am

We’re facing all of this right now. Safenet encryption products were purchased by our parent corporation so putting that into place is mandatory. Our Dev team had identified about six fields they thought we would need to obfuscate for copies of the production database pushed out to development sql servers.
My analysis found more like 90 fields. Our prod database is at about 40k rows and 1.2 TB with 2,437 fields in all tables.
I made the assumption that data we’d need to encrypt or obfuscate would only be “text” type fields such as char, varchar since things like dates or numbers coming to us from our clients might contain non-date or non-numeric values that could error upon load. ( so no need to encrypt true date or true numeric fields )
Safenet won’t encrypt true TEXT data types so the few of those we have like that may have to be converted to varchar(max).
One manager said he thought notes placed in TEXT ( or varchar(max) ) fields by our customers ( via the GUI ) would not be our responsibility. We do see some personal info in those fields. I’m not so sure we wouldn’t be held responsible if an SSN or even credit card number was typed in there by a client.
Safenet documentation says “primary keys are dropped during encryption and must be recreated manually.” I believe that means if you are actually encrypting the primary key ( I hope it doesn’t just drop all primary keys ).
Other fun stuff includes the impact to LIKE statements in where clauses and anything that could be considered a range scan. We do have those.
Safenet has another product that works with .NET so the where clause parameters can be encrypted before hitting the database — this means you don’t have to have triggers in the database for encryption/decryption. Yup, it all gets very complicated.

Reply
Adam Machanic
September 5, 2011 10:06 pm

Ahh, you and hard parts.

Reply
- Brent Ozar
  September 6, 2011 7:20 am
  
  HAHAHA, zing.
  
  Reply
Thomas Resing
September 6, 2011 8:55 am

Interesting question and good points I hadn’t thought about before. I’m normally the developer wanting the production data, but sometimes I play the DBA, too. Surely you have some thoughts on how to do this right? Or, is it merely a statement of how far we are from privacy safety as an industry?

Reply
Nigel Ainscoe
September 10, 2011 4:24 am

I get around this by simply not allowing any client data from the production system to be put into dev or test environments. I use data generation tools to create all dev and test datasets (I use the RedGate tool personally but I’m sure there are others). This takes a little more time than simply restoring a production backup but it also impresses the clients and the PCI DSS auditors when they come around.

Reply
Randy
September 10, 2011 6:49 am

Nigel, I hear what you’re saying. The most interesting thing you said, to me anyway,was “simply not allowing.” In 9 years at my present company I’ve seen precious few examples where anyone in IT could make a hard and fast rule like that.

Things seem to be driven by other portions of the business. However, as we’ve grown and the sophistication of our clients’ own IT departments has increased, audit requirements are ratcheting up so things are changing for the better.

Reply
Chuck Rummel
September 10, 2011 9:46 pm

An excellent topic to consider, and not just from the testing point of view. I second Nigel’s approach that the best way to deal with it is to not allow it to begin with, though I also agree with Randy’s 2nd comment that it’s easier said than done. Much has to do with which regulations you are subject to, PCI DSS, HIPAA, EU privacy laws, recent legislation in MA or CA, etc. In recent times all kinds of PII (personally identifiable information) have come under scrutiny, with some rules going so far to say it all has to be encrypted, period. Depending on how you do that, whether through TDE or by encrypting the data on the app side before it even gets to the db, may make some of these questions disappear (and others show in their place.) For example, if data is encrypted before getting to the db, you likely eliminate the possibility that the db can do any partial match and maybe even any range lookups. Is your app (or even your whole company) ready for that?

Reply
Alex
September 12, 2011 12:08 pm

I just found a cool tool for generating data at http://sourceforge.net/projects/generatedata/

It can be used to replace the identifiable bits.

Reply
- Brent Ozar
  September 12, 2011 12:10 pm
  
  Alex – thanks. There’s plenty of data *generation* tools out there, but the challenge is replacing things like primary keys that are actually the identifiable data. The data may be present in several tables, and if you read the post a little more closely, you’ll see that you can’t simply replace the data randomly in each table. (sigh) This is tougher than it looks at first glance!
  
  Reply
Dennis Suchta
September 23, 2011 11:27 am

Is there any way to mask the data and preserve statistics? Have come across the question several times but never a putative solution.

Thanks – Dennis

Reply
- Brent Ozar
  September 23, 2011 11:31 am
  
  Dennis – sort of. You can script out the statistics of the database, but those statistics will include some of the data (for example customer names) so no.
  
  Reply
Arthur (@ArthurZ)
February 29, 2012 10:06 am

For some basic data masking using SSIS you can use my free Data Masker Data Flow Component http://ssisdatamasker.codeplex.com/

Reply
Brent Ozar
December 6, 2013 6:51 am

Terry – thanks for the comment. Obviously I can’t speak to Dataveil’s capabilities on any of those fronts, but I got a huge chuckle out of this:

“DataVeil is very fast. All the masking takes place on the DBMS server.”

Ah, if only it were true that everything that happens on a database server is very fast. 😉

Reply
- Virginia
  December 31, 2013 2:16 am
  
  Brent,
  
  we completely agree with the last comment. We would like to say, that we also are a masking vendor, and very specific to SQL Server. We do agree that one should not do specific operations in the database, such as string operations, and overall, the less is done in the Database, the better its performance. Thus, we don’t do any masking functions in the DB.
  
  Re: referential integrity…. I would challenge the assumption that only with discovery tools one can preserve it, there exist other, implicit ways, based on the semantics attributes ( which all these tools use explicitly by design). Anybody is welcome to “discover” how by calling us (855-YOU-HUSH), visiting our site, http://mask-me.net, or by visiting one of our upcoming webinars. These webinars will also outline why discovery should happen before and not after the fields are introduced into the database. It is simply data security processes common sense.
  
  However, one does need to discover the proper order of operations in order to maintain referential integrity in the process of masking. A lot of tools in database architecture such as ER Studio already allow it, and if the referential integrity is in the code layer, see above.
  
  Besides, I would be very wary of only using “discovery” tools. They IMHO provide a false sense of security. Many reasons, but the very first one that comes to mind is that human fantasy is so vast and variability is that much, one can’t guarantee the proper discovery for de-identification with the tool only. HIPAA addresses such predicament with the specific language in the requirements. Majority of the fields will be covered by using the tool, but the devil is often in the detail, just ask Target or MoneyGram.
  
  If you wonder how you can do data masking within the SQL server stack of technology by drag and drop capability, without learning any new tools, please visit our site as well: Mask-me.net, and order your trial version. We listened to what you told us during SQL Saturdays and implemented exactly what you requested. We do not do it for free, but you get your personalized security, while you practically need to learn nothing and practically code and configure nothing beyond drag and drop.This much we confirmed with you during SQL Saturdays. Your time is your employer’s money, isn’t it?
  
  and thank you, Brent, for the wonderful article!
  
  Reply
Virginia
May 9, 2014 11:25 am

I was thinking about what you said about it not being practical to move data into application layer in the situations that involve large amounts of data to refresh development with an obfuscated copy of production overnight.

I have found this not to be an issue when we use our tool under several different architectures. Depending on the expected volume, one scenario would BCP all the data originally with later on BCP-ing differentials ( sometimes called deltas) in and merging, another would be when you would expect consistent terabyte-level increase, you would already have your tables partitioned for performance, like in this article http://technet.microsoft.com/en-us/library/dd537533%28v=sql.100%29.aspx . I’m wondering how that compares to what you’ve seen or experienced. Could you comment?

Reply
parul
May 4, 2016 7:04 am

i m encrypting the coloumn but for same unique key it is giving different encrypted value.I want it give me same encrypted value for same id like id is 123 its encrypted value should be same in all time whenever i encrypt it I am doing it in postgresql 9.1.please give me some proper solution….

Reply
- Brent Ozar
  May 4, 2016 8:10 am
  
  Parul – this blog is about Microsoft SQL Server. For PostgreSQL help, head on over to http://DBA.StackExchange.com.
  
  Reply
Virginia Mushkatbat
August 17, 2016 8:47 am

Brent, I hope you do not mind HushHush helping Parul. Parul, we can help you with your request on Postgres, this is the situation when SSIS is indeed superior to SQL Server engine as the components are database agnostic.

Cheers, Virginia

Reply
- Brent Ozar
  August 17, 2016 8:54 am
  
  Virginia – I’d rather companies not use our blog comments to sell their products. Thanks.
  
  Reply
Abduraghman Alexander
July 20, 2018 4:52 am

Has anyone here use SQL Data Masker to achieve this?…If not, what other solutions are there? At the moment we manual obfuscate the data with a script that someone else wrote and we are now left to maintain, there has to be a better way, right?…

Reply
- Brent Ozar
  July 20, 2018 5:08 am
  
  That’s what the post is about, explaining why there’s not a great magic button to make this stuff work. Redgate has a new beta out of their next-generation Data Masking tool – normally I’d say to avoid beta apps in production, but since this is the kind of tool you’d likely use in dev only anyway, why not give it a spin: https://www.red-gate.com/products/dba/data-masker/entrypage/projectfigleaf
  
  Reply
Kim Crosser
September 5, 2022 9:25 pm

Yes – this post is really old, but this may be of use to someone…

I had to “scrub” a law enforcement database so that it could be shared with a vendor to allow them to test new product releases against data. The objectives were to mask all PII, any other identifying information, and any narrative/free text information, BUT maintain (as much as possible) the data distributions, so that the vendor software would be operating on “like real” data for performance testing.

Name fields were actually pretty easy. I downloaded the U.S. Census data for first and last names. The conversion process generated a mapping table for each source name value to a pseudo-randomly selected target name value (the “random” selection actually used the Census frequency to weight the selection, so names with frequencies like for “Smith” were far more likely to be picked than “Mxyzptlk”. (I used a simple, stupid algorithm – convert frequencies to integer values, then populate a static selection table based on the integer value – if the frequency converted to 1000, then fill 1000 slots with that name.) Then, as each data record was processed do a random pick from the entire table. (N.b., for each “Case”, the mapping was recorded in a translation table, so that all internal references to “John Smith” in a single Case wound up referring to the same name, but “John Smith” in another Case would likely point to a different name. (Yes – this broke some cross-Case associations, but that was deemed acceptable.)
ID fields (DLN, SSN, etc.) were simply replaced by randomized strings of the same lengths, as these are rarely searched by other than an exact value, so distribution frequency wasn’t an issue.
Date of Birth fields were randomized to another date within a +/- 6-month range of the original date. (Note that this did have a minor issue with some minors becoming adults and vice versa, but that was deemed acceptable.)
Free text fields were simply replaced with a static string of “lorem ipsum dolor…” text of approximately the same length as the original (the “lorem ipsum dolor” static text was repeated as necessary for long text fields).
Addresses were tricky – geovalidation won’t work if the street address, city, and zip don’t match. So each address was written to a mapping table, with the actual house number (or P.O./PMB Box number) randomized to a number in the same approximate range (e.g., 4000 Main could wind up as anything between 3001 and 5999 Main). Then, during conversion, a cross-reference of source addresses to randomly selected target addresses was created. So 1234 Main St could wind up translating to 7043 First Ave, but every reference to a specific source address would translate to the same target address.
Event dates were the most difficult, as certain relative date constraints were required to be maintained (e.g., no “arrests” could occur before the “incident” occurred). This required a very custom process that collected all the dates for a “Case” in date order, then applied a semi-random adjustment of the date to each date in turn, ensuring that each subsequent date did not overlap the preceding one. This translation was done on a per-Case basis, so there was no consistent pattern to be found in the generated dates.

After the “scrubbed” data was created, the mapping tables were all dropped. Between the random selections and the removal of the mapping tables, there was virtually no way to recreate the original information or identify the real persons involved. However, the data distribution of the critical fields was maintained, so performance testing could be performed as though on the live production system.

The agency’s oversight team reviewed the approach and the resulting data and approved it for export to the vendor.

Reply
- Thomas Franz
  September 6, 2022 1:45 pm
  
  sounds a very solid solution.
  
  The only weak point I see at first glance could be if you COPY the database (instead of exporting / backing up + restoring), because the mapping tables would be still in the unused space of databases (if you drop something, the pages will just be marked as unused and not be overwritten; a backup will skip those pages so a restore would be fine too).
  
  Reply