We needed to copy data from production to development, but since there is patient data, the personal information needs to be obfuscated. I couldn’t imagine this was a rare situation, yet was having trouble finding a good solution. Luckily, I did have Red Gate SQL Developer Bundle, which includes SQL Data Generator 2.
Now, I played with this tool for a long time, and didn’t think it would work. It didn’t allow me to update just selected columns with random data, it had to delete all of the data.
Then I thought, what if I create a new database with the same tables, and those CAN be deleted. Then I needed to figure out a way to update the original table with the random records. With a Google search, I found a solution from RBarry Young
I altered this, for my situation. Here is an example of using it to update several columns, like in an address for instance.
Now to back up, here are the steps
1. Create database with the table structures that include the personal or confidential data
2. Use SQL Data Generator to populate these tables with at least as many records as are in the original tables. Choose the DELETE records option. I won’t include a tutorial here – play around with it like I did, and let me know if you have questions.
3. Run the following scripts, substituting “DatabaseToChange” with your test database (copy of production in development), and substitute “DatabaseGenerated” with the name of the database that was created in step 1.
;WITH cteOriginalTable AS ( SELECT [Street1] ,[Street2] ,[City] , [State] ,[PostalCode] ,[AttnName],[ExtraAddressInfo], ROW_NUMBER() OVER (ORDER BY NEWID()) AS n FROM DatabaseToChange.dbo.[Address] ), cteGeneratedTable AS ( SELECT [Street1] ,[Street2] ,[City] , [State] ,[PostalCode] ,[AttnName],[ExtraAddressInfo], ROW_NUMBER() OVER (ORDER BY NEWID()) AS n FROM DatabaseGenerated.dbo.[Address] t21 ) UPDATE cteOriginalTable SET cteOriginalTable.Street1 = cteGeneratedTable.Street1, cteOriginalTable.Street2 = cteGeneratedTable.Street2, cteOriginalTable.City = cteGeneratedTable.City, cteOriginalTable.[State] = cteGeneratedTable.[State], cteOriginalTable.PostalCode = cteGeneratedTable.PostalCode, cteOriginalTable.AttnName = cteGeneratedTable.AttnName, cteOriginalTable.ExtraAddressInfo = cteGeneratedTable.ExtraAddressInfo FROM cteGeneratedTable WHERE cteOriginalTable.n = cteGeneratedTable.n
Then since it is personally identifiable information, I have to make really sure that EVERY record was updated. So I run this SQL to compare the fields.
SELECT A1.Street1, A2.Street1, A1.Street2, A2.Street2, A1.City, A2.City, A1.State, A2.State, A1.PostalCode, A2.PostalCode, A1.AttnName, A2.AttnName, A1.ExtraAddressInfo, A2.ExtraAddressInfo FROM DatabaseToChange.dbo.[Address] A1 INNER JOIN DatabaseGenerated.dbo.address A2 ON A1.AddressGUID = a2.AddressGUID WHERE A1.Street1 = A2.Street1 OR A1.Street2 = A2.Street2 OR A1.City = A2.City OR A1.[State] = A2.[State] OR A1.PostalCode = A2.PostalCode OR A1.AttnName = A2.AttnName OR A1.ExtraAddressInfo = A2.ExtraAddressInfo ORDER BY A1.Street1, A2.Street2