Masked Data

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

diamondabhi
Premium Member
Posts: 108
Joined: Sat Feb 05, 2005 6:52 pm
Location: US

Post by diamondabhi »

Hi All,
Is there any technique in DataStage that will allow us to mask production data to create test data? We have a project that includes sensitive data: SSN, hourly wage, name, address, etc. We would like realistic test data that is not the actual production data.

Thanks,
Abhi.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

In the past I've written a number of jobs that take the "live" data and run it through a transform which either randomizes columns or masks information, depending upon what needs to be done. In many cases randomizing would ruin the correlations (e.g. if the customer name is the key for a lookup), so a simple text-based encoding would be used instead (e.g. subtract 13 from the ASCII value of each character) and applied to every occurrence of that name. This can end up being a lot of work if the keys are the same fields that need to be masked. Then again, the big project I was working on involved millions of credit card records and transactions, so the anonymizing step was quite important during the initial test phase.
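
To illustrate why a deterministic encoding keeps lookups intact where randomizing would not, here is a minimal sketch. It is written in Python rather than DS Basic so it can be run outside DataStage, and it is not the code from the jobs described above. The function name `shift_mask`, the sample rows, and the wrap-around within printable ASCII are all assumptions made for the illustration; the post only says to subtract 13 from each character code.

```python
def shift_mask(text, offset=13):
    """Shift each printable ASCII character code down by `offset`.

    Wrapping inside the printable range (codes 32..126) is an assumption
    added here so the output stays printable.
    """
    low, span = 32, 95  # printable ASCII: codes 32 through 126
    out = []
    for ch in text:
        code = ord(ch)
        if low <= code < low + span:
            out.append(chr((code - low - offset) % span + low))
        else:
            out.append(ch)  # leave anything outside printable ASCII untouched
    return "".join(out)


# Hypothetical sample data: the customer name is the lookup key in both tables.
customers = [{"name": "Smith", "city": "Boston"}]
orders = [{"customer_name": "Smith", "amount": 120.50}]

# The same function is applied to every occurrence of the key.
masked_customers = [dict(row, name=shift_mask(row["name"])) for row in customers]
masked_orders = [
    dict(row, customer_name=shift_mask(row["customer_name"])) for row in orders
]

# The masked key still matches across tables, so the lookup keeps working.
print(masked_customers[0]["name"] == masked_orders[0]["customer_name"])
```

Because the mapping is fixed, the same name always encodes to the same value, which is what preserves the join. It is also trivially reversible (apply the opposite offset), so this is obfuscation for test data rather than protection.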
diamondabhi
Premium Member
Posts: 108
Joined: Sat Feb 05, 2005 6:52 pm
Location: US

Post by diamondabhi »

ArndW,
Can you elaborate on how to mask the data? Or, if you have already built some jobs, can you post some code for me?

Thanks,
Sai.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Diamondabhi,

I won't post any jobs or code for this, but will elaborate a bit:

If you have numerics that need to be masked, use the RND() function, which returns a pseudo-random number [it is important to seed it with a constant value if you need reproducible numbers on subsequent runs]. So if you have a currency amount you want masked (e.g. 45432.00), I would either generate a completely random number, as in RND(9999999)/100, or use RND(100)/10*In.Currency to derive the new value from the original. Text fields with names that need to stay consistent between tables (key fields) can be masked:

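Code:

Function Mask(InputString)
* Alter characters by position so the same input always gives the same output.
StringLen = LEN(InputString)
Ans = ''
FOR i = 1 TO StringLen
   Modulus = MOD(i,3)
   BEGIN CASE
      CASE Modulus = 0
         * Every 3rd character: shift down by one
         Ans := CHAR(SEQ(InputString[i,1])-1)
      CASE Modulus = 1
         * 1st, 4th, 7th ... character: shift up by one
         Ans := CHAR(SEQ(InputString[i,1])+1)
      CASE 1
         * Remaining characters are copied unchanged
         Ans := InputString[i,1]
   END CASE
NEXT i
RETURN(Ans)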
This example is a bit too simple, but it should give you an idea of what can be done. You could even go in and use a SOUNDEX conversion, but that would give you a lot of duplicates.
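
To see concretely what the routine above produces, and why it is "a bit too simple", here is the same position-based logic as a runnable sketch. It is a Python transcription for illustration only (so it can be tried outside DataStage), and the sample inputs are made up.

```python
def mask(input_string):
    """Python transcription of the position-based Mask routine."""
    out = []
    for i, ch in enumerate(input_string, start=1):  # 1-based, like the BASIC loop
        modulus = i % 3
        if modulus == 0:
            out.append(chr(ord(ch) - 1))  # every 3rd character: shift down by one
        elif modulus == 1:
            out.append(chr(ord(ch) + 1))  # 1st, 4th, 7th ... character: shift up
        else:
            out.append(ch)                # remaining characters are unchanged
    return "".join(out)


print(mask("SMITH"))        # TMHUH
print(mask("123-45-6789"))  # 222.44.6699
```

The output shows the limitations: a third of the characters pass through untouched, separators turn into other punctuation, and a "9" in a shift-up position or a "0" in a shift-down position would leave the digit range (becoming ":" or "/").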

Perhaps you could explain if you need a simple or complex masking done - and specifically for which type of constructs.
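
For the numeric masking described earlier in this post, the seeded pseudo-random approach could be sketched roughly as follows. This is Python, not DS Basic, and only an analogue: `random.Random(seed)` stands in for seeding with a constant, and `randrange(n)` stands in for RND(n) on the assumption that RND(n) yields an integer from 0 to n-1. The function names and the seed value are hypothetical.

```python
import random


def mask_amounts(amounts, seed=12345):
    """Scale each amount by a pseudo-random factor, reproducibly."""
    rng = random.Random(seed)  # constant seed: same sequence on every run
    masked = []
    for amount in amounts:
        # Analogue of RND(100)/10*In.Currency: a factor from 0.0 to 9.9.
        # A factor of 0.0 is possible, which would mask the amount to zero.
        factor = rng.randrange(100) / 10
        masked.append(round(factor * amount, 2))
    return masked


def random_amounts(count, seed=12345):
    """Generate amounts unrelated to the originals."""
    rng = random.Random(seed)
    # Analogue of RND(9999999)/100: a value from 0.00 to 99999.98.
    return [rng.randrange(9999999) / 100 for _ in range(count)]


first_run = mask_amounts([45432.00, 10.00])
second_run = mask_amounts([45432.00, 10.00])
print(first_run == second_run)  # True: the constant seed makes runs repeatable
```

The first variant keeps some relationship to the original value, the second discards it entirely; which one fits depends on whether the tests rely on the magnitude of the amounts.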
diamondabhi
Premium Member
Posts: 108
Joined: Sat Feb 05, 2005 6:52 pm
Location: US

Post by diamondabhi »

ArndW,
Thanks a lot, Arnd. I need simple code for masking the SSN and EmpName columns.

Thanks,
Abhi.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

If all you want is a few rows in the file, you can generate them yourself using a transformer and a stage variable, and write them out to any file you want.

You can pass your SSN to EmpName to match them during testing.
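
As a rough illustration of generating a few rows from a counter, here is a sketch in Python (not a DataStage job). The counter plays the role of the stage variable, and the name is derived from the SSN, which is one reading of "pass your SSN to EmpName". The function name, the starting value, and the `EMP_` prefix are all hypothetical.

```python
def generate_rows(count, start=900000001):
    """Build `count` synthetic rows whose EmpName is derived from the SSN."""
    rows = []
    counter = start  # plays the role of the transformer's stage variable
    for _ in range(count):
        digits = f"{counter:09d}"
        ssn = f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"
        # Deriving the name from the SSN makes the two easy to match in tests.
        rows.append({"SSN": ssn, "EmpName": f"EMP_{digits}"})
        counter += 1
    return rows


for row in generate_rows(3):
    print(row["SSN"], row["EmpName"])
```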
ketfos
Participant
Posts: 562
Joined: Mon May 03, 2004 8:58 pm
Location: san francisco

Post by ketfos »

Hi,
You could use a simple Ereplace function in the transformer to replace occurrences of one string with another.

Thks
Ketfos
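
The replace-based approach amounts to a lookup of known real values and their stand-ins. A minimal sketch of that idea in Python (not the Ereplace call itself) is below; `str.replace` stands in for the substring replacement, and the name pairs and the function name are invented for the example.

```python
def replace_names(value, replacements):
    """Replace each known real substring with its stand-in."""
    for real, fake in replacements.items():
        value = value.replace(real, fake)
    return value


# Hypothetical mapping of real names to test names.
name_map = {"John Smith": "Alan Jones", "Mary Brown": "Carol White"}

print(replace_names("John Smith", name_map))    # Alan Jones
print(replace_names("Peter Green", name_map))   # Peter Green (no mapping, unchanged)
```

This only masks values that appear in the mapping, so it suits a small, known set of names rather than an entire production table.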