Data Scrubbing in ETL

A forum for discussing DataStage<sup>®</sup> basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

synsog
Premium Member
Posts: 232
Joined: Sun Aug 01, 2010 11:01 pm
Location: Pune

Data Scrubbing in ETL

Post by synsog »

Hi everyone,
I am looking to create an automated process that brings real-time production data into the QA environment after scrubbing PII (Personally Identifiable Information). The main goal is to help the testing team set up test data in the QA environment.

Is there any function I can use to perform data scrubbing on any column? I am looking for a generic design that can be reused for any project.

It would be great if you could share your ideas on how to perform this task.

Thanks in advance!
synsog
Premium Member
Posts: 232
Joined: Sun Aug 01, 2010 11:01 pm
Location: Pune

Re: Data Scrubbing in ETL

Post by synsog »

Please advise on this post !!!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

PII ?

We have no idea what "data scrubbing" means in your context. You need to begin with a written specification of what you need to achieve.

Without that no-one would be able to help you.

It is also unreasonable to demand answers here, particularly so soon after posting on a weekend. DSXchange is an all-volunteer site whose members post as and when they can.

If you want rapid response, sign up with your official support provider for priority service, and learn just how much "urgent" can cost.


Yes, the DataStage and QualityStage tool can perform various kinds of data cleansing operations but we're not going to waste time guessing what your particular requirements are.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

:idea: First bit of advice is to not use abbreviations unless you are sure it is common enough that everyone will know what it means. Otherwise, it's always prudent to spell it out so there's no confusion or wrong turns made on your road to a solution. Second bit is to have some patience.

PII is something I have to deal with working with Health Care claims and the like but it can apply across a wide range of industries. It stands for "Personally Identifiable Information" and we deal with that and Personal Health Information all the time. So - name, address, social, phone, etc. As noted, as part of this you should have been given a written requirements / specifications document that details both what your identified PII/PHI elements are and what to do with them. There are all kinds of options here for your "scrubbing" and I'm not about to guess what path you need to be on here. For example, we handle:

1. Drop during import, where we simply do not bring it into our warehouse.

2. Redaction where the data is replaced by meaningless crap: stars, x's or perhaps even the word 'REDACTED' much like you'll see in movies where documents have black bars over the naughty bits.

3. Encryption, where the actual data is encrypted using some sort of standard encryption algorithm. The problem is that it changes the data type in the target, and it also means the data can be 'unencrypted' if one has the key, which is not ideal when this really should be a one-way street. Not to mention all the fun it causes the testers or consumers of the data.

4. Masking, where the data is replaced with meaningful but de-identified data. This is typically both the best choice and the hardest to implement, as you need to programmatically replace the actual data with 'random' data of the same type. There can be a lot more to it, but it has an advantage over the others in that the data is still "meaningful" and able to be tested: a name still looks like a name, a phone number like a phone number, etc., but it's neither right nor real.

The last option typically involves professional consulting services; from what I recall, IBM offers both that and a specialized set of routines if you engage them.

Bottom line is I'm not sure how you can build an "automated, works on any column / data type" solution for something like this. But, again as noted, we'd have to know what you mean by "scrubbing" in this context. Then maybe people could provide more cogent advice.
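To make the difference between redaction (option 2) and masking (option 4) concrete, here is a hypothetical Python sketch; this is not DataStage code, and the function names and seed string are invented for illustration. Redaction destroys the format entirely, while deterministic masking keeps the shape of the value so it still "looks like" a phone number.

```python
import hashlib
import random

def redact(value: str) -> str:
    """Option 2: replace every character with 'X'; the format is lost."""
    return "X" * len(value)

def mask_phone(value: str, seed: str = "qa-refresh") -> str:
    """Option 4 (sketch): replace each digit with a pseudo-random digit,
    keeping punctuation so the result still looks like a phone number.
    Seeding the RNG from a keyed hash of the value makes the mapping
    deterministic: the same input always masks to the same output."""
    rng = random.Random(hashlib.sha256((seed + value).encode()).hexdigest())
    return "".join(str(rng.randint(0, 9)) if c.isdigit() else c
                   for c in value)

print(redact("555-867-5309"))      # XXXXXXXXXXXX
print(mask_phone("555-867-5309"))  # same layout, different digits
```

The determinism matters for testing: re-running the scrub over a refreshed extract produces the same masked values, so test cases keyed on the data remain stable.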
Last edited by chulett on Wed Dec 08, 2010 8:09 am, edited 2 times in total.
-craig

"You can never have too many knives" -- Logan Nine Fingers
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'm going to leave this post in the General forum for now, as it applies equally across products but may move it if this goes all PX only on us... stay tuned...
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Please respond on this post!!! We must have our questions answered.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You funny. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
synsog
Premium Member
Posts: 232
Joined: Sun Aug 01, 2010 11:01 pm
Location: Pune

Post by synsog »

ray.wurlod wrote:Please respond on this post!!! We must have our questions answered.
Thanks, Ray, for eagerly awaiting my answers!
I thought someone might have come across this requirement before and would understand it, but that was my mistake; I should have given more information so people can understand.

Terms used:
DW: Data warehouse
DM: Data mart
PII: Personally Identifiable Information

Assumptions:
1. I have three environments, each located on a different server:
a. Development (DW/DM)
b. QA (DW/DM)
c. Production (DW/DM)
2. Each of the above environments has its own database.
3. I have the IBM BDW model.

Requirement:
1. I want to bring production data into the QA environment after masking Personally Identifiable Information such as Social Security Number, Date of Birth, etc.
For requirement 1, my source is Production and my target is QA.
To fulfil it, I need to populate all source tables into the target tables based on extraction criteria that I will apply for each table.
2. I should not have to do this exercise manually each time, so I was thinking of using a scheduling tool that will trigger a job based on the tables whose data I want to load.
3. Since I am going to load production data into QA after masking some of the columns to avoid security violations, I also need to think about all the Statekey files from source to target. (This can be done through a Unix script.)

Questions:
------------
Since I have already created a generic framework for my requirements, I need to know only one thing: is there any function/routine I can use to mask PII data?
Before loading my target, I will have criteria specifying that, say, Column1, Column3, and Column4 need to be masked with some default values.

Please share your thoughts if you have come across any function that will satisfy my need.

I appreciate your time.
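The criteria-driven approach described above, where specific columns are flagged for masking, could be sketched roughly as follows. This is a hypothetical Python illustration only; the column names and masking rules are invented, and a real DataStage implementation would use its own routines or a Transformer stage.

```python
# Hypothetical per-column masking rules; every name here is made up.
# Columns without a rule pass through unchanged.
MASK_RULES = {
    "ssn":  lambda v: "XXX-XX-" + v[-4:],   # redact all but the last four
    "dob":  lambda v: v[:4] + "-01-01",     # keep the birth year only
    "name": lambda v: "TEST_USER",          # replace with a default value
}

def scrub_row(row: dict) -> dict:
    """Return a copy of the row with every configured PII column masked."""
    return {col: MASK_RULES.get(col, lambda v: v)(val)
            for col, val in row.items()}

row = {"id": 42, "ssn": "123-45-6789", "dob": "1980-07-15", "name": "John Smith"}
print(scrub_row(row))
# {'id': 42, 'ssn': 'XXX-XX-6789', 'dob': '1980-01-01', 'name': 'TEST_USER'}
```

Driving the rule table from a configuration file rather than code is what would make the framework generic across projects: the same scrub job runs for any table, with only the rule entries changing.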
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

My initial thought is that the Optim product from IBM does all this and more.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
synsog
Premium Member
Posts: 232
Joined: Sun Aug 01, 2010 11:01 pm
Location: Pune

Post by synsog »

ray.wurlod wrote:My initial thought is that the Optim product from IBM does all this and more.
Thanks, Ray, for the quick turnaround!

I am in the process of getting the Optim tool, but that is my last option. If there is any way I can do the same thing with DataStage, then I don't see the need to invest a lot of money in a tool.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You'll probably spend more writing and testing your own data masking routines, test data management systems (you need to preserve relationships between tables) and control systems than you would to purchase Optim.
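The point about preserving relationships between tables is the hard part of a home-grown solution. One common technique is deterministic, keyed-hash masking, so the same key always masks to the same surrogate in every table and joins still line up. A hypothetical Python sketch (the secret, prefix, and table shapes are invented for illustration):

```python
import hashlib

def mask_key(value: str, secret: str = "site-wide-secret") -> str:
    """Deterministically map a key to a masked surrogate. A keyed one-way
    hash means the same customer number always yields the same masked
    value, so foreign-key joins survive masking, but the original cannot
    be recovered without the secret."""
    digest = hashlib.sha256((secret + value).encode()).hexdigest()
    return "CUST" + digest[:8].upper()

# The same key masks identically in both tables, so the join survives.
customers = [{"cust_no": "10001", "name": "REDACTED"}]
orders    = [{"order_no": "A1", "cust_no": "10001"}]
for table in (customers, orders):
    for r in table:
        r["cust_no"] = mask_key(r["cust_no"])

print(customers[0]["cust_no"] == orders[0]["cust_no"])  # True
```

Even so, this only covers key consistency; a full test-data-management system also has to handle referential subsetting, non-key PII, and restartability, which is the scope Optim already covers.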
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
synsog
Premium Member
Posts: 232
Joined: Sun Aug 01, 2010 11:01 pm
Location: Pune

Post by synsog »

ray.wurlod wrote:You'll probably spend more writing and testing your own data masking routines, test data management systems (you need to preserve relationships between tables) and control systems than you would to purchase Optim.
Thanks, Ray, for your valuable time on this post.
I will definitely work on this and will try to create a generic framework for all projects.