Cleanse/Confirm Question ?

kaps · Post by **kaps** » Mon Feb 27, 2006 9:06 pm

Guys

I would appreciate your views to implement this logic. Basically I am trying to come up with a process to handle cleansing and confirming during the ETL. I am thinking to have a DB2 table(as our database is DB2) to store our cleanse and confirm rules and do a lookup to this table in DataStage to get the cleansed or confirmed value for given input value.

Question is how do I implement this ? I can't do a simple lookup to this table as my one record can have different fields(columns) which need to be cleansed and all of those fields need to do look up to this table to get the respective value.

Example: I have a record like field1, field2, field3, field4, field5 and field6 in which field3, field4 and field5 need to be cleansed and I have my confirm rules table which has fields like c1,c2,c3,c4. I need to do a lookup to confirm rules table for field3, field4 and field5 and get c4 as return value. Return value will differ based on one of the lookup columns to the table.

In a nutshell, I want to have a function kind of thing which accept inputs and fetch one value from the Table.

Thanks

kcbland · Post by **kcbland** » Mon Feb 27, 2006 10:16 pm

You're on the right track. My company wrote a data cleansing function library to use an external table of data quality rules that is called from DS. It's a lot like you've suggested. The basic idea I would recommend is to not call a function on every column in every row, but rather develop a stand-alone process to cleanse and validate your data in a data file.

rasi · Post by **rasi** » Mon Feb 27, 2006 10:49 pm

Calling a function for each of your column and every row will slow down your entire process. Now days tools are available and improved alot.

kaps · Post by **kaps** » Tue Feb 28, 2006 8:58 pm

Thanks for the input. I am still not clear about how I implement this logic.

I have done similar thing in Informatica easily using a transformation called un-connected lookup which will get all the record from database to cache when the first fetch is executed. so subsequent get to that lookup will be faster. I can just pass values to it and get an output from it.

I am looking for similar functionality in DataStage. Can anyone tell me how I can accomplish the above ?

Thanks

rasi · Post by **rasi** » Tue Feb 28, 2006 9:44 pm

This functionality is well handled in Datastage by using hash file and having the pre-load memory option enabled

kaps · Post by **kaps** » Wed Mar 01, 2006 9:22 am

No. I don't think hash file is capable of doing this functionality. If I have to use hash file and if my file has 10 column in a record which need to be cleansed then I need 10 hash file stages from my transformer. Am I correct ?
Please clarify my if you think I am wrong.

I_Server_Whale · Post by **I_Server_Whale** » Wed Mar 01, 2006 10:07 am

Hi kaps,

It is HASHED file and NOT hash file. Could you please explain a bit more elaborately on the functionality that you are trying to achive.

I think rasi was answering about the increasing the performance of a hashed file look-up in DataStage rather than on the original question raised by you.

I did not quite get your original post. May be, you could try to put it in a different way or give a different example.

Thanks,
Naveen.

kaps · Post by **kaps** » Wed Mar 01, 2006 12:38 pm

Naveen

Thanks. Well...Let me try to put in different way.

Source Record
--------------
SField1
SField2(Source ID)
SField3
SField4
SField5
SField6

RuleTable(DataBase)
---------
RField1(Source ID)
RField3(Source Value)
RField4(Cleanse Value)

Job Design
----------
SeqFile(Source)--->Transformer--->Target

SField3, SField4 and SField5 of my Source Record has corresponding Conform rules defined in Rule Table.

In my Transformer Stage, When I get my first records from Source, I want to do a lookup to Rule table passing SField2 and Field3 and want to get the RField4 as return value. And then Pass SField2 and Field 3 to rule table and Get RField4 as my return value and so on...

How Do I do this in DataStage ?

By the way, Just curious to know the difference between Hash File and Hashed File ?

Let me know if I can explain it still better...

Thanks

I_Server_Whale · Post by **I_Server_Whale** » Wed Mar 01, 2006 1:13 pm

Hi kaps,

You job design should be something like this:

Code: Select all

 
            Hashed File(Rule Table loaded into HF)                        
                                |
                                |
                                |
Source----------->Transformer-------------->Target Table or File.

Your rule table must be loaded into hashed file with RField1(Source ID), RField3 (Source Value) as keys in the Hashed File.

Join the Source and the Hashed File based on the keys. And when the corresponding key from the source is found in the hashed file, then the derivation of Field4 on the target would be mapped to Field4 of the hashed file.

When you join the keys in the transformer, make sure that the values coming from the source are 'trimmed'. That is, trim(SField2) is joined with RField1 and trim(SField3) is joined to RField3 and so on....

The dictionary can better explain you the difference between Hash and Hashed. As far as I know 'hash' is a dish but 'hashed' conveys the actual meaning which is 'chopped', the algorithm used by hashed files. Data is stored hashed file by chopping the data and storing them in buckets(groups). This is well discussed in previous posts also.

If you have any questions. Please don't hesitate.

Thanks,
Naveen.

narasimha · Post by **narasimha** » Wed Mar 01, 2006 1:54 pm

Naveen I like your analogy "dish" ,"chopped"

kaps · Post by **kaps** » Wed Mar 01, 2006 2:37 pm

Thanks Naveen...

BUT the Problem is, when I join the source and hashed file based on keys I agree TField4 will get RField4 but How do I get TField5 which should be again RField4 (and value of this RField4 will be different then the previous RField4 Because now my source value is SField5 which was SField4 last time)....did you get it ? Please let me know...

Kaps

I_Server_Whale · Post by **I_Server_Whale** » Wed Mar 01, 2006 3:14 pm

Hi kaps,

I think I understood what you are trying to convey. But it would MUCH better if you could give some sample records in source, lookup and the target.

That would put me in a better position to give a better solution.

Thanks,
Naveen.

kaps · Post by **kaps** » Wed Mar 01, 2006 4:31 pm

Naveen-

Here you go...

[code]Source Record (File)
--------------------
Source ID Gender State Cntry_Code
-----------------------------------------------------------
SRC1 Male Nebraska United States

RuleTable(DataBase)
--------------------
Source ID Source Value Cleansed Value
-------------------------------------------------------
SRC1 Male M
SRC1 Nebraska NE
SRC1 United States USA

Target
------
Source ID Gender State Cntry_Code
-----------------------------------------------------------
SRC1 M NE USA[/code]

For Gender, I need to match my rule table (database or hash file) by SRC1 and Male and should get M as return value. Again I need to match the rule table for SRC1 and Nebraska and get NE as my return value.Again I need to match the rule table for SRC1 and United States and get USA as my return value.

I can do this if I hardcode it in transformer stage. but I don't want to do that. How do I do this by looking up a table ?

Thanks

gateleys · Post by **gateleys** » Wed Mar 01, 2006 4:43 pm

kaps wrote:Naveen-

Here you go...

Code: Select all

Source Record (File)
-------------------- 
Source ID	Gender		State		Cntry_Code	
-----------------------------------------------------------
SRC1		Male		Nebraska	United States


RuleTable(DataBase) 
-------------------- 
Source ID	Source Value		Cleansed Value		
-------------------------------------------------------
SRC1		Male			M
SRC1		Nebraska		NE
SRC1		United States		USA


Target
------
Source ID	Gender		State		Cntry_Code	
-----------------------------------------------------------
SRC1		M		NE		USA

Thanks

A simple way would be to have 3 hashed files, each with two columns, both of which are keys. example hfile1(SourceID, Gender), hfile2(SourceID, State), hfile3(SourceID, Cntry_Code). Next, do a lookup of your source rows against these hashed files.

gateleys

kaps · Post by **kaps** » Wed Mar 01, 2006 9:14 pm

thats just an example I was describing...I can't create a table for every field which need to be cleansed.