Cleanse/Confirm Question ?

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Cleanse/Confirm Question ?

Post by kaps »

Guys

I would appreciate your views on implementing this logic. Basically, I am trying to come up with a process to handle cleansing and conforming during the ETL. I am thinking of having a DB2 table (our database is DB2) to store our cleanse and conform rules, and doing a lookup against this table in DataStage to get the cleansed or conformed value for a given input value.

The question is how to implement this. I can't do a simple lookup against this table, as a single record can have several fields (columns) that need to be cleansed, and each of those fields needs its own lookup against the table to get its respective value.

Example: I have a record with field1 through field6, of which field3, field4 and field5 need to be cleansed, and a conform rules table with columns c1, c2, c3 and c4. I need to look up the rules table for each of field3, field4 and field5 and get c4 as the return value. The return value will differ based on one of the lookup columns to the table.

In a nutshell, I want something like a function that accepts inputs and fetches one value from the table.
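Conceptually, that "function kind of thing" might look like the sketch below. This is a hypothetical illustration in Python, not DataStage code; the rule table is loaded into a map keyed on (source id, source value), and all names and data are invented:

```python
# Hypothetical sketch: the rule table, keyed on (source id, source value),
# consulted through a single function. Rows here are invented examples.
rule_table = {
    ("SRC1", "Male"): "M",
    ("SRC1", "Nebraska"): "NE",
}

def cleanse(source_id, value):
    """Return the cleansed value, or the input unchanged if no rule matches."""
    return rule_table.get((source_id, value), value)
```

The same function serves every column that needs cleansing, which is the "one lookup, many fields" behaviour being asked for.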

Thanks
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You're on the right track. My company wrote a data cleansing function library, driven by an external table of data quality rules, that is called from DS. It's a lot like what you've suggested. The basic approach I would recommend is not to call a function on every column of every row, but rather to develop a stand-alone process that cleanses and validates your data in a data file.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

Calling a function for each of your columns on every row will slow down your entire process. Nowadays the tools available have improved a lot.
Regards
Siva

Listening to the Learned

"The most precious wealth is the wealth acquired by the ear Indeed, of all wealth that wealth is the crown." - Thirukural By Thiruvalluvar
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Thanks for the input. I am still not clear on how to implement this logic.

I have done a similar thing easily in Informatica using a transformation called an unconnected lookup, which caches all the records from the database when the first fetch is executed, so subsequent calls to that lookup are faster. I can just pass values to it and get an output back.

I am looking for similar functionality in DataStage. Can anyone tell me how I can accomplish the above ?
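For comparison, the cache-on-first-fetch behaviour of an unconnected lookup could be sketched like this. This is an illustrative Python sketch, not Informatica or DataStage code; `fetch_rules()` is a hypothetical stand-in for the DB2 query, and the data is invented:

```python
# Sketch of "load everything on first fetch, serve from cache afterwards".
_cache = None

def fetch_rules():
    # Hypothetical placeholder for something like
    # "SELECT source_id, source_value, cleansed_value FROM rules".
    return [("SRC1", "Male", "M"), ("SRC1", "Nebraska", "NE")]

def lookup(source_id, value):
    global _cache
    if _cache is None:  # first call: pull the whole table into memory
        _cache = {(sid, sv): cv for sid, sv, cv in fetch_rules()}
    return _cache.get((source_id, value))  # later calls: memory only
```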

Thanks
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

This functionality is well handled in DataStage by using a hashed file with the pre-load-to-memory option enabled.
Regards
Siva

kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

No, I don't think a hash file is capable of this functionality. If I have to use a hash file and my record has 10 columns that need to be cleansed, then I need 10 hash file stages off my transformer. Am I correct?
Please correct me if you think I am wrong.
I_Server_Whale
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

Hi kaps,

It is HASHED file and NOT hash file. Could you please explain in a bit more detail the functionality you are trying to achieve?

I think rasi was addressing how to increase the performance of a hashed-file lookup in DataStage rather than your original question.

I did not quite get your original post. Maybe you could try to put it a different way or give a different example.

Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Naveen

Thanks. Well... let me try to put it a different way.

Source Record
--------------
SField1
SField2(Source ID)
SField3
SField4
SField5
SField6

RuleTable(DataBase)
---------
RField1(Source ID)
RField3(Source Value)
RField4(Cleanse Value)


Job Design
----------
SeqFile(Source)--->Transformer--->Target


SField3, SField4 and SField5 of my source record have corresponding conform rules defined in the rule table.

In my Transformer stage, when I get my first record from the source, I want to do a lookup against the rule table, passing SField2 and SField3, and get RField4 back as the return value. Then I pass SField2 and SField4 to the rule table and again get RField4 as my return value, and so on...

How do I do this in DataStage?

By the way, just curious: what is the difference between a hash file and a hashed file?

Let me know if I can explain it any better...

Thanks
I_Server_Whale
Premium Member
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

Hi kaps,

Your job design should be something like this:

Code:

 
            Hashed File(Rule Table loaded into HF)                        
                                |
                                |
                                |
Source----------->Transformer-------------->Target Table or File.



Your rule table must be loaded into a hashed file with RField1 (Source ID) and RField3 (Source Value) as the keys.

Join the source and the hashed file on those keys. When the corresponding key from the source is found in the hashed file, the derivation of Field4 on the target is mapped to RField4 of the hashed file.

When you join the keys in the transformer, make sure the values coming from the source are trimmed. That is, trim(SField2) is joined to RField1, trim(SField3) is joined to RField3, and so on...
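The trimmed-key matching described above could be sketched as follows. This is an illustrative Python sketch of the idea, not DataStage code; the hashed-file contents are invented:

```python
# The "hashed file" as an in-memory map: (RField1, RField3) -> RField4.
hashed_file = {("SRC1", "Male"): "M"}

def lookup_trimmed(sfield2, sfield3):
    """Trim both source values before building the lookup key."""
    key = (sfield2.strip(), sfield3.strip())
    return hashed_file.get(key)  # None when no rule matches
```

Trimming matters because a stray leading or trailing space in the source would otherwise produce a key that never matches the rule table.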

The dictionary can explain the difference between 'hash' and 'hashed' better than I can. As far as I know, 'hash' is a dish, whereas 'hashed' conveys the actual meaning, 'chopped', which is what hashed files do: data is stored in a hashed file by chopping it up and storing it in buckets (groups). This is well discussed in previous posts too.

If you have any questions, please don't hesitate to ask.

Thanks,
Naveen.
narasimha
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

Naveen, I like your analogy: "dish", "chopped" :wink:
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Thanks Naveen...

BUT the problem is: when I join the source and the hashed file on the keys, I agree TField4 will get RField4, but how do I get TField5, which should again come from RField4? (The value of this RField4 will be different from the previous one, because now my source value is SField5 where last time it was SField4.) Do you see what I mean? Please let me know...

Kaps
I_Server_Whale
Premium Member
Posts: 1255
Joined: Wed Feb 02, 2005 11:54 am
Location: United States of America

Post by I_Server_Whale »

Hi kaps,

I think I understand what you are trying to convey, but it would be MUCH better if you could give some sample records for the source, the lookup and the target.

That would put me in a better position to suggest a solution.

Thanks,
Naveen.
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Naveen-

Here you go...


[code]Source Record (File)
--------------------
Source ID   Gender   State      Cntry_Code
--------------------------------------------
SRC1        Male     Nebraska   United States


RuleTable(DataBase)
-------------------
Source ID   Source Value    Cleansed Value
------------------------------------------
SRC1        Male            M
SRC1        Nebraska        NE
SRC1        United States   USA


Target
------
Source ID   Gender   State   Cntry_Code
---------------------------------------
SRC1        M        NE      USA[/code]

For Gender, I need to match my rule table (database or hash file) on SRC1 and Male, and should get M as the return value. Then I need to match the rule table on SRC1 and Nebraska and get NE as my return value, and again on SRC1 and United States and get USA as my return value.

I can do this if I hardcode it in the Transformer stage, but I don't want to do that. How do I do this by looking up a table?
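In other words, one generic rule table serves every field, with one lookup per field per row. A minimal Python sketch of that behaviour, using the sample data above (the function and field names are invented for illustration):

```python
# One generic rule table: (source id, source value) -> cleansed value.
rules = {
    ("SRC1", "Male"): "M",
    ("SRC1", "Nebraska"): "NE",
    ("SRC1", "United States"): "USA",
}

def cleanse_row(row, fields):
    """Replace each named field via the rule table; pass others through."""
    out = dict(row)
    for f in fields:
        out[f] = rules.get((row["source_id"], row[f]), row[f])
    return out

source = {"source_id": "SRC1", "gender": "Male",
          "state": "Nebraska", "cntry_code": "United States"}
target = cleanse_row(source, ["gender", "state", "cntry_code"])
# target -> {"source_id": "SRC1", "gender": "M", "state": "NE",
#            "cntry_code": "USA"}
```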

Thanks
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

kaps wrote:Naveen-

Here you go...

Code:

Source Record (File)
-------------------- 
Source ID	Gender		State		Cntry_Code	
-----------------------------------------------------------
SRC1		Male		Nebraska	United States


RuleTable(DataBase) 
-------------------- 
Source ID	Source Value		Cleansed Value		
-------------------------------------------------------
SRC1		Male			M
SRC1		Nebraska		NE
SRC1		United States		USA


Target
------
Source ID	Gender		State		Cntry_Code	
-----------------------------------------------------------
SRC1		M		NE		USA
Thanks
A simple way would be to have 3 hashed files, each with two columns, both of which are keys: for example, hfile1(SourceID, Gender), hfile2(SourceID, State), hfile3(SourceID, Cntry_Code). Then do a lookup of your source rows against these hashed files.
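That per-field variant could be sketched like this. An assumption here: each file is taken to also carry the cleansed value alongside its two key columns, since a lookup must return something; the names and data are invented for illustration:

```python
# One small lookup structure per field to cleanse, keyed on
# (SourceID, source value). The cleansed value is assumed to be
# stored as a non-key third column.
hfile_gender = {("SRC1", "Male"): "M"}
hfile_state  = {("SRC1", "Nebraska"): "NE"}
hfile_cntry  = {("SRC1", "United States"): "USA"}

def lookup(hfile, source_id, value):
    # Missing keys pass the original value through unchanged.
    return hfile.get((source_id, value), value)
```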

gateleys
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

That's just an example I was describing... I can't create a table for every field that needs to be cleansed.
Post Reply