Cleanse/Confirm Question ?
Moderators: chulett, rschirm, roy
Cleanse/Confirm Question ?
Guys
I would appreciate your views to implement this logic. Basically I am trying to come up with a process to handle cleansing and confirming during the ETL. I am thinking to have a DB2 table(as our database is DB2) to store our cleanse and confirm rules and do a lookup to this table in DataStage to get the cleansed or confirmed value for given input value.
Question is how do I implement this ? I can't do a simple lookup to this table as my one record can have different fields(columns) which need to be cleansed and all of those fields need to do look up to this table to get the respective value.
Example: I have a record like field1, field2, field3, field4, field5 and field6 in which field3, field4 and field5 need to be cleansed and I have my confirm rules table which has fields like c1,c2,c3,c4. I need to do a lookup to confirm rules table for field3, field4 and field5 and get c4 as return value. Return value will differ based on one of the lookup columns to the table.
In a nutshell, I want to have a function kind of thing which accept inputs and fetch one value from the Table.
Thanks
I would appreciate your views to implement this logic. Basically I am trying to come up with a process to handle cleansing and confirming during the ETL. I am thinking to have a DB2 table(as our database is DB2) to store our cleanse and confirm rules and do a lookup to this table in DataStage to get the cleansed or confirmed value for given input value.
Question is how do I implement this ? I can't do a simple lookup to this table as my one record can have different fields(columns) which need to be cleansed and all of those fields need to do look up to this table to get the respective value.
Example: I have a record like field1, field2, field3, field4, field5 and field6 in which field3, field4 and field5 need to be cleansed and I have my confirm rules table which has fields like c1,c2,c3,c4. I need to do a lookup to confirm rules table for field3, field4 and field5 and get c4 as return value. Return value will differ based on one of the lookup columns to the table.
In a nutshell, I want to have a function kind of thing which accept inputs and fetch one value from the Table.
Thanks
You're on the right track. My company wrote a data cleansing function library to use an external table of data quality rules that is called from DS. It's a lot like you've suggested. The basic idea I would recommend is to not call a function on every column in every row, but rather develop a stand-alone process to cleanse and validate your data in a data file.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Calling a function for each of your column and every row will slow down your entire process. Now days tools are available and improved alot.
Regards
Siva
Listening to the Learned
"The most precious wealth is the wealth acquired by the ear Indeed, of all wealth that wealth is the crown." - Thirukural By Thiruvalluvar
Siva
Listening to the Learned
"The most precious wealth is the wealth acquired by the ear Indeed, of all wealth that wealth is the crown." - Thirukural By Thiruvalluvar
Thanks for the input. I am still not clear about how I implement this logic.
I have done similar thing in Informatica easily using a transformation called un-connected lookup which will get all the record from database to cache when the first fetch is executed. so subsequent get to that lookup will be faster. I can just pass values to it and get an output from it.
I am looking for similar functionality in DataStage. Can anyone tell me how I can accomplish the above ?
Thanks
I have done similar thing in Informatica easily using a transformation called un-connected lookup which will get all the record from database to cache when the first fetch is executed. so subsequent get to that lookup will be faster. I can just pass values to it and get an output from it.
I am looking for similar functionality in DataStage. Can anyone tell me how I can accomplish the above ?
Thanks
-
- Premium Member
- Posts: 1255
- Joined: Wed Feb 02, 2005 11:54 am
- Location: United States of America
Hi kaps,
It is HASHED file and NOT hash file. Could you please explain a bit more elaborately on the functionality that you are trying to achive.
I think rasi was answering about the increasing the performance of a hashed file look-up in DataStage rather than on the original question raised by you.
I did not quite get your original post. May be, you could try to put it in a different way or give a different example.
Thanks,
Naveen.
It is HASHED file and NOT hash file. Could you please explain a bit more elaborately on the functionality that you are trying to achive.
I think rasi was answering about the increasing the performance of a hashed file look-up in DataStage rather than on the original question raised by you.
I did not quite get your original post. May be, you could try to put it in a different way or give a different example.
Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Naveen
Thanks. Well...Let me try to put in different way.
Source Record
--------------
SField1
SField2(Source ID)
SField3
SField4
SField5
SField6
RuleTable(DataBase)
---------
RField1(Source ID)
RField3(Source Value)
RField4(Cleanse Value)
Job Design
----------
SeqFile(Source)--->Transformer--->Target
SField3, SField4 and SField5 of my Source Record has corresponding Conform rules defined in Rule Table.
In my Transformer Stage, When I get my first records from Source, I want to do a lookup to Rule table passing SField2 and Field3 and want to get the RField4 as return value. And then Pass SField2 and Field 3 to rule table and Get RField4 as my return value and so on...
How Do I do this in DataStage ?
By the way, Just curious to know the difference between Hash File and Hashed File ?
Let me know if I can explain it still better...
Thanks
Thanks. Well...Let me try to put in different way.
Source Record
--------------
SField1
SField2(Source ID)
SField3
SField4
SField5
SField6
RuleTable(DataBase)
---------
RField1(Source ID)
RField3(Source Value)
RField4(Cleanse Value)
Job Design
----------
SeqFile(Source)--->Transformer--->Target
SField3, SField4 and SField5 of my Source Record has corresponding Conform rules defined in Rule Table.
In my Transformer Stage, When I get my first records from Source, I want to do a lookup to Rule table passing SField2 and Field3 and want to get the RField4 as return value. And then Pass SField2 and Field 3 to rule table and Get RField4 as my return value and so on...
How Do I do this in DataStage ?
By the way, Just curious to know the difference between Hash File and Hashed File ?
Let me know if I can explain it still better...
Thanks
-
- Premium Member
- Posts: 1255
- Joined: Wed Feb 02, 2005 11:54 am
- Location: United States of America
Hi kaps,
You job design should be something like this:
Your rule table must be loaded into hashed file with RField1(Source ID), RField3 (Source Value) as keys in the Hashed File.
Join the Source and the Hashed File based on the keys. And when the corresponding key from the source is found in the hashed file, then the derivation of Field4 on the target would be mapped to Field4 of the hashed file.
When you join the keys in the transformer, make sure that the values coming from the source are 'trimmed'. That is, trim(SField2) is joined with RField1 and trim(SField3) is joined to RField3 and so on....
The dictionary can better explain you the difference between Hash and Hashed. As far as I know 'hash' is a dish but 'hashed' conveys the actual meaning which is 'chopped', the algorithm used by hashed files. Data is stored hashed file by chopping the data and storing them in buckets(groups). This is well discussed in previous posts also.
If you have any questions. Please don't hesitate.
Thanks,
Naveen.
You job design should be something like this:
Code: Select all
Hashed File(Rule Table loaded into HF)
|
|
|
Source----------->Transformer-------------->Target Table or File.
Your rule table must be loaded into hashed file with RField1(Source ID), RField3 (Source Value) as keys in the Hashed File.
Join the Source and the Hashed File based on the keys. And when the corresponding key from the source is found in the hashed file, then the derivation of Field4 on the target would be mapped to Field4 of the hashed file.
When you join the keys in the transformer, make sure that the values coming from the source are 'trimmed'. That is, trim(SField2) is joined with RField1 and trim(SField3) is joined to RField3 and so on....
The dictionary can better explain you the difference between Hash and Hashed. As far as I know 'hash' is a dish but 'hashed' conveys the actual meaning which is 'chopped', the algorithm used by hashed files. Data is stored hashed file by chopping the data and storing them in buckets(groups). This is well discussed in previous posts also.
If you have any questions. Please don't hesitate.
Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Thanks Naveen...
BUT the Problem is, when I join the source and hashed file based on keys I agree TField4 will get RField4 but How do I get TField5 which should be again RField4 (and value of this RField4 will be different then the previous RField4 Because now my source value is SField5 which was SField4 last time)....did you get it ? Please let me know...
Kaps
BUT the Problem is, when I join the source and hashed file based on keys I agree TField4 will get RField4 but How do I get TField5 which should be again RField4 (and value of this RField4 will be different then the previous RField4 Because now my source value is SField5 which was SField4 last time)....did you get it ? Please let me know...
Kaps
-
- Premium Member
- Posts: 1255
- Joined: Wed Feb 02, 2005 11:54 am
- Location: United States of America
Hi kaps,
I think I understood what you are trying to convey. But it would MUCH better if you could give some sample records in source, lookup and the target.
That would put me in a better position to give a better solution.
Thanks,
Naveen.
I think I understood what you are trying to convey. But it would MUCH better if you could give some sample records in source, lookup and the target.
That would put me in a better position to give a better solution.
Thanks,
Naveen.
Anything that won't sell, I don't want to invent. Its sale is proof of utility, and utility is success.
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Author: Thomas A. Edison 1847-1931, American Inventor, Entrepreneur, Founder of GE
Naveen-
Here you go...
[code]Source Record (File)
--------------------
Source ID Gender State Cntry_Code
-----------------------------------------------------------
SRC1 Male Nebraska United States
RuleTable(DataBase)
--------------------
Source ID Source Value Cleansed Value
-------------------------------------------------------
SRC1 Male M
SRC1 Nebraska NE
SRC1 United States USA
Target
------
Source ID Gender State Cntry_Code
-----------------------------------------------------------
SRC1 M NE USA[/code]
For Gender, I need to match my rule table (database or hash file) by SRC1 and Male and should get M as return value. Again I need to match the rule table for SRC1 and Nebraska and get NE as my return value.Again I need to match the rule table for SRC1 and United States and get USA as my return value.
I can do this if I hardcode it in transformer stage. but I don't want to do that. How do I do this by looking up a table ?
Thanks
Here you go...
[code]Source Record (File)
--------------------
Source ID Gender State Cntry_Code
-----------------------------------------------------------
SRC1 Male Nebraska United States
RuleTable(DataBase)
--------------------
Source ID Source Value Cleansed Value
-------------------------------------------------------
SRC1 Male M
SRC1 Nebraska NE
SRC1 United States USA
Target
------
Source ID Gender State Cntry_Code
-----------------------------------------------------------
SRC1 M NE USA[/code]
For Gender, I need to match my rule table (database or hash file) by SRC1 and Male and should get M as return value. Again I need to match the rule table for SRC1 and Nebraska and get NE as my return value.Again I need to match the rule table for SRC1 and United States and get USA as my return value.
I can do this if I hardcode it in transformer stage. but I don't want to do that. How do I do this by looking up a table ?
Thanks
A simple way would be to have 3 hashed files, each with two columns, both of which are keys. example hfile1(SourceID, Gender), hfile2(SourceID, State), hfile3(SourceID, Cntry_Code). Next, do a lookup of your source rows against these hashed files.kaps wrote:Naveen-
Here you go...
ThanksCode: Select all
Source Record (File) -------------------- Source ID Gender State Cntry_Code ----------------------------------------------------------- SRC1 Male Nebraska United States RuleTable(DataBase) -------------------- Source ID Source Value Cleansed Value ------------------------------------------------------- SRC1 Male M SRC1 Nebraska NE SRC1 United States USA Target ------ Source ID Gender State Cntry_Code ----------------------------------------------------------- SRC1 M NE USA
gateleys