Generating Surrogate Keys via hash functions

Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Generating Surrogate Keys via hash functions

Post by Yuan_Edward »

My job loads flat files into the target database (DB1 for short). Some records will have data quality issues, e.g. a mandatory field with no value, but these will be corrected and still loaded into the database. At the same time, these records need to be captured and loaded into a separate data quality database (DB2 for short) for analysis.

In order to track data quality we need to link DB1 to DB2; the idea is to assign an SK to rows which have quality issues.

I am thinking of writing a hash function, e.g. a CRC32 or MD5 checksum, to produce the SKs based on the natural keys instead of the other options, so it can be used in the Transformer stage to generate the SKs for both streams (DB1 and DB2). The main purpose of this is to avoid joining the DB2 stream back to DB1 once the SKs are allocated.
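
Something like this, as a rough Python sketch of the idea (the real job would use a Transformer derivation or a parallel routine; the key values and the "|" separator are just for illustration):

    import hashlib

    def surrogate_key(*natural_key_parts):
        # Join with a separator so that ("AB", "C") and ("A", "BC")
        # do not hash to the same value.
        raw = "|".join(str(p) for p in natural_key_parts)
        return hashlib.md5(raw.encode("utf-8")).hexdigest()

    # The same NK yields the same SK in either stream, with no shared state:
    assert surrogate_key("CUST001", "2005-05-10") == surrogate_key("CUST001", "2005-05-10")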

Hope my description is clear. Just want to get some comments from you guys here before I go ahead.

Thank you.
Edward Yuan
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Though these functions were not exactly meant for that, you can still use them.
Even if you create the SK, you will need to link the databases at least once to fetch the correct values from DB2. If you have a very large column, or more than one column to look up, this approach will give you a benefit.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

:) Yes, CRC/MD5 are not designed for SKs, but they fit this purpose, so I am "stealing" them.

Our project is processing huge volumes of data, so I am trying to avoid that kind of lookup.

And I think hash functions will be generic and easy to use everywhere.
kumar_s wrote:Though these functions were not exactly meant for that, you can still use them.
Even if you create the SK, you will need to link the databases at least once to fetch the correct values from DB2. If you have a very large column, or more than one column to look up, this approach will give you a benefit.
Edward Yuan
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

An SK is just an additional key, apart from the natural key, created for some purpose. So I am not sure how you are going to avoid the lookup if you create an SK. :roll:
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

That's the magic of hash functions.

If I apply the same function in two places (one in the ETL flow loading DB1 and the other in the ETL flow loading DB2), I would expect to get the same SKs for the same NKs in both.
kumar_s wrote:An SK is just an additional key, apart from the natural key, created for some purpose. So I am not sure how you are going to avoid the lookup if you create an SK. :roll: ...
Edward Yuan
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

:lol: Yeah, I understand the magic. I am asking how you will reconcile the data between the two databases without doing a lookup.
It will produce the same result if the input value is the same, but to find the records which are out of sync with the metadata, you need to do either a join or a lookup anyway.
Last edited by kumar_s on Tue Mar 13, 2007 1:14 am, edited 1 time in total.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

Good point. Thanks.

It should be OK as long as the records in DB1 get SKs allocated. I am assuming that records will be loaded into DB2 (the data quality database) without failure; if they do fail, they have to be fixed and loaded into DB2 again until they succeed... which means all exceptions will end up in DB2 anyway.

This is not fully clear to me at the moment and should be considered during the design. Thanks again.
kumar_s wrote::lol: Yeah, I understand the magic. I am asking how you will reconcile the data between the two databases without doing a lookup.
It will produce the same result if the input value is the same. To find the records which are out of sync with the metadata...
Edward Yuan
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

OK, so you are loading DB2 from DB1 for those records where the SK got populated. And the SK is populated only if the records are out of sync with the metadata. Is this the way you are planning it?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

Hi Kumar_s, to give you more details, I have a simplified job below:

sequential file => Transformer => some additional logic 1 => load into target database (DB 1)
                        ||
                        \/
                some additional logic 2
                        ||
                        \/
       load into data quality database (DB 2)


Suppose record #1 from the sequential file has a quality issue. It is detected in the Transformer stage; the corrected record goes to logic 1 and is then loaded into DB 1, while the original record goes to logic 2 and is loaded into DB 2.

The record in DB 1 should have a data quality SK, which is an FK pointing to the same record in DB 2. The normal way to do this is to allocate the SKs in the logic 2 flow first and then do a lookup/join in the logic 1 flow to get the SKs for DB 1.

To avoid this kind of lookup/join, my idea is to use a hash function in logic 1 and logic 2 respectively to get the same SKs in both; the input for the hash function is the NKs, as in the sketch below. (You could ask why I don't generate the SKs for both in the same Transformer stage. The reason is that there are multiple places in the ETL flow that could detect data quality issues, e.g. one for metadata, another for business rules, etc.)
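
Here is a rough Python sketch of what I mean (the record layout and the dq_sk column name are made up; the real logic would live in the Transformer derivations):

    import hashlib

    def dq_sk(natural_key):
        # The same derivation is used in both flows; no shared state needed.
        return hashlib.md5(natural_key.encode("utf-8")).hexdigest()

    record = {"nk": "CUST001", "mandatory_field": None}   # record #1

    # Logic 1: correct the record and load it into DB 1 with the SK as an FK.
    db1_row = dict(record, mandatory_field="UNKNOWN", dq_sk=dq_sk(record["nk"]))

    # Logic 2: load the original record into DB 2 under the same SK.
    db2_row = dict(record, dq_sk=dq_sk(record["nk"]))

    assert db1_row["dq_sk"] == db2_row["dq_sk"]   # the FK lines up without a join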

Record #2 has no quality issues and is loaded into DB 1 only, so it has no data quality SK in DB 1.
kumar_s wrote:OK, so you are loading DB2 from DB1 for those records where the SK got populated. And the SK is populated only if the records are out of sync with the metadata. Is this the way you are planning it? ...
Edward Yuan
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Sounds reasonable. So those records which don't have any data quality issues will not be populated with an SK.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What's wrong with using the Change Capture, Compare or Difference stage to detect changes, and the Surrogate Key Generator stage to generate surrogate keys?!!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Here the OP is generating them for further analysis. Those records which don't have the mandatory (lookup) key will be given an SK generated from some other existing column, and the main data can then be looked up based on the generated SK.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I know no-one has asked for it, but I thought I would throw in my two cents. I've said this before and I'll say it again. :wink:

I would not consider any kind of 'hashing' algorithm suitable for surrogate keys. Period. If forced to choose between your two options, I'd go MD5 over CRC32, or would push for one of the 'newer' ones whose names escape me at this hour of the morning. :?

There are plenty of references to this inappropriateness out in the wild. This is one that took all of two seconds to find; I'm sure there are 'better' ones.
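
To put rough numbers on that, here's a back-of-the-envelope birthday bound (my own Python sketch, assuming uniformly distributed hash values):

    import math

    def collision_probability(n_rows, hash_bits):
        # Birthday approximation: P ~= 1 - exp(-n(n-1) / 2^(bits+1))
        return 1.0 - math.exp(-n_rows * (n_rows - 1) / 2.0 / 2.0 ** hash_bits)

    print(collision_probability(100_000, 32))    # CRC32: ~0.69 -- a clash is likely
    print(collision_probability(100_000, 128))   # MD5:   ~1.5e-29 -- negligible

With "huge volumes" of rows, CRC32 would almost certainly hand the same key to two different natural keys; MD5's odds are tiny, but still not the zero you get from a plain counter.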

IMHO, stick with a 'traditional' surrogate key: a big incremented integer. Assign or derive absolutely no intelligence to or from it. And depending on how you handle the generation, you may not even need the trip back to the database that you seem to want to avoid. At worst, it's one trip to 'seed' the correct starting point at the beginning, and then possibly one at the end to record the last one used.
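
A minimal sketch of that approach, assuming the database gives you some way to read and persist the last key used (fetch_last_key and save_last_key below are hypothetical stand-ins):

    def fetch_last_key():
        # e.g. SELECT MAX(sk) FROM the target table -- stubbed out here
        return 1000

    def save_last_key(last_key):
        # persist the high-water mark for the next run -- stubbed out here
        pass

    def allocate_keys(rows, seed):
        # Pure in-memory increments: no database round trip per row.
        for offset, row in enumerate(rows, start=1):
            yield dict(row, sk=seed + offset)

    rows = [{"nk": "CUST001"}, {"nk": "CUST002"}]
    seed = fetch_last_key()                  # one trip to seed the starting point
    keyed = list(allocate_keys(rows, seed))
    save_last_key(seed + len(keyed))         # one trip to record the last one used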
-craig

"You can never have too many knives" -- Logan Nine Fingers
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Likewise, Craig. Stick to generating a running number.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

Hi DSguru, why do you think running numbers are a better choice for SKs? I know typical hash functions return a fixed-length string. I haven't decided which function to use, and I am not sure whether a hash function can return numbers.
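
For what it's worth, one way I could imagine getting a number out of MD5 is to take part of the digest as an integer (a rough Python sketch; truncating to 8 bytes is my own assumption so the value fits a signed 64-bit column):

    import hashlib

    def numeric_sk(natural_key):
        digest = hashlib.md5(natural_key.encode("utf-8")).digest()
        # First 8 bytes as a non-negative integer, so it fits a signed
        # BIGINT. Truncating the digest raises the collision odds a
        # little, but they stay tiny at 63 bits.
        return int.from_bytes(digest[:8], "big") >> 1

    print(numeric_sk("CUST001"))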
DSguru2B wrote:Likewise, Craig. Stick to generating a running number.
Edward Yuan