Generating Surrogate Keys via hash functions

Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Generating Surrogate Keys via hash functions

Post by Yuan_Edward »

My job loads flat files into the target database (DB1 for short). Some records will have data quality issues, e.g. a mandatory field with no value, but these will be corrected and still loaded into the database. At the same time, these records need to be captured and loaded into a separate data quality database (DB2 for short) for analysis.

In order to track data quality we need to link DB1 to DB2; the idea is to assign an SK to rows which have quality issues.

I am thinking of writing a hash function, e.g. a CRC32 or MD5 checksum, to produce the SKs based on the natural keys instead of the other options, so it can be used in the Transformer stage to generate the SKs for both streams (DB1 and DB2). The main purpose of this is to avoid joining the DB2 stream back to DB1 once the SKs are allocated.
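
Something like this, as a rough Python sketch of the idea (the real job would use a Transformer derivation or a parallel routine; the key values and the "|" separator are just for illustration):

    import hashlib

    def surrogate_key(*natural_key_parts):
        # Join with a separator so that ("AB", "C") and ("A", "BC")
        # do not hash to the same value.
        raw = "|".join(str(p) for p in natural_key_parts)
        return hashlib.md5(raw.encode("utf-8")).hexdigest()

    # The same NK yields the same SK in either stream, with no shared state:
    assert surrogate_key("CUST001", "2005-05-10") == surrogate_key("CUST001", "2005-05-10")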

Hope my description is clear. Just want to get some comments from you guys here before I go ahead.

Thank you.
Edward Yuan
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Though these functions were not exactly meant for that, you can still use them.
Even if you create the SK, you will need to link the databases at least once to fetch the correct values from DB2. If you have a very large column, or more than one column to look up, this approach will give you a benefit.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

:) Yes, CRC/MD5 are not designed for SKs, but they fit this purpose, so I am "stealing" them.

Our project is processing huge volumes of data, so I am trying to avoid that kind of lookup.

And I think hash functions will be generic and easy to use everywhere.
kumar_s wrote:Though these functions were not exactly meant for that, you can still use them.
Even if you create the SK, you will need to link the databases at least once to fetch the correct values from DB2. If you have a very large column, or more than one column to look up, this approach will give you a benefit.
Edward Yuan
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

An SK is just an additional key, apart from the natural key, created for some purpose. So I am not sure how you are going to avoid the lookup if you create an SK. :roll:
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

That's the magic of hash functions.

If I apply the same function in two places (one in the ETL flow loading DB1 and the other in the ETL flow loading DB2), I would expect to get the same SKs for the same NKs in both.
kumar_s wrote:An SK is just an additional key, apart from the natural key, created for some purpose. So I am not sure how you are going to avoid the lookup if you create an SK. :roll: ...
Edward Yuan
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

:lol: Yeah, I understand the magic. I am asking how you will reconcile the data between the two databases without doing a lookup.
It will produce the same result if the input value is the same, but to find the records which are out of sync with the metadata, you need to do either a join or a lookup anyway.
Last edited by kumar_s on Tue Mar 13, 2007 1:14 am, edited 1 time in total.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

Good point. Thanks.

It should be OK as long as the records in DB1 get SKs allocated. I am assuming that records will be loaded into DB2 (the data quality database) without failure; if they do fail, they have to be fixed and loaded into DB2 again until they succeed... which means all exceptions will end up in DB2 anyway.

This is not fully clear to me at the moment and should be considered during the design. Thanks again.
kumar_s wrote::lol: Yeah, I understand the magic. I am asking how you will reconcile the data between the two databases without doing a lookup.
It will produce the same result if the input value is the same. To find the records which are out of sync with the metadata...
Edward Yuan
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

OK, so you are loading DB2 from DB1 for those records where the SK got populated. And the SK is populated only if the records are out of sync with the metadata. Is this the way you are planning it?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

Hi Kumar_s, to give you more details, I have a simplified job below:

sequential file => Transformer => some additional logic 1 => load into target database (DB 1)
                        ||
                        \/
                some additional logic 2
                        ||
                        \/
       load into data quality database (DB 2)


Suppose record #1 from the sequential file has a quality issue. It is detected in the Transformer stage; the corrected record goes to logic 1 and is then loaded into DB 1, while the original record goes to logic 2 and is loaded into DB 2.

The record in DB 1 should have a data quality SK, which is an FK pointing to the same record in DB 2. The normal way to do this is to allocate the SKs in the logic 2 flow first and then do a lookup/join in the logic 1 flow to get the SKs for DB 1.

To avoid this kind of lookup/join, my idea is to use a hash function in logic 1 and logic 2 respectively to get the same SKs in both; the input for the hash function is the NKs, as in the sketch below. (You could ask why I don't generate the SKs for both in the same Transformer stage. The reason is that there are multiple places in the ETL flow that could detect data quality issues, e.g. one for metadata, another for business rules, etc.)
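
Here is a rough Python sketch of what I mean (the record layout and the dq_sk column name are made up; the real logic would live in the Transformer derivations):

    import hashlib

    def dq_sk(natural_key):
        # The same derivation is used in both flows; no shared state needed.
        return hashlib.md5(natural_key.encode("utf-8")).hexdigest()

    record = {"nk": "CUST001", "mandatory_field": None}   # record #1

    # Logic 1: correct the record and load it into DB 1 with the SK as an FK.
    db1_row = dict(record, mandatory_field="UNKNOWN", dq_sk=dq_sk(record["nk"]))

    # Logic 2: load the original record into DB 2 under the same SK.
    db2_row = dict(record, dq_sk=dq_sk(record["nk"]))

    assert db1_row["dq_sk"] == db2_row["dq_sk"]   # the FK lines up without a join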

Record #2 has no quality issues and is loaded into DB 1 only, so it has no data quality SK in DB 1.
kumar_s wrote:OK, so you are loading DB2 from DB1 for those records where the SK got populated. And the SK is populated only if the records are out of sync with the metadata. Is this the way you are planning it? ...
Edward Yuan
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Sounds reasonable. So those records which don't have any data quality issues will not be populated with an SK.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What's wrong with using the Change Capture, Compare or Difference stage to detect changes, and the Surrogate Key Generator stage to generate surrogate keys?!!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Here the OP is generating them for further analysis. Those records which don't have the mandatory (lookup) key will be given an SK generated from some other existing column, and the main data can then be looked up based on the generated SK.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I know no-one has asked for it, but I thought I would throw in my two cents. I've said this before and I'll say it again. :wink:

I would not consider any kind of 'hashing' algorithm suitable for surrogate keys. Period. If forced to choose between your two options, I'd go MD5 over CRC32, or would push for one of the 'newer' ones whose names escape me at this hour of the morning. :?

There are plenty of references to this inappropriateness out in the wild. This is one that took all of two seconds to find; I'm sure there are 'better' ones.
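
To put rough numbers on that, here's a back-of-the-envelope birthday bound (my own Python sketch, assuming uniformly distributed hash values):

    import math

    def collision_probability(n_rows, hash_bits):
        # Birthday approximation: P ~= 1 - exp(-n(n-1) / 2^(bits+1))
        return 1.0 - math.exp(-n_rows * (n_rows - 1) / 2.0 / 2.0 ** hash_bits)

    print(collision_probability(100_000, 32))    # CRC32: ~0.69 -- a clash is likely
    print(collision_probability(100_000, 128))   # MD5:   ~1.5e-29 -- negligible

With "huge volumes" of rows, CRC32 would almost certainly hand the same key to two different natural keys; MD5's odds are tiny, but still not the zero you get from a plain counter.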

IMHO, stick with a 'traditional' surrogate key: a big incremented integer. Assign or derive absolutely no intelligence to or from it. And depending on how you handle the generation, you may not even need the trip back to the database that you seem to want to avoid. At worst, it's one trip to 'seed' the correct starting point at the beginning, and then possibly one at the end to record the last one used.
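
A minimal sketch of that approach, assuming the database gives you some way to read and persist the last key used (fetch_last_key and save_last_key below are hypothetical stand-ins):

    def fetch_last_key():
        # e.g. SELECT MAX(sk) FROM the target table -- stubbed out here
        return 1000

    def save_last_key(last_key):
        # persist the high-water mark for the next run -- stubbed out here
        pass

    def allocate_keys(rows, seed):
        # Pure in-memory increments: no database round trip per row.
        for offset, row in enumerate(rows, start=1):
            yield dict(row, sk=seed + offset)

    rows = [{"nk": "CUST001"}, {"nk": "CUST002"}]
    seed = fetch_last_key()                  # one trip to seed the starting point
    keyed = list(allocate_keys(rows, seed))
    save_last_key(seed + len(keyed))         # one trip to record the last one used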
-craig

"You can never have too many knives" -- Logan Nine Fingers
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Likewise, Craig. Stick to generating a running number.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Yuan_Edward
Participant
Posts: 73
Joined: Tue May 10, 2005 6:21 pm
Location: Sydney

Post by Yuan_Edward »

Hi DSguru, why do you think running numbers are a better choice for SKs? I know typical hash functions return a fixed-length string. I haven't decided which function to use, and I am not sure whether a hash function can return numbers.
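
For what it's worth, one way I could imagine getting a number out of MD5 is to take part of the digest as an integer (a rough Python sketch; truncating to 8 bytes is my own assumption so the value fits a signed 64-bit column):

    import hashlib

    def numeric_sk(natural_key):
        digest = hashlib.md5(natural_key.encode("utf-8")).digest()
        # First 8 bytes as a non-negative integer, so it fits a signed
        # BIGINT. Truncating the digest raises the collision odds a
        # little, but they stay tiny at 63 bits.
        return int.from_bytes(digest[:8], "big") >> 1

    print(numeric_sk("CUST001"))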
DSguru2B wrote:Likewise, Craig. Stick to generating a running number.
Edward Yuan