Generation of Surrogate Key

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

tsktsk123
Participant
Posts: 32
Joined: Thu Dec 25, 2003 11:59 am

Generation of Surrogate Key

Post by tsktsk123 »

Hi,

The requirement is to load a surrogate key into the target table to maintain uniqueness in the target table. We need to run this job daily, and new records will be loaded to the target. I'd like to know whether DataStage saves the last surrogate key generated during its previous run and increments it by 1 when we start the job the next day. Can someone throw any light on this? I know this is possible in Informatica, as it saves the last generated surrogate key in the repository. I'd like to know whether the same scenario applies in DataStage. I appreciate any help on this.

regards,
senthil kumar
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

No, DS does not store the last key used unless you use a mechanism that does so. In Server jobs, the SDK key management routines work this way.

BOTH INFORMATICA AND THE SDK LIBRARY METHODS ARE BAD.

The system of record for the last value assigned IS THE TARGET TABLE. Not a database sequence, not a local file, not a tool library.

So I suggest that whatever method you decide to use first does a SELECT MAX(skey) FROM target_table and updates the stored value.
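A minimal sketch of that seeding approach, using Python with an in-memory SQLite database as a stand-in for the target warehouse table (the table name, column names, and sample rows are all hypothetical):

```python
import sqlite3

# Stand-in for the target warehouse table (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_table (skey INTEGER, name TEXT)")
conn.executemany("INSERT INTO target_table VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])

# Seed the generator from the table itself -- the system of record.
# COALESCE handles the empty-table case by starting from 0.
max_skey = conn.execute(
    "SELECT COALESCE(MAX(skey), 0) FROM target_table").fetchone()[0]

# Assign new keys by incrementing from the known maximum.
new_rows = ["d", "e"]
assigned = [(max_skey + i + 1, name) for i, name in enumerate(new_rows)]
conn.executemany("INSERT INTO target_table VALUES (?, ?)", assigned)
print(assigned)  # [(4, 'd'), (5, 'e')]
```

Because the seed comes from the table at the start of every run, a refresh from PROD or a one-off load cannot leave the generator behind the real maximum.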

There's a concept of metaphysical certitude versus good-enough. I like to build better systems than good-enough.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

First suggestion would be to search the forums as you're not the first to ask this. Secondly, does not the Surrogate Key Generator stage exist in your version?
-craig

"You can never have too many knives" -- Logan Nine Fingers

Post by kcbland »

One more comment on this. I love :twisted: hearing about people screwing up the data in a DEV or TEST environment because of their surrogate key management.

Imagine the scenario where the last-used surrogate key in the local file/tool is 1000000. But, you now refresh the data in that environment from PROD where the data actually has a last-used value of 2000000.

If your ETL jobstream doesn't do the select max then already assigned values are used again. The 2 seconds it takes to update the file seems like a small penalty to pay for always making sure your ETL code assigns unique values.

This happens ALL OF THE TIME and it makes me chuckle. :twisted:
Kenneth Bland

Post by chulett »

Welcome to the dark side of Kenneth Bland, ladies and gentlemen. :P
-craig

Post by kcbland »

Today is just one of those days. :twisted:
Kenneth Bland
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

On the other hand, repairing the damage is a potential source of revenue for consultants such as KBA - on that basis you could understand Ken's glee.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by kcbland »

Mr. Glee, that's me. :twisted:
Kenneth Bland
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Post by JoshGeorge »

I have a blog post on the same topic - Surrogate Key Generation in DataStage - An elegant way. If you know how to use the DataStage Job Control Interfaces specified in the Advanced Developer Guide, this will be really helpful.
Joshy George

Post by kcbland »

Great post; I just disagree with every molecule in my body with not initializing the seed value from a SELECT MAX operation on the database.

I've seen folks store the max skeys in a control table, in database Sequence values, etc. In addition to my previous example, there's also the situation where a one-off load into a table happens and someone forgets to reset the stored max skey to the new max.

For the life of me I can't figure out why it's such a big deal to select the max surrogate key on tables. These columns are usually indexed, either as a primary key (a bad idea) or, hopefully, by a secondary index. The database spends more time parsing the query than reading the max value off the index.
Kenneth Bland

Post by JoshGeorge »

Thanks Kenneth!

The post is a complete solution using the tool. I have emphasised that point and haven't touched the "database vs tool" topic. I have also given notes on how the design makes sure the latest value is stored and retrieved correctly even when several jobs try to generate surrogate keys for the same target table.
kcbland wrote: I just disagree with every molecule in my body with the act of not initializing the seed value from a select max operation on the database.
Tool-based key generation in DataStage server jobs using the SDK routines / KeyMgtGetNextValue doesn't do that. This approach also doesn't take the seed from the database.
kcbland wrote: In addition to my previous example, there's also the situation where a one-off load into a table is used and someone forgets to reset the stored max skey to the new max.
Will never happen with the job design I have specified in the post.
kcbland wrote: The database spends more time parsing the query than reading the max value off the index.
Yes!

Control table, state files, database Sequence, the approach I mentioned in the post, or the one you described above - the choice comes down to your ETL architecture and personal comfort.

Again, thanks for the comments.
Joshy George

Post by kcbland »

Whoa... wait a minute. I simply stated that the definitive, absolute, 100% guaranteed-never-to-fail, always-right place to go is the table itself: ask it what the max used value is. This point can't be debated. There can be only one absolutely best answer to what the max used key value is. When I lay out architecture plans for customers, this is a fundamental point: choose metaphysically certain approaches whenever possible to close the door on those pesky little bite-you-in-the-behind problems before they get in.

I also never said the SDK routines do it right. I know not everyone is going to search, but on this forum I have pontificated about always setting the maximum value used within an ETL jobstream cycle to guarantee the job starts from the maximum used. I have spent years explaining why the SDK routines have serious flaws and regularly recommend that folks not use them. It's not the fault of the routines that folks don't build the select max process to keep them in sync. :cry:

As for tightly controlling data loading into an enterprise data warehouse, I've never won that battle. There's always some other team or group that sneaks a data load into a table, massages data, whatever, and we find out after the fact. In embedded applications where a single team manages the data, it's easier to route everything through that team. But in companies with 100+ developers divided across 10 teams, things always get done out of standard or unconventionally. Surrogate key reuse is just one of the nastier problems we face.

We usually spend so much time arguing that it's better to assign within the ETL jobstream than during loading that when it screws up we look like idiots. :cry: I hate looking like an idiot. :lol:
Kenneth Bland

Post by kcbland »

I like to include a single job at the start of every jobstream like this:

ODBC/OCI-->XFM-->HASH/SEQ

The user-defined SQL is something like this:

SELECT 'TABLE_A', MAX(SKEY) FROM TABLE_A
UNION ALL
SELECT 'TABLE_B', MAX(SKEY) FROM TABLE_B
UNION ALL
SELECT 'TABLE_C', MAX(SKEY) FROM TABLE_C
UNION ALL
SELECT 'TABLE_D', MAX(SKEY) FROM TABLE_D
You can make the HASH file the SDK_SEQUENCES file, or your own if you use better surrogate key assignment functions than what's supplied by the SDK.

If you write it to a SEQ file, in the Transformer stage modify the output to format the data like environment variables. Ex:

EXPORT MAX_TABLE_A=1234
Or, you can leave it in a SEQ file and in before-routines or stage-variable initialization steps "grep" the file for your table and have it return as the initial value of the stage variable.
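The SEQ-file lookup step can be sketched in Python; the file format follows the EXPORT example above, and the table names are hypothetical:

```python
# Contents of a SEQ file of "EXPORT MAX_<TABLE>=<value>" lines
# (format assumed from the export example above; values are made up).
seq_contents = """\
EXPORT MAX_TABLE_A=1234
EXPORT MAX_TABLE_B=98
EXPORT MAX_TABLE_C=407
"""

def seed_for(table, text):
    """Return the stored max surrogate key for `table`, or 0 if absent."""
    prefix = f"EXPORT MAX_{table}="
    for line in text.splitlines():
        if line.startswith(prefix):
            return int(line[len(prefix):])
    return 0

print(seed_for("TABLE_A", seq_contents))  # 1234
```

The returned value becomes the initial value of the stage variable, and the job increments from there.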

You can do anything you want, but my ONLY point I want to make is to recommend that you ALWAYS start your assignment from the absolutely known maximum value used. It's a best practice. :wink:
Kenneth Bland

Post by chulett »

We get it, we get it. :wink:

Since you seem to be on a roll today, Ken, let's continue rolling. The '(bad idea)' part of this caught my attention:
kcbland wrote:For the life of me I can't figure out why it's such a big deal to select the max surrogate key on tables. These columns are usually indexed, either as a primary key (bad idea) but hopefully by a secondary index.
Why for bad idea?
-craig

Post by kcbland »

For Craig:

If you create a table with a primary key index in Oracle, Sybase, etc, you can't drop the index without dropping the table. If you create the table without a primary key index, but add a unique index afterwards, the database will use that one just the same as a primary key index.

Now why would one want to drop a primary key index? DUPLICATES!! AHAHHHAHHAHA :twisted: I got 'em!

If you DIRECT PATH bulk load your table in Oracle, the duplicates aren't found until the load has completed and the deferred index updates are attempted. AAHAHAHHAHAHAH Unusable indexes due to duplicates! Now try to find them!!! AHAHAHHAAH You can't query the table and you can't drop the index. :shock:

If you created a secondary unique index, you have to find duplicates without the benefit of an index. :shock: That's why it's recommended that you use non-unique secondary indexes instead of unique or primary key indexes in data warehouses.
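The payoff of a non-unique secondary index can be illustrated with Python and SQLite as a stand-in for Oracle (table and index names are hypothetical; SQLite has no direct path load, so this shows only the duplicate-hunting side):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (skey INTEGER, val TEXT)")

# Non-unique secondary index on the surrogate key: a load containing
# duplicate keys still succeeds, and the table stays queryable...
conn.execute("CREATE INDEX fact_skey_ix ON fact (skey)")
conn.executemany("INSERT INTO fact VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (2, "b2"), (3, "c")])

# ...so the duplicates can be found afterwards with an indexed query.
dupes = conn.execute(
    "SELECT skey, COUNT(*) FROM fact GROUP BY skey HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [(2, 2)]
```

With a primary key or unique index, the equivalent load would have failed (or, in Oracle's direct path case, left the index unusable), and the bad rows would have to be hunted down without index support.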

So, go to AskTom.Oracle.com and read the wisdom therein; better yet, buy his books. He speaks specifically to the differences between OLTP and OLAP architectures. The gem is the one about non-unique secondary indexes - that one is powerful in Oracle DIRECT path loaded, surrogate-key-managed environments. 8)
Kenneth Bland