
Surrogate Key using a state file or Database

Posted: Wed Oct 15, 2014 1:06 pm
by collabxchange
Hi All,
I am trying to generate a surrogate key or a sequence number for one of my jobs. As I understand it, there are three ways to do it.

Option # 1: Use Database Sequence if your target is a table. Unfortunately, I can't use this.

Option # 2: Use the Surrogate Key Generator stage, or the surrogate key tab in stages like the Transformer, backed by either a flat file or a DB sequence. If I go for the flat file approach with "In blocks of : 1", the job is extremely slow; it takes about 3 extra hours to process 200K records. I am setting the block size to 1 because I need the surrogate key to be generated sequentially while running on 4 nodes. For a DB sequence, the only options provided are DB2 and Oracle. My XMETA is on DB2 and we also have Netezza for our DW, but Netezza is not listed. How do I use the DB2 database for this?
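
To illustrate why "In blocks of : 1" hurts: each node reserves a block of values from the state file and only goes back when that block is used up, so a block of 1 means a trip to the state file for every single row. Here is a rough sketch of the idea in plain Python (this is not DataStage's actual implementation; the file name and block size are made up):

# Sketch of block-based key allocation (illustrative only).
# Each node reserves BLOCK_SIZE values at a time from a shared state file,
# so a larger block means far fewer state-file trips per row processed.

BLOCK_SIZE = 1000              # with "In blocks of : 1" this would be 1
STATE_FILE = "skg_state.txt"   # hypothetical state file name

def reserve_block(state_file: str, block_size: int) -> range:
    """Read the current high-water mark, advance it, and return the reserved range."""
    try:
        with open(state_file) as f:
            start = int(f.read().strip() or 0)
    except FileNotFoundError:
        start = 0
    with open(state_file, "w") as f:
        f.write(str(start + block_size))
    return range(start + 1, start + block_size + 1)

# A node asks for a new block only when the previous one is exhausted.
keys = iter(reserve_block(STATE_FILE, BLOCK_SIZE))

With a block size of 1, every row forces another reserve_block call, which is roughly why the 200K-row run takes those extra hours.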

Option # 3: Use the following derivation in the Transformer. But to use this I am joining/looking up against my target to fetch the max number generated in the previous run so that the sequence continues. That lookup is slowing down the job too.

@PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + NullToValue(field_name, 1)
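
To show how that derivation spreads values across partitions, here is a rough worked sketch in plain Python (assuming 4 partitions and a first run where NullToValue(field_name, 1) falls back to 1; all numbers are illustrative):

# Worked example of the transformer derivation above.
NUM_PARTITIONS = 4   # @NUMPARTITIONS
PREV = 1             # NullToValue(field_name, 1) on a first run

def surrogate_key(partition_num: int, in_row_num: int) -> int:
    # @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + NullToValue(field_name, 1)
    return partition_num + NUM_PARTITIONS * (in_row_num - 1) + PREV

for row in range(1, 4):                           # @INROWNUM starts at 1
    print([surrogate_key(p, row) for p in range(NUM_PARTITIONS)])
# row 1 -> [1, 2, 3, 4]
# row 2 -> [5, 6, 7, 8]
# row 3 -> [9, 10, 11, 12]

Each partition writes into its own interleaved slot, so the keys are unique, but they are only dense and gap-free when every partition sees the same number of rows.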

Is there any better option than the above where performance doesn't take a hit, the surrogate key is generated sequentially on a 4-node environment, and it also continues from the point where it stopped in the previous run?

Thanks

Re: Surrogate Key using a state file or Database

Posted: Wed Oct 15, 2014 3:22 pm
by ray.wurlod
collabxchange wrote: I need the surrogate key to be generated sequentially
Why?

A surrogate key guarantees uniqueness - nothing else. A bigger block size would be my main recommendation.

Posted: Wed Oct 15, 2014 3:35 pm
by collabxchange
That is true. I guess it was a requirement coming from a database person who has always used DB sequences, but I agree with you. Would you recommend a specific block size, or do you recommend the system-selected block size?

Also, is there a way to tell DS to just maintain it in the XMETA database by default instead of maintaining it as a state file?


The other question I have on the state file: is it locked by the job while it is running? Say, for example, I have one state file for Table A, and Table A can be populated via two different jobs that can run in parallel. Can both these jobs use this single state file at the same time?

Posted: Wed Oct 15, 2014 6:19 pm
by ray.wurlod
The system-generated block size should be OK unless you have some particular reason for differing.

There is no convenient way to maintain state information in XMETA. (And that's not what XMETA is for.)

Updating the state file from two different jobs is, at the very least, dangerous. But you can read (blocks of values) from the state file using as many jobs at a time as you desire.
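
To illustrate the danger, here is a simplified sketch in plain Python, assuming the state file is essentially a counter on disk (this is not necessarily DataStage's actual file format or locking behaviour):

# Two unsynchronised writers racing on the same counter.
# Both "jobs" read the same high-water mark, reserve overlapping blocks,
# and write back the same new value, so duplicate keys get handed out.

counter = 1000                 # value currently stored in the state file
BLOCK = 100

job_a_start = counter          # Job A reads
job_b_start = counter          # Job B reads before A has written back

counter = job_a_start + BLOCK  # Job A writes 1100
counter = job_b_start + BLOCK  # Job B also writes 1100

job_a_keys = range(job_a_start + 1, job_a_start + BLOCK + 1)   # 1001..1100
job_b_keys = range(job_b_start + 1, job_b_start + BLOCK + 1)   # 1001..1100 again
print(len(set(job_a_keys) & set(job_b_keys)))                  # 100 duplicate keys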

Posted: Wed Oct 15, 2014 7:47 pm
by collabxchange
Thank you, Ray. I will use the state file in the jobs and see how it goes.

BTW, what is your opinion about using the built-in DB sequence? The only options are DB2 and Oracle. Since we have DB2 for the XMETA repository, I can ask for a separate database in it for generating the sequence. Will it provide any performance benefit compared to the flat file approach?

Posted: Wed Oct 15, 2014 7:55 pm
by ray.wurlod
I doubt it. It might get close, if your engine, services and repository tiers share the same machine. But a state file is right there in the file system.

Posted: Fri Oct 17, 2014 5:12 am
by RPhani
Hi,

I think there is one more way to generate unique numbers.

Prepare a stored procedure and call it from the job using the Stored Procedure (STP) stage.

Use this STP stage before loading the data to the target.

Thanks,
RPhani

Posted: Fri Oct 17, 2014 10:30 am
by qt_ky
As a DataStage Developer, I always prefer to generate unique numbers directly within DataStage itself.

Any time you reach out to a database for whatever reason, it:
  - creates extra database I/O operations
  - adds complexity
  - slows down performance
:wink: