Surrogate Key using a state file or Database

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
collabxchange
Premium Member
Premium Member
Posts: 34
Joined: Thu Aug 28, 2014 8:48 pm
Location: United States

Surrogate Key using a state file or Database

Post by collabxchange »

Hi All,
I am trying to generate surrogate key or a sequence number for one of my job. As I understand there are 3 ways to do it.

Option # 1: Use Database Sequence if your target is a table. Unfortunately, I can't use this.

Option # 2: Use Surrogate key generator or the surrogate key tab in the stages like transformer and then use a Flat File or DB sequence. If I go for the flat file approach with "In blocks of : 1" the job is performing extremely slow. Its like taking 3 extra hours for processing 200K records. I am setting the block to 1 as I need the surrogate key to be generated sequentially when I am using 4 nodes. For DB Sequence, the only provided option are DB2 and Oracle. My XMETA is on DB2 and we also have Netezza for our DW but its not listed. How do I use the DB2 database for this?

Option # 3: Use the following derivation in the transformer. But to use this I am joining/looking up my target to fetch the max number generated in the previous run so as to continue the sequence. That's slowing down the job too.

@PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + NullToValue(field_name, 1)

Is there any better option other than above where the performance doesn't take a hit; surrogate key is generated sequentially on a 4-node environment; and it also continues from the point where it stopped in the previous run?

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Re: Surrogate Key using a state file or Database

Post by ray.wurlod »

collabxchange wrote:I need the surrogate key to be generated sequentially
Why?

A surrogate key guarantees uniqueness - nothing else. A bigger block size would be my main recommendation.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
collabxchange
Premium Member
Premium Member
Posts: 34
Joined: Thu Aug 28, 2014 8:48 pm
Location: United States

Post by collabxchange »

That is true. I guess it was a requirement coming from a database person who has used DB sequence always but I agree with you. Would you recommend a specific block size of do you recommend the system selected block size?

Also, is there a way to tell DS to just maintain it in the XMETA database by default instead of maintaining it as a state file?


The other question I have on State File is that is it locked by the job when it is currently running? Say for e.g. I have 1 state file for Table A. Now Table A can be populated via 2 different jobs that can run in parallel. Can both these jobs uses this single state file at the same time?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

System generated block size should be OK unless you have some particular reason for differing.

There is no convenient way to maintain state information in XMETA. (And that's not what XMETA is for.)

Updating the state file from two different jobs is, at the very least, dangerous. But you can read (blocks of values) from the state file using as many jobs at a time as you desire.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
collabxchange
Premium Member
Premium Member
Posts: 34
Joined: Thu Aug 28, 2014 8:48 pm
Location: United States

Post by collabxchange »

Thank you, Ray. I will use the state file in the jobs are see how it goes.

BTW, what is your opinion about using the inbuilt DB sequence? The only options are DB2 and Oracle. Since we have DB2 for XMETA repository, I can ask for a separate database in it for generating sequence. Will it provide any performance benefits as opposed to flat file approach?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I doubt it. It might get close, if your engine, services and repository tiers share the same machine. But a state file is right there in the file system.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
RPhani
Participant
Posts: 32
Joined: Sun Aug 26, 2012 7:03 am
Location: Hyd

Post by RPhani »

Hi,

I think one more way to generate unique numbers.

Prepare a procedure and call it to job by StoredProcedure Stage.

Use this STP Stage before loading data to target.

Thanks,
RPhani
qt_ky
Premium Member
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

As a DataStage Developer, I always prefer to generate unique numbers directly within DataStage itself.

Any time you reach out to a database for whatever reason, it:
  • - creates extra database I/O operations
    - adds complexity
    - slows down performance
:wink:
Choose a job you love, and you will never have to work a day in your life. - Confucius
Post Reply