Unique Sequence Number Generation In DataStage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
castor.troy81
Participant
Posts: 4
Joined: Fri Aug 25, 2006 6:09 am

Unique Sequence Number Generation In DataStage

Post by castor.troy81 »

Hi,

In my job I am selecting data from a table , performing some transformations and then insert the data into a similar table. There is a particular id field which is not present in my source table but I need to have the field in the target field. The value should be unique sequence numbers starting from 000001 to 099999. And the unique numbers should be generated by the datastage job. Also there are only 15000 records coming from the input.

There is an additional requirement that if we need to re run this job twice , then the sequence numbers that are generated should not start from 000001 but should start from the last id generated from the first run.

I am relatively new to datastage and looking forward to some suggestions from you.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

There are many ways to do this. Surrogate key gen can be used. Manual key can also be generated using @INROWNUM/@OUTROWNUM function.
For the second requirement, you can additionally do a Max(Key) from the target and assign to a file or to a UserStaus. And pass it as a parameter to the next job. You can search for UserStatus for detailed explaination of the process. But simply you can opt the first approach to use Surrogatekey gen.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
castor.troy81
Participant
Posts: 4
Joined: Fri Aug 25, 2006 6:09 am

Post by castor.troy81 »

kumar_s wrote:There are many ways to do this. Surrogate key gen can be used. Manual key can also be generated using @INROWNUM/@OUTROWNUM function.
For the second requirement, you can additionally do a Max(Key) from the target and assign to a file or to a UserStaus. And pass it as a parameter to the next job. You can search for UserStatus for detailed explaination of the process. But simply you can opt the first approach to use Surrogatekey gen.
If I use a surrogate key generator does this mean that if i run my job second time for whatsoever reasons, the id will be generated from the max id of the last run.
trobinson
Participant
Posts: 208
Joined: Thu Apr 11, 2002 6:02 am
Location: Saint Louis
Contact:

Post by trobinson »

USERSTATUS is NOT available to a EE job. Remember that @INROWNUM and @OUTROWNUM will be identical per node. Check out @PARTITIONNUM to alleviate this problem.

YES - The Surrogate key generator will generate the same keys for each run. How can it not since there is no communication of the surrogate key from job run to job run? For this communication (seed) between job runs, you'll need to land the max key of the previous run to disk or db and get it via a parameter.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Read this post by Vincent.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
trobinson
Participant
Posts: 208
Joined: Thu Apr 11, 2002 6:02 am
Location: Saint Louis
Contact:

Post by trobinson »

I find spurious and irrelevant postings to be annoying. Each one should serve a specific detailed purpose that illuminates and advances the cause of Websphere!

or not.

1000 points! Wahoo! I'll take the dsxchange T-shirt; blue, size medium...
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

You ll need to find MAX for previous run and write it into to a file and pass it as a parameter using ExecuteCommand Activity as CommandOutput or as mentioned, use MAX(field) during extraction (if its on same database.)
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

whether you use the surrogate key stage or generate the keys manually using (IN/OUT)ROWNUM, make sure that when create keys, the partitioning is balanced, meaning in each partition, the total number of rows are same.

If the partitions are unbalanced i.e. one having 10 rows and other having other 15 rows, then you may have holes in you generated keys.
Regards,
S. Kirtikumar.
johnthomas
Participant
Posts: 56
Joined: Mon Oct 16, 2006 7:32 am

Post by johnthomas »

For getting the Max key i used the lookup stage to get the max value generated during previous run . Some of the other options suggested works only with server job .
JT
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

That is why i pointed out Vincent's FAQ entry about generation of keys in a px environment. It takes into consideration the number of partitions and current partition number. Refer to my previous post, it has reference to a link.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

If you wanted just an integer ID (1 to 999999) you could use the surrogate key generator, but since you want varchar keys with leading 0s you are better off with the Transformer ID generator in the FAQ linked by DSguru2B. With multiple partitions you will probably get a small number of gaps in the sequence at the end of each run. Eg. 1, 2, 3, 4, 5, 6, 8, 10. To avoid this (regardless of the number of input records) you can force your job to run as a single thread by adding the CONFIG file environment variable and pointing it at a config file with a single node defined.

To set the initial value of the ID I usually pass it in as a job parameter from a Sequence job. I use Set Default Variables sequence job stage to run an operating script to retrieve the maximum value from the target table via command line SQL.

Eg. SELECT MAX(MYID) FROM MYTABLE

This job parameter can then be used in the initial value of the ID field.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

How are you effectively using lookup stage. Bare in mind that the data will be lookup the reference for each record input stream. Finding max after doing all lookup wont be much effective. Isolating this functionality and passing it as parameter might cause less overhead.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Post Reply