Unique Sequence Number Generation In DataStage

castor.troy81 · Post by **castor.troy81** » Tue Jan 09, 2007 4:49 am

Hi,

In my job I am selecting data from a table , performing some transformations and then insert the data into a similar table. There is a particular id field which is not present in my source table but I need to have the field in the target field. The value should be unique sequence numbers starting from 000001 to 099999. And the unique numbers should be generated by the datastage job. Also there are only 15000 records coming from the input.

There is an additional requirement that if we need to re run this job twice , then the sequence numbers that are generated should not start from 000001 but should start from the last id generated from the first run.

I am relatively new to datastage and looking forward to some suggestions from you.

kumar_s · Post by **kumar_s** » Tue Jan 09, 2007 4:57 am

There are many ways to do this. Surrogate key gen can be used. Manual key can also be generated using @INROWNUM/@OUTROWNUM function.
For the second requirement, you can additionally do a Max(Key) from the target and assign to a file or to a UserStaus. And pass it as a parameter to the next job. You can search for UserStatus for detailed explaination of the process. But simply you can opt the first approach to use Surrogatekey gen.

castor.troy81 · Post by **castor.troy81** » Tue Jan 09, 2007 5:00 am

kumar_s wrote:There are many ways to do this. Surrogate key gen can be used. Manual key can also be generated using @INROWNUM/@OUTROWNUM function.
For the second requirement, you can additionally do a Max(Key) from the target and assign to a file or to a UserStaus. And pass it as a parameter to the next job. You can search for UserStatus for detailed explaination of the process. But simply you can opt the first approach to use Surrogatekey gen.

If I use a surrogate key generator does this mean that if i run my job second time for whatsoever reasons, the id will be generated from the max id of the last run.

trobinson · Post by **trobinson** » Tue Jan 09, 2007 8:39 am

USERSTATUS is NOT available to a EE job. Remember that @INROWNUM and @OUTROWNUM will be identical per node. Check out @PARTITIONNUM to alleviate this problem.

YES - The Surrogate key generator will generate the same keys for each run. How can it not since there is no communication of the surrogate key from job run to job run? For this communication (seed) between job runs, you'll need to land the max key of the previous run to disk or db and get it via a parameter.

DSguru2B · Post by **DSguru2B** » Tue Jan 09, 2007 8:48 am

Read this post by Vincent.

trobinson · Post by **trobinson** » Tue Jan 09, 2007 8:59 am

I find spurious and irrelevant postings to be annoying. Each one should serve a specific detailed purpose that illuminates and advances the cause of Websphere!

or not.

1000 points! Wahoo! I'll take the dsxchange T-shirt; blue, size medium...

kumar_s · Post by **kumar_s** » Tue Jan 09, 2007 5:49 pm

You ll need to find MAX for previous run and write it into to a file and pass it as a parameter using ExecuteCommand Activity as CommandOutput or as mentioned, use MAX(field) during extraction (if its on same database.)

Kirtikumar · Post by **Kirtikumar** » Wed Jan 10, 2007 1:04 am

whether you use the surrogate key stage or generate the keys manually using (IN/OUT)ROWNUM, make sure that when create keys, the partitioning is balanced, meaning in each partition, the total number of rows are same.

If the partitions are unbalanced i.e. one having 10 rows and other having other 15 rows, then you may have holes in you generated keys.

johnthomas · Post by **johnthomas** » Wed Jan 10, 2007 6:35 am

For getting the Max key i used the lookup stage to get the max value generated during previous run . Some of the other options suggested works only with server job .

DSguru2B · Post by **DSguru2B** » Wed Jan 10, 2007 7:38 am

That is why i pointed out Vincent's FAQ entry about generation of keys in a px environment. It takes into consideration the number of partitions and current partition number. Refer to my previous post, it has reference to a link.

vmcburney · Post by **vmcburney** » Wed Jan 10, 2007 4:56 pm

If you wanted just an integer ID (1 to 999999) you could use the surrogate key generator, but since you want varchar keys with leading 0s you are better off with the Transformer ID generator in the FAQ linked by DSguru2B. With multiple partitions you will probably get a small number of gaps in the sequence at the end of each run. Eg. 1, 2, 3, 4, 5, 6, 8, 10. To avoid this (regardless of the number of input records) you can force your job to run as a single thread by adding the CONFIG file environment variable and pointing it at a config file with a single node defined.

To set the initial value of the ID I usually pass it in as a job parameter from a Sequence job. I use Set Default Variables sequence job stage to run an operating script to retrieve the maximum value from the target table via command line SQL.

Eg. SELECT MAX(MYID) FROM MYTABLE

This job parameter can then be used in the initial value of the ID field.

kumar_s · Post by **kumar_s** » Wed Jan 10, 2007 5:05 pm

How are you effectively using lookup stage. Bare in mind that the data will be lookup the reference for each record input stream. Finding max after doing all lookup wont be much effective. Isolating this functionality and passing it as parameter might cause less overhead.