Unique Sequence Number Generation In DataStage
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 4
- Joined: Fri Aug 25, 2006 6:09 am
Unique Sequence Number Generation In DataStage
Hi,
In my job I am selecting data from a table , performing some transformations and then insert the data into a similar table. There is a particular id field which is not present in my source table but I need to have the field in the target field. The value should be unique sequence numbers starting from 000001 to 099999. And the unique numbers should be generated by the datastage job. Also there are only 15000 records coming from the input.
There is an additional requirement that if we need to re run this job twice , then the sequence numbers that are generated should not start from 000001 but should start from the last id generated from the first run.
I am relatively new to datastage and looking forward to some suggestions from you.
In my job I am selecting data from a table , performing some transformations and then insert the data into a similar table. There is a particular id field which is not present in my source table but I need to have the field in the target field. The value should be unique sequence numbers starting from 000001 to 099999. And the unique numbers should be generated by the datastage job. Also there are only 15000 records coming from the input.
There is an additional requirement that if we need to re run this job twice , then the sequence numbers that are generated should not start from 000001 but should start from the last id generated from the first run.
I am relatively new to datastage and looking forward to some suggestions from you.
There are many ways to do this. Surrogate key gen can be used. Manual key can also be generated using @INROWNUM/@OUTROWNUM function.
For the second requirement, you can additionally do a Max(Key) from the target and assign to a file or to a UserStaus. And pass it as a parameter to the next job. You can search for UserStatus for detailed explaination of the process. But simply you can opt the first approach to use Surrogatekey gen.
For the second requirement, you can additionally do a Max(Key) from the target and assign to a file or to a UserStaus. And pass it as a parameter to the next job. You can search for UserStatus for detailed explaination of the process. But simply you can opt the first approach to use Surrogatekey gen.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
-
- Participant
- Posts: 4
- Joined: Fri Aug 25, 2006 6:09 am
If I use a surrogate key generator does this mean that if i run my job second time for whatsoever reasons, the id will be generated from the max id of the last run.kumar_s wrote:There are many ways to do this. Surrogate key gen can be used. Manual key can also be generated using @INROWNUM/@OUTROWNUM function.
For the second requirement, you can additionally do a Max(Key) from the target and assign to a file or to a UserStaus. And pass it as a parameter to the next job. You can search for UserStatus for detailed explaination of the process. But simply you can opt the first approach to use Surrogatekey gen.
USERSTATUS is NOT available to a EE job. Remember that @INROWNUM and @OUTROWNUM will be identical per node. Check out @PARTITIONNUM to alleviate this problem.
YES - The Surrogate key generator will generate the same keys for each run. How can it not since there is no communication of the surrogate key from job run to job run? For this communication (seed) between job runs, you'll need to land the max key of the previous run to disk or db and get it via a parameter.
YES - The Surrogate key generator will generate the same keys for each run. How can it not since there is no communication of the surrogate key from job run to job run? For this communication (seed) between job runs, you'll need to land the max key of the previous run to disk or db and get it via a parameter.
You ll need to find MAX for previous run and write it into to a file and pass it as a parameter using ExecuteCommand Activity as CommandOutput or as mentioned, use MAX(field) during extraction (if its on same database.)
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
-
- Participant
- Posts: 437
- Joined: Fri Oct 15, 2004 6:13 am
- Location: Pune, India
whether you use the surrogate key stage or generate the keys manually using (IN/OUT)ROWNUM, make sure that when create keys, the partitioning is balanced, meaning in each partition, the total number of rows are same.
If the partitions are unbalanced i.e. one having 10 rows and other having other 15 rows, then you may have holes in you generated keys.
If the partitions are unbalanced i.e. one having 10 rows and other having other 15 rows, then you may have holes in you generated keys.
Regards,
S. Kirtikumar.
S. Kirtikumar.
-
- Participant
- Posts: 56
- Joined: Mon Oct 16, 2006 7:32 am
That is why i pointed out Vincent's FAQ entry about generation of keys in a px environment. It takes into consideration the number of partitions and current partition number. Refer to my previous post, it has reference to a link.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
If you wanted just an integer ID (1 to 999999) you could use the surrogate key generator, but since you want varchar keys with leading 0s you are better off with the Transformer ID generator in the FAQ linked by DSguru2B. With multiple partitions you will probably get a small number of gaps in the sequence at the end of each run. Eg. 1, 2, 3, 4, 5, 6, 8, 10. To avoid this (regardless of the number of input records) you can force your job to run as a single thread by adding the CONFIG file environment variable and pointing it at a config file with a single node defined.
To set the initial value of the ID I usually pass it in as a job parameter from a Sequence job. I use Set Default Variables sequence job stage to run an operating script to retrieve the maximum value from the target table via command line SQL.
Eg. SELECT MAX(MYID) FROM MYTABLE
This job parameter can then be used in the initial value of the ID field.
To set the initial value of the ID I usually pass it in as a job parameter from a Sequence job. I use Set Default Variables sequence job stage to run an operating script to retrieve the maximum value from the target table via command line SQL.
Eg. SELECT MAX(MYID) FROM MYTABLE
This job parameter can then be used in the initial value of the ID field.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
How are you effectively using lookup stage. Bare in mind that the data will be lookup the reference for each record input stream. Finding max after doing all lookup wont be much effective. Isolating this functionality and passing it as parameter might cause less overhead.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'