Question regarding Surrogate Key / State Files
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 457
- Joined: Tue Sep 25, 2007 4:05 pm
Question regarding Surrogate Key / State Files
Gurus,
This post is with regards to Surrogate Key / Sequence generation and State files.
Scenario:
I have an bunch of dimension tables (10-12) which have a Sequence number generated (Block size of 100,000). In the target, the field datatype is Decimal( 18 ).
The requirement is to generate an 18-digit number, which should not be repeated anywhere else within those dimension tables. I mean, if a Surrogate/Sequence number is generated, it should not be generated again for any other table.
FYI...data is already present in those tables. The existing logic (in a different tool) is to generate an integer and add an 18 digit number (5041000...00) to that. Also, this should be generated in 100000 blocks. I mean, if I generated numbers from 1 to 200 today, tomorrow it should generate from 100001 and so on {I had an earlier post about this...which had a glibc error...that got resolved}
My solution:
Load all the existing Surrogate/Sequence numbers into one state file, so that the existing numbers won't get generated again. The same state file should be used throughout all the jobs (should be run in sequence, of course).
Questions:
> Some tables have more than 25 million rows in them. So, the state file with all those ID values might contain more than 70-80 million IDs. Is this a good approach to ensure the numbers generated are unique through out the database as I am concerned that the performance of the job that utilizes this state file will be really slow.
> How does state file store the IDs that I insert? As ranges or as individual numbers?
> I noticed that we can call a state file from Transformer. Does the state file get updated with the IDs generated within this process once the job is finished?
> How do I ensure that the numbers are generated in a sequence (not in this scenario, but in general)?
Thanks in advance for any inputs on this...
This post is with regards to Surrogate Key / Sequence generation and State files.
Scenario:
I have an bunch of dimension tables (10-12) which have a Sequence number generated (Block size of 100,000). In the target, the field datatype is Decimal( 18 ).
The requirement is to generate an 18-digit number, which should not be repeated anywhere else within those dimension tables. I mean, if a Surrogate/Sequence number is generated, it should not be generated again for any other table.
FYI...data is already present in those tables. The existing logic (in a different tool) is to generate an integer and add an 18 digit number (5041000...00) to that. Also, this should be generated in 100000 blocks. I mean, if I generated numbers from 1 to 200 today, tomorrow it should generate from 100001 and so on {I had an earlier post about this...which had a glibc error...that got resolved}
My solution:
Load all the existing Surrogate/Sequence numbers into one state file, so that the existing numbers won't get generated again. The same state file should be used throughout all the jobs (should be run in sequence, of course).
Questions:
> Some tables have more than 25 million rows in them. So, the state file with all those ID values might contain more than 70-80 million IDs. Is this a good approach to ensure the numbers generated are unique through out the database as I am concerned that the performance of the job that utilizes this state file will be really slow.
> How does state file store the IDs that I insert? As ranges or as individual numbers?
> I noticed that we can call a state file from Transformer. Does the state file get updated with the IDs generated within this process once the job is finished?
> How do I ensure that the numbers are generated in a sequence (not in this scenario, but in general)?
Thanks in advance for any inputs on this...
Vivek Gadwal
Experience is what you get when you didn't get what you wanted
Experience is what you get when you didn't get what you wanted
-
- Premium Member
- Posts: 457
- Joined: Tue Sep 25, 2007 4:05 pm
I was under the same impression too. But, I saw some jobs that were existing in this shop that have all the IDs loaded (for them, the ID field is not unique...so they were loading DISTINCT values into the state file). When inquired, they told me that this way, DataStage ensures that duplicates are not issued by reading the state file, and that State file stores all the numbers to achieve this. Unfortunately, they are not here for the next month (holidays) and I came across this issue very recently.chulett wrote:I would imagine that a state file holds the same information that a sequence or other 'generator' type object would in a database - the most recent value.
Also, from the documentation for the Surrogate Key generator stage as well as reading some posts here, I understand that the numbers produced will not be in sequence. That is the reason why I got those questions in my mind while designing the process!
Vivek Gadwal
Experience is what you get when you didn't get what you wanted
Experience is what you get when you didn't get what you wanted
They will be in sequence, what they may not be is in a contiguous sequence - i.e. there may be "holes" in the sequence but that's not an issue for a surrogate key.
Think about it - there's no way the stage is checking that entire list every time it issues a number to ensure it doesn't re-issue a old number. It "knows" it shouldn't be issuing duplicates because it always takes the highest / last number issued and increments it by 1. The only way that doesn't work is if you lie to it and improperly initialize the state file. That or don't handle concurrency properly.
Think about it - there's no way the stage is checking that entire list every time it issues a number to ensure it doesn't re-issue a old number. It "knows" it shouldn't be issuing duplicates because it always takes the highest / last number issued and increments it by 1. The only way that doesn't work is if you lie to it and improperly initialize the state file. That or don't handle concurrency properly.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
-
- Premium Member
- Posts: 457
- Joined: Tue Sep 25, 2007 4:05 pm
Thanks for the response.chulett wrote:They will be in sequence, what they may not be is in a contiguous sequence - i.e. there may be "holes" in the sequence but that's not an issue for a surrogate key.
Think about it - there's no way the stage is checking that entire list every time it issues a number to ensure it doesn't re-issue a old number. It "knows" it shouldn't be issuing duplicates because it always takes the highest / last number issued and increments it by 1. The only way that doesn't work is if you lie to it and improperly initialize the state file. That or don't handle concurrency properly.
Yes, what you said makes sense. I guess, it will store the numbers in a particular way or store it in a range so that it will not output any numbers within that range. Hence, my earlier question about the storage.
Is there a way to make it contiguous? What if there is a requirement to generate a sequence number which is contiguous, like 1001, 1002...? How to make this sequence number unique wherever the state file is called? Let me put it this way...if I have a requirement exactly the same as I have right now, but they need a sequence number instead of a surrogate value?
Vivek Gadwal
Experience is what you get when you didn't get what you wanted
Experience is what you get when you didn't get what you wanted
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
A sequential number can only be guaranteed with sequential mode operation. "They" can have what they want, provided they don't also want minimum execution time.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 457
- Joined: Tue Sep 25, 2007 4:05 pm
Thanks Ray.ray.wurlod wrote:A sequential number can only be guaranteed with sequential mode operation. "They" can have what they want, provided they don't also want minimum execution time.
One last question: When a state file is called from a Transformer and NOT from a Surrogate Key Generator stage, will the file be updated back so that the next time I use it in a different job (as I cannot use it within the same job and expect a unique number), it will generate unique surrogate values?
Vivek Gadwal
Experience is what you get when you didn't get what you wanted
Experience is what you get when you didn't get what you wanted
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
-
- Premium Member
- Posts: 457
- Joined: Tue Sep 25, 2007 4:05 pm
Craig,chulett wrote:What makes you say that?vivekgadwal wrote:(as I cannot use it within the same job and expect a unique number)
I have read it somewhere in one of the blogs and also I learnt it from one of the folks I talked to here. The explanation that I gathered is, as it is the same process and the same state file is accessed within it, it will produce duplicate values.
Please correct me if this is incorrect.
Vivek Gadwal
Experience is what you get when you didn't get what you wanted
Experience is what you get when you didn't get what you wanted
-
- Premium Member
- Posts: 457
- Joined: Tue Sep 25, 2007 4:05 pm
Thanks Ray. Then may be, I can use the state file from within the Transformer without explicitly using Surrogate Key Generator stage within the jobs. Of course, I realize that the initialization of the state file has to take place through the SK Generator stageray.wurlod wrote:Provided that it's not re-initialized, a state file will always yield unique key value(s).
Vivek Gadwal
Experience is what you get when you didn't get what you wanted
Experience is what you get when you didn't get what you wanted
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Unless I'm completely misunderstanding this, as Ray noted this should not be an issue. That functionality would be basically worthless if it worked like you're being told that it works.vivekgadwal wrote:I have read it somewhere in one of the blogs and also I learnt it from one of the folks I talked to here. The explanation that I gathered is, as it is the same process and the same state file is accessed within it, it will produce duplicate values.chulett wrote:What makes you say that?vivekgadwal wrote:(as I cannot use it within the same job and expect a unique number)
Please correct me if this is incorrect.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
-
- Premium Member
- Posts: 457
- Joined: Tue Sep 25, 2007 4:05 pm
I will test this and let you know. Perhaps, that is the best way, as Ray notedchulett wrote: Unless I'm completely misunderstanding this, as Ray noted this should not be an issue. That functionality would be basically worthless if it worked like you're being told that it works.
Vivek Gadwal
Experience is what you get when you didn't get what you wanted
Experience is what you get when you didn't get what you wanted
-
- Premium Member
- Posts: 457
- Joined: Tue Sep 25, 2007 4:05 pm
I am reviving this topic again, as I have a couple of questions with regards to the state files which are impeding performance for some jobs.
Scenario:
I saw in the state file that it is storing ranges of values that are not yet generated.
In our shop, the state file block size is set to be 500,000 for all the jobs (different set of jobs than the one I mentioned in the beginning of this topic). There are two specific jobs that use two different state files (Settings are the same) which I would like to compare. One job has load volume = 50,000 to 100,000 per day and the other has load volume = 5,000 to 7,000 per day. The former job runs in a few seconds to a minute, but the latter runs about 10-15 minutes. Most of the logic is comparable.
When I look up the file sizes on Unix, the state file size for the first job is "64" and the other is "5424" (approx. sizes). I monitored the jobs when they are running and the bottleneck seems to be at the Surrogate key generator stage.
This bring me to the questions:-
Do the keys from the "holes" within the state file (block size related) get used in the next run?
for eg. assuming a two node system...
Is it how this works? Or...
In the second case, there will be huge blocks of keys that won't be used. I am not very clear on this concept.
I am looking to increase the performance of the above mentioned job. How can I do that w.r.t the state file?
Scenario:
I saw in the state file that it is storing ranges of values that are not yet generated.
In our shop, the state file block size is set to be 500,000 for all the jobs (different set of jobs than the one I mentioned in the beginning of this topic). There are two specific jobs that use two different state files (Settings are the same) which I would like to compare. One job has load volume = 50,000 to 100,000 per day and the other has load volume = 5,000 to 7,000 per day. The former job runs in a few seconds to a minute, but the latter runs about 10-15 minutes. Most of the logic is comparable.
When I look up the file sizes on Unix, the state file size for the first job is "64" and the other is "5424" (approx. sizes). I monitored the jobs when they are running and the bottleneck seems to be at the Surrogate key generator stage.
This bring me to the questions:-
Do the keys from the "holes" within the state file (block size related) get used in the next run?
for eg. assuming a two node system...
Code: Select all
Run 1: 1-2500 (block 1: 1 - 500,000)
500001-502500 (block 2)
Run 2: 2501-5000 (block 1)
5002501-5005000 (block 2)
Code: Select all
Run 1: 1-2500 (block 1: 1 - 500,000)
500001-502500 (block 2)
Run 2: 1000001-1002500 (block 1)
1500001-1502500 (block 2)
I am looking to increase the performance of the above mentioned job. How can I do that w.r.t the state file?
Vivek Gadwal
Experience is what you get when you didn't get what you wanted
Experience is what you get when you didn't get what you wanted