Question regarding Surrogate Key / State Files

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Question regarding Surrogate Key / State Files

Post by vivekgadwal »

Gurus,

This post is with regards to Surrogate Key / Sequence generation and State files.

Scenario:
I have an bunch of dimension tables (10-12) which have a Sequence number generated (Block size of 100,000). In the target, the field datatype is Decimal( 18 ).
The requirement is to generate an 18-digit number, which should not be repeated anywhere else within those dimension tables. I mean, if a Surrogate/Sequence number is generated, it should not be generated again for any other table.

FYI...data is already present in those tables. The existing logic (in a different tool) is to generate an integer and add an 18 digit number (5041000...00) to that. Also, this should be generated in 100000 blocks. I mean, if I generated numbers from 1 to 200 today, tomorrow it should generate from 100001 and so on {I had an earlier post about this...which had a glibc error...that got resolved}

My solution:
Load all the existing Surrogate/Sequence numbers into one state file, so that the existing numbers won't get generated again. The same state file should be used throughout all the jobs (should be run in sequence, of course).

Questions:
> Some tables have more than 25 million rows in them. So, the state file with all those ID values might contain more than 70-80 million IDs. Is this a good approach to ensure the numbers generated are unique through out the database as I am concerned that the performance of the job that utilizes this state file will be really slow.
> How does state file store the IDs that I insert? As ranges or as individual numbers?
> I noticed that we can call a state file from Transformer. Does the state file get updated with the IDs generated within this process once the job is finished?
> How do I ensure that the numbers are generated in a sequence (not in this scenario, but in general)?

Thanks in advance for any inputs on this...
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I would imagine that a state file holds the same information that a sequence or other 'generator' type object would in a database - the most recent value.
-craig

"You can never have too many knives" -- Logan Nine Fingers
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

chulett wrote:I would imagine that a state file holds the same information that a sequence or other 'generator' type object would in a database - the most recent value.
I was under the same impression too. But, I saw some jobs that were existing in this shop that have all the IDs loaded (for them, the ID field is not unique...so they were loading DISTINCT values into the state file). When inquired, they told me that this way, DataStage ensures that duplicates are not issued by reading the state file, and that State file stores all the numbers to achieve this. Unfortunately, they are not here for the next month (holidays) and I came across this issue very recently.

Also, from the documentation for the Surrogate Key generator stage as well as reading some posts here, I understand that the numbers produced will not be in sequence. That is the reason why I got those questions in my mind while designing the process! :)
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

They will be in sequence, what they may not be is in a contiguous sequence - i.e. there may be "holes" in the sequence but that's not an issue for a surrogate key.

Think about it - there's no way the stage is checking that entire list every time it issues a number to ensure it doesn't re-issue a old number. It "knows" it shouldn't be issuing duplicates because it always takes the highest / last number issued and increments it by 1. The only way that doesn't work is if you lie to it and improperly initialize the state file. That or don't handle concurrency properly.
-craig

"You can never have too many knives" -- Logan Nine Fingers
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

chulett wrote:They will be in sequence, what they may not be is in a contiguous sequence - i.e. there may be "holes" in the sequence but that's not an issue for a surrogate key.

Think about it - there's no way the stage is checking that entire list every time it issues a number to ensure it doesn't re-issue a old number. It "knows" it shouldn't be issuing duplicates because it always takes the highest / last number issued and increments it by 1. The only way that doesn't work is if you lie to it and improperly initialize the state file. That or don't handle concurrency properly.
Thanks for the response.

Yes, what you said makes sense. I guess, it will store the numbers in a particular way or store it in a range so that it will not output any numbers within that range. Hence, my earlier question about the storage.

Is there a way to make it contiguous? What if there is a requirement to generate a sequence number which is contiguous, like 1001, 1002...? How to make this sequence number unique wherever the state file is called? Let me put it this way...if I have a requirement exactly the same as I have right now, but they need a sequence number instead of a surrogate value?
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

A sequential number can only be guaranteed with sequential mode operation. "They" can have what they want, provided they don't also want minimum execution time.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

ray.wurlod wrote:A sequential number can only be guaranteed with sequential mode operation. "They" can have what they want, provided they don't also want minimum execution time.
Thanks Ray.

One last question: When a state file is called from a Transformer and NOT from a Surrogate Key Generator stage, will the file be updated back so that the next time I use it in a different job (as I cannot use it within the same job and expect a unique number), it will generate unique surrogate values?
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

vivekgadwal wrote:(as I cannot use it within the same job and expect a unique number)
What makes you say that? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Provided that it's not re-initialized, a state file will always yield unique key value(s).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

chulett wrote:
vivekgadwal wrote:(as I cannot use it within the same job and expect a unique number)
What makes you say that? :?
Craig,

I have read it somewhere in one of the blogs and also I learnt it from one of the folks I talked to here. The explanation that I gathered is, as it is the same process and the same state file is accessed within it, it will produce duplicate values.

Please correct me if this is incorrect.
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

ray.wurlod wrote:Provided that it's not re-initialized, a state file will always yield unique key value(s).
Thanks Ray. Then may be, I can use the state file from within the Transformer without explicitly using Surrogate Key Generator stage within the jobs. Of course, I realize that the initialization of the state file has to take place through the SK Generator stage :)
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I would certainly say so (precluding any bugs, of which there were some in version 8.0). Try it, and let us know.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

vivekgadwal wrote:
chulett wrote:
vivekgadwal wrote:(as I cannot use it within the same job and expect a unique number)
What makes you say that? :?
I have read it somewhere in one of the blogs and also I learnt it from one of the folks I talked to here. The explanation that I gathered is, as it is the same process and the same state file is accessed within it, it will produce duplicate values.

Please correct me if this is incorrect.
Unless I'm completely misunderstanding this, as Ray noted this should not be an issue. That functionality would be basically worthless if it worked like you're being told that it works. :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

chulett wrote: Unless I'm completely misunderstanding this, as Ray noted this should not be an issue. That functionality would be basically worthless if it worked like you're being told that it works. :?
I will test this and let you know. Perhaps, that is the best way, as Ray noted :)
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
vivekgadwal
Premium Member
Premium Member
Posts: 457
Joined: Tue Sep 25, 2007 4:05 pm

Post by vivekgadwal »

I am reviving this topic again, as I have a couple of questions with regards to the state files which are impeding performance for some jobs.

Scenario:
I saw in the state file that it is storing ranges of values that are not yet generated.

In our shop, the state file block size is set to be 500,000 for all the jobs (different set of jobs than the one I mentioned in the beginning of this topic). There are two specific jobs that use two different state files (Settings are the same) which I would like to compare. One job has load volume = 50,000 to 100,000 per day and the other has load volume = 5,000 to 7,000 per day. The former job runs in a few seconds to a minute, but the latter runs about 10-15 minutes. Most of the logic is comparable.

When I look up the file sizes on Unix, the state file size for the first job is "64" and the other is "5424" (approx. sizes). I monitored the jobs when they are running and the bottleneck seems to be at the Surrogate key generator stage.

This bring me to the questions:-
:?: Do the keys from the "holes" within the state file (block size related) get used in the next run?
for eg. assuming a two node system...

Code: Select all

Run 1: 1-2500 (block 1: 1 - 500,000)
       500001-502500 (block 2)
Run 2: 2501-5000 (block 1)
       5002501-5005000 (block 2)
Is it how this works? Or...

Code: Select all

Run 1: 1-2500 (block 1: 1 - 500,000)
       500001-502500 (block 2)
Run 2: 1000001-1002500 (block 1)
       1500001-1502500 (block 2)
In the second case, there will be huge blocks of keys that won't be used. I am not very clear on this concept.

:?: I am looking to increase the performance of the above mentioned job. How can I do that w.r.t the state file?
Vivek Gadwal

Experience is what you get when you didn't get what you wanted
Post Reply