Question regarding Surrogate Key / State Files

vivekgadwal · Post by **vivekgadwal** » Mon Dec 21, 2009 2:50 pm

Gurus,

This post is with regards to Surrogate Key / Sequence generation and State files.

Scenario:
I have an bunch of dimension tables (10-12) which have a Sequence number generated (Block size of 100,000). In the target, the field datatype is Decimal( 18 ).
The requirement is to generate an 18-digit number, which should not be repeated anywhere else within those dimension tables. I mean, if a Surrogate/Sequence number is generated, it should not be generated again for any other table.

FYI...data is already present in those tables. The existing logic (in a different tool) is to generate an integer and add an 18 digit number (5041000...00) to that. Also, this should be generated in 100000 blocks. I mean, if I generated numbers from 1 to 200 today, tomorrow it should generate from 100001 and so on {I had an earlier post about this...which had a glibc error...that got resolved}

My solution:
Load all the existing Surrogate/Sequence numbers into one state file, so that the existing numbers won't get generated again. The same state file should be used throughout all the jobs (should be run in sequence, of course).

Questions:
> Some tables have more than 25 million rows in them. So, the state file with all those ID values might contain more than 70-80 million IDs. Is this a good approach to ensure the numbers generated are unique through out the database as I am concerned that the performance of the job that utilizes this state file will be really slow.
> How does state file store the IDs that I insert? As ranges or as individual numbers?
> I noticed that we can call a state file from Transformer. Does the state file get updated with the IDs generated within this process once the job is finished?
> How do I ensure that the numbers are generated in a sequence (not in this scenario, but in general)?

Thanks in advance for any inputs on this...

chulett · Post by **chulett** » Mon Dec 21, 2009 3:17 pm

I would imagine that a state file holds the same information that a sequence or other 'generator' type object would in a database - the most recent value.

vivekgadwal · Post by **vivekgadwal** » Mon Dec 21, 2009 3:57 pm

chulett wrote:I would imagine that a state file holds the same information that a sequence or other 'generator' type object would in a database - the most recent value.

I was under the same impression too. But, I saw some jobs that were existing in this shop that have all the IDs loaded (for them, the ID field is not unique...so they were loading DISTINCT values into the state file). When inquired, they told me that this way, DataStage ensures that duplicates are not issued by reading the state file, and that State file stores all the numbers to achieve this. Unfortunately, they are not here for the next month (holidays) and I came across this issue very recently.

Also, from the documentation for the Surrogate Key generator stage as well as reading some posts here, I understand that the numbers produced will not be in sequence. That is the reason why I got those questions in my mind while designing the process!

chulett · Post by **chulett** » Mon Dec 21, 2009 4:19 pm

They will be in sequence, what they may not be is in a contiguous sequence - i.e. there may be "holes" in the sequence but that's not an issue for a surrogate key.

Think about it - there's no way the stage is checking that entire list every time it issues a number to ensure it doesn't re-issue a old number. It "knows" it shouldn't be issuing duplicates because it always takes the highest / last number issued and increments it by 1. The only way that doesn't work is if you lie to it and improperly initialize the state file. That or don't handle concurrency properly.

vivekgadwal · Post by **vivekgadwal** » Mon Dec 21, 2009 5:37 pm

chulett wrote:They will be in sequence, what they may not be is in a contiguous sequence - i.e. there may be "holes" in the sequence but that's not an issue for a surrogate key.

Think about it - there's no way the stage is checking that entire list every time it issues a number to ensure it doesn't re-issue a old number. It "knows" it shouldn't be issuing duplicates because it always takes the highest / last number issued and increments it by 1. The only way that doesn't work is if you lie to it and improperly initialize the state file. That or don't handle concurrency properly.

Thanks for the response.

Yes, what you said makes sense. I guess, it will store the numbers in a particular way or store it in a range so that it will not output any numbers within that range. Hence, my earlier question about the storage.

Is there a way to make it contiguous? What if there is a requirement to generate a sequence number which is contiguous, like 1001, 1002...? How to make this sequence number unique wherever the state file is called? Let me put it this way...if I have a requirement exactly the same as I have right now, but they need a sequence number instead of a surrogate value?

ray.wurlod · Post by **ray.wurlod** » Mon Dec 21, 2009 5:57 pm

A sequential number can only be guaranteed with sequential mode operation. "They" can have what they want, provided they don't also want minimum execution time.

vivekgadwal · Post by **vivekgadwal** » Mon Dec 21, 2009 6:04 pm

ray.wurlod wrote:A sequential number can only be guaranteed with sequential mode operation. "They" can have what they want, provided they don't also want minimum execution time.

Thanks Ray.

One last question: When a state file is called from a Transformer and NOT from a Surrogate Key Generator stage, will the file be updated back so that the next time I use it in a different job (as I cannot use it within the same job and expect a unique number), it will generate unique surrogate values?

chulett · Post by **chulett** » Mon Dec 21, 2009 6:28 pm

vivekgadwal wrote:(as I cannot use it within the same job and expect a unique number)

What makes you say that?

ray.wurlod · Post by **ray.wurlod** » Mon Dec 21, 2009 6:36 pm

Provided that it's not re-initialized, a state file will always yield unique key value(s).

vivekgadwal · Post by **vivekgadwal** » Mon Dec 21, 2009 6:41 pm

chulett wrote:
vivekgadwal wrote:(as I cannot use it within the same job and expect a unique number)
What makes you say that?

Craig,

I have read it somewhere in one of the blogs and also I learnt it from one of the folks I talked to here. The explanation that I gathered is, as it is the same process and the same state file is accessed within it, it will produce duplicate values.

Please correct me if this is incorrect.

vivekgadwal · Post by **vivekgadwal** » Mon Dec 21, 2009 6:42 pm

ray.wurlod wrote:Provided that it's not re-initialized, a state file will always yield unique key value(s).

Thanks Ray. Then may be, I can use the state file from within the Transformer without explicitly using Surrogate Key Generator stage within the jobs. Of course, I realize that the initialization of the state file has to take place through the SK Generator stage

ray.wurlod · Post by **ray.wurlod** » Mon Dec 21, 2009 7:07 pm

I would certainly say so (precluding any bugs, of which there were some in version 8.0). Try it, and let us know.

chulett · Post by **chulett** » Mon Dec 21, 2009 7:23 pm

vivekgadwal wrote:
chulett wrote:
vivekgadwal wrote:(as I cannot use it within the same job and expect a unique number)
What makes you say that?
I have read it somewhere in one of the blogs and also I learnt it from one of the folks I talked to here. The explanation that I gathered is, as it is the same process and the same state file is accessed within it, it will produce duplicate values.

Please correct me if this is incorrect.

Unless I'm completely misunderstanding this, as Ray noted this should not be an issue. That functionality would be basically worthless if it worked like you're being told that it works.

vivekgadwal · Post by **vivekgadwal** » Mon Dec 21, 2009 8:06 pm

chulett wrote: Unless I'm completely misunderstanding this, as Ray noted this should not be an issue. That functionality would be basically worthless if it worked like you're being told that it works.

I will test this and let you know. Perhaps, that is the best way, as Ray noted

vivekgadwal · Post by **vivekgadwal** » Tue Nov 09, 2010 8:44 am

I am reviving this topic again, as I have a couple of questions with regards to the state files which are impeding performance for some jobs.

Scenario:
I saw in the state file that it is storing ranges of values that are not yet generated.

In our shop, the state file block size is set to be 500,000 for all the jobs (different set of jobs than the one I mentioned in the beginning of this topic). There are two specific jobs that use two different state files (Settings are the same) which I would like to compare. One job has load volume = 50,000 to 100,000 per day and the other has load volume = 5,000 to 7,000 per day. The former job runs in a few seconds to a minute, but the latter runs about 10-15 minutes. Most of the logic is comparable.

When I look up the file sizes on Unix, the state file size for the first job is "64" and the other is "5424" (approx. sizes). I monitored the jobs when they are running and the bottleneck seems to be at the Surrogate key generator stage.

This bring me to the questions:-

Do the keys from the "holes" within the state file (block size related) get used in the next run?
for eg. assuming a two node system...

Code: Select all

Run 1: 1-2500 (block 1: 1 - 500,000)
       500001-502500 (block 2)
Run 2: 2501-5000 (block 1)
       5002501-5005000 (block 2)

Is it how this works? Or...

Code: Select all

Run 1: 1-2500 (block 1: 1 - 500,000)
       500001-502500 (block 2)
Run 2: 1000001-1002500 (block 1)
       1500001-1502500 (block 2)

In the second case, there will be huge blocks of keys that won't be used. I am not very clear on this concept.

I am looking to increase the performance of the above mentioned job. How can I do that w.r.t the state file?