Generate random/hash identifier
I need to generate a random/hash identifier from the incoming key fields that can be used as a unique identifier in the database.
Does Ascential have any function or stage that will create unique hash values for the input?
Thanks in advance.
Re: Generate random/hash identifier
Use stage variables in a Transformer stage.
Create a stage variable with whatever default (seed) value you want.
Then set it to its own value + 1.
Every row will spit out an increasing number.
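For readers outside DataStage, the stage-variable counter described above can be sketched in Python (the `row_counter` helper is illustrative, not a DataStage API):

```python
def row_counter(seed=0):
    """Mimic a Transformer stage variable: seed once, then add 1 per row."""
    value = seed

    def next_key():
        nonlocal value
        value += 1          # stage variable = stage variable + 1
        return value

    return next_key

next_key = row_counter(1000)           # seed with any default value
keys = [next_key() for _ in range(3)]  # -> [1001, 1002, 1003]
```

This only yields unique values within a single sequential process, which is exactly the limitation discussed below.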
-T.J.
Last edited by Teej on Mon Dec 01, 2003 11:39 am, edited 1 time in total.
Developer of DataStage Parallel Engine (Orchestrate).
One more reason I want a unique hash generator is that I need to run more than one job inserting into the table simultaneously (data from two different sources). With a hash generator this would not be a problem, but with a single ascending sequence there would be duplicates. Can anybody shed some light on the options available in DataStage for this?
Thanks much!
Ken Bland has spoken about this very thing via this forum (and given examples) many times so you should search the forum for posts by Ken. His solutions are simple and elegant for handling exactly what you want to do.
Regards,
Michael Hester
Mike Hester
mhester@petra-ps.com
Anytime you have multiple independent processes creating output for the same target table, you will have to deal with surrogate key assignment.
First of all, you have to define more clearly the type of loads that are taking place. Do you have data in two different processes that could generate the same row, so the first in is the insert and the second is the update? Or, is your data fully discrete and you simply need a unique identifier?
Database performance sucks if you are using a database to assign a surrogate key, using something like a serial column or a sequence generator. This is because you have the maintenance of creating the sequence, and this also puts a congestion point around the sequence as multiple parallel processes within PX will be fighting over it.
Your preferred choice should be to select the maximum surrogate key in use, and then have your jobstream start assignments from there. Since each node has an identifier, you have to figure out how, in PX, to pass the starting surrogate value to each partitioned process. In Server jobs this is quite easy, as you use job instances and have job parameters at that level. But in PX, each transformer will need to know the seed value and its own node number. For example, if you selected MAX on your table and got the number 1000000, you could pass that to your transformer. Each transformer could assign the next available number from there. But if you have 16 simultaneous nodes doing parallel assignment, you have to use the node number in the assignment expression to make sure each node assigns unique numbers. You will have some unused numbers, but this is the general theory.
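The seed-plus-node-number scheme can be sketched as a simple interleaving formula (the names here are illustrative, not PX expression syntax): each of N nodes starts at seed + node + 1 and steps by N, so the nodes' key sets never overlap.

```python
def node_key(seed, node, num_nodes, row_index):
    """Key for the row_index-th row (0-based) on a given node; all keys > seed."""
    return seed + node + 1 + row_index * num_nodes

seed, num_nodes = 1000000, 2
keys = sorted(node_key(seed, node, num_nodes, row)
              for node in range(num_nodes)
              for row in range(3))
# node 0 assigns 1000001, 1000003, 1000005; node 1 assigns 1000002, 1000004, 1000006
```

If one node processes fewer rows than another, its unassigned slots simply become the "unused numbers" mentioned above.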
However, your situation is two or more independent job streams doing surrogate key assignments. My pizza just arrived, so I'll continue after that. :lol:
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
I have loads from two separate companies, so they are totally independent. My requirement is to generate unique numbers for each record. But the target table is the same, and I have three separate jobs doing inserts into it simultaneously, so they must not generate the same sequence number. Each of them has to get a unique value.
I have used the technique of selecting the max value and incrementing it in a transformer in a parallel job before. There I set that particular transformer stage to run sequentially rather than in parallel, so the generated sequence numbers were unique and I did not have to worry about the nodes. But here the problem is that I have two separate jobs!
So this is solely for SK generation, and the numbers need only be unique?
Here's one way: for two jobs, assign one the range (0..maxint/2) and the other the range (maxint/2+1..maxint). Get the max SK in each range for each job before starting. For three or four jobs, split the ranges as needed.
Remember to detect the upper SK limit and barf with extreme prejudice.
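A sketch of this range-splitting idea, including the "barf at the upper limit" check (MAXINT here assumes a signed 32-bit key column; adjust for your column type, and note the function names are illustrative):

```python
MAXINT = 2**31 - 1  # assumed 32-bit signed key column

def job_range(job_index, num_jobs):
    """Disjoint key range [low, high] for one of num_jobs independent jobs."""
    width = (MAXINT + 1) // num_jobs
    low = job_index * width
    high = MAXINT if job_index == num_jobs - 1 else low + width - 1
    return low, high

def next_key(current_max_in_range, high):
    """Next key for a job; fail loudly when the job's range is exhausted."""
    key = current_max_in_range + 1
    if key > high:
        raise OverflowError("surrogate key range exhausted for this job")
    return key

low_a, high_a = job_range(0, 2)  # job A: 0 .. 1073741823
low_b, high_b = job_range(1, 2)  # job B: 1073741824 .. 2147483647
```

Each job runs `SELECT MAX(sk)` restricted to its own range at startup, then increments locally from there.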
Carter
Okay, well, enough pizza for now.
Your problem is that you cannot select max(surrogate_key) from table because another job could be in the middle of processing data.
I had another client with a billion-row table that also had multiple, independent jobs that needed to load the fact table. What we did was create a table to hold the last surrogate key for each table used. For example, the column TABLE_NAME was the primary key and the column LAST_SKEY contained the max value for that table. Any time a job stream needed to assign surrogates for its dataset, the seed value was retrieved from this table: you'd lock the row for the table you're interested in, read the value, update back into that row the value plus the row count of the dataset you're about to process, then release the lock.
What this does is jump the sequencing past the row count of the set you are going to process. It allows you to assign none, some, or all of the values in that range. You now have multiple job streams that can run simultaneously, but the only serialized portion of the process is the one second at startup where the seed value is locked, fetched, updated, and released.
Now, you just have to get over having gaps in your sequencing. BUT THIS IS ABSOLUTELY ACCEPTABLE! Surrogate keys are meaningless, they have no inherent value to the end user. Do not attach meaning to them, especially from a gap perspective.
I like this table approach over sequences because the table is a more visible object and is easier to maintain.
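A sketch of the seed table and block reservation described above. SQLite is used here only so the example is self-contained and runnable; on a real database you would use SELECT ... FOR UPDATE (or the equivalent row lock) so the read-and-bump is atomic across connections. The table and column names follow the post; the function name is illustrative.

```python
import sqlite3

def reserve_block(conn, table_name, row_count):
    """Reserve row_count keys: read LAST_SKEY, bump it, return the old value.

    The `with conn:` block wraps the read and update in one transaction,
    standing in for the lock/fetch/update/release sequence described above.
    """
    with conn:
        cur = conn.execute(
            "SELECT LAST_SKEY FROM KEY_SEED WHERE TABLE_NAME = ?", (table_name,))
        seed = cur.fetchone()[0]
        conn.execute(
            "UPDATE KEY_SEED SET LAST_SKEY = ? WHERE TABLE_NAME = ?",
            (seed + row_count, table_name))
    return seed  # this job now owns seed+1 .. seed+row_count

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE KEY_SEED (TABLE_NAME TEXT PRIMARY KEY, LAST_SKEY INTEGER)")
conn.execute("INSERT INTO KEY_SEED VALUES ('FACT_SALES', 1000000)")
seed = reserve_block(conn, "FACT_SALES", 500)  # this run owns 1000001 .. 1000500
```

If a job reserves 500 keys but only loads 300 rows, the remaining 200 are simply never used, which produces the gaps the post says are absolutely acceptable.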
Kenneth Bland
kcbland wrote: You'd lock the row for the table you're interested in, read the value, and update back into that row the value plus the row count of the dataset you're about to process, then release the lock.
Ken, your solution seems quite good and I think it would perform quickly. Just one question: when you say lock the row and read, are you talking about something like a SELECT ... FOR UPDATE, or something else? (I guess the syntax/mechanism depends on the database flavour.) Obviously, if you don't lock, there is a chance that another process may update in between the first process's read and update.
like a tiger
How about a two-part identifier?
The first part could be simply "A" and "B" (or similar).
The second part can then be a simple sequence in each case.
Or am I missing something here?
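This two-part scheme can be sketched as follows (the `make_key` helper and the zero-padding width are illustrative choices, not part of the suggestion):

```python
def make_key(job_prefix, sequence):
    """Compose a unique identifier from a per-job prefix and a local sequence."""
    return f"{job_prefix}{sequence:09d}"

# Jobs "A" and "B" keep fully independent sequences yet can never collide:
key_a = make_key("A", 1)  # 'A000000001'
key_b = make_key("B", 1)  # 'B000000001'
```

The trade-off is a character key rather than a purely numeric surrogate, which some dimensional modellers avoid, but uniqueness across jobs falls out for free.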
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.