Generate random/hash identifier

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

dsuser1
Participant
Posts: 14
Joined: Thu Oct 16, 2003 5:58 pm

Generate random/hash identifier

Post by dsuser1 »

I need to generate a random/hash identifier for the incoming key fields that can be used as a unique identifier in the database.

Does Ascential have any function or stage that will create unique hash values for the input?

Thanks in advance.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: Generate random/hash identifier

Post by Teej »

Use stage variables in a Transformer stage.

Create a stage variable with its initial value set to whatever you want.

Then set its derivation to the variable's name + 1.

Every row will spit out an increasing number.
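The same idea in a quick Python sketch (the function name here is made up; in the Transformer this is a stage variable derivation, not code):

```python
# Rough Python analogue of the stage-variable trick: keep a running
# counter and bump it once per row, the way a derivation of
# "svCounter + 1" would in a Transformer stage.

def assign_row_numbers(rows, start=1):
    """Yield (row_number, row) pairs, mimicking an incrementing stage variable."""
    counter = start
    for row in rows:
        yield counter, row
        counter += 1

numbered = list(assign_row_numbers(["a", "b", "c"]))
# numbered == [(1, "a"), (2, "b"), (3, "c")]
```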

-T.J.
Last edited by Teej on Mon Dec 01, 2003 11:39 am, edited 1 time in total.
Developer of DataStage Parallel Engine (Orchestrate).
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

A caveat with this solution: partitioning. You will need to run this stage sequentially, not in parallel.

-T.J.
dsuser1
Participant
Posts: 14
Joined: Thu Oct 16, 2003 5:58 pm

Post by dsuser1 »

If I don't want to use a one-up sequence number, is there a function (a hash function) in DataStage that can give me a unique value?
dsuser1
Participant
Posts: 14
Joined: Thu Oct 16, 2003 5:58 pm

Post by dsuser1 »

One more reason I want a unique hash generator is that I need to run more than one job simultaneously inserting into the same table (data from two different sources). If I use a hash generator this will not be a problem, but if I use a one-up sequence there will be duplicates. Can anybody shed some light on which options are available in DataStage for this?

Thanks much!
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Ken Bland has spoken about this very thing via this forum (and given examples) many times so you should search the forum for posts by Ken. His solutions are simple and elegant for handling exactly what you want to do.

Regards,

Michael Hester
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Anytime you have multiple independent processes creating output for the same target table, you will have to deal with surrogate key assignment.

First of all, you have to define more clearly the type of loads that are taking place. Do you have data in two different processes that could generate the same row, so the first in is the insert and the second is the update? Or, is your data fully discrete and you simply need a unique identifier?

Database performance sucks if you are using a database to assign a surrogate key, using something like a serial column or a sequence generator. This is because you have the maintenance of creating the sequence, and this also puts a congestion point around the sequence as multiple parallel processes within PX will be fighting over it.

Your preferred choice should be to select the maximum surrogate key in use, and then have your job stream start assignments from there. Since each node has an identifier, you have to figure out how, in PX, to pass the starting surrogate value to each partitioned process. In Server jobs this is quite easy, as you use job instances and you have job parameters at that level. But in PX, each transformer will need to be aware of the seed value and of its own node number. For example, if you selected MAX on your table and got the number 1000000, you could pass that to your transformer, and each transformer could assign the next available number from there. But if you have 16 simultaneous nodes doing parallel assignment, you have to use the node number in the assignment expression to make sure each node assigns unique numbers. You will have some unused numbers, but this is the general theory.
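A rough Python sketch of the seed-plus-node-number arithmetic Ken describes (function names here are hypothetical; in real PX this would live in a Transformer derivation): each of N nodes starts at the seed plus its node number and strides by N, so no two nodes can ever hand out the same key.

```python
# Interleaved key assignment across partitions: node n of N assigns
# seed + n, seed + n + N, seed + n + 2N, ...  Gaps are expected and fine.

def node_keys(seed, node, num_nodes, how_many):
    """Keys assigned by one node: seed + node, then striding by num_nodes."""
    return [seed + node + i * num_nodes for i in range(how_many)]

# With seed = 1000000 (the MAX fetched from the table) and 4 nodes,
# pool the first 3 keys each node would assign:
all_keys = [k for n in range(4) for k in node_keys(1_000_000, n + 1, 4, 3)]
assert len(set(all_keys)) == len(all_keys)  # no collisions across nodes
```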

However, your situation is two or more independent job streams doing surrogate key assignments. My pizza just arrived, so I'll continue after that. :lol:
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
dsuser1
Participant
Posts: 14
Joined: Thu Oct 16, 2003 5:58 pm

Post by dsuser1 »

I have loads from two separate companies, so they are totally independent. My requirement is to generate unique numbers for each record, but the target table is the same, and I have three separate jobs doing the inserts to the target table simultaneously. Hence they must not get the same sequence number; each of them has to get a unique value.

I have used the technique of selecting the max value and incrementing it in a transformer before, in a parallel job. There I set that particular Transformer stage to run sequentially rather than in parallel, so that the generated sequence numbers were unique and I didn't have to worry about the nodes. But here the problem is that I have two separate jobs!
clshore
Charter Member
Charter Member
Posts: 115
Joined: Tue Oct 21, 2003 11:45 am

Post by clshore »

So this is solely for SK generation, and the numbers need only be unique?
Here's one way: for two jobs, assign one the range (0..maxint/2) and the other the range (maxint/2+1..maxint). Get the max SK in each range for each job before starting. For 3 jobs, or 4, assign ranges as needed.
Remember to detect the upper SK limit and barf with extreme prejudice.
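A sketch of this range-carving idea in Python (names are made up), including the "barf at the ceiling" check:

```python
import sys

MAXINT = 2**31 - 1  # typical signed 32-bit ceiling for an SK column

def job_range(job_index, num_jobs, maxint=MAXINT):
    """Carve maxint into num_jobs disjoint key ranges; job_index is 0-based."""
    size = maxint // num_jobs
    low = job_index * size
    # The last job absorbs any remainder from the integer division.
    high = maxint if job_index == num_jobs - 1 else low + size - 1
    return low, high

def next_key(current, high):
    """Bump the key, and bail out hard if the range is exhausted."""
    if current >= high:
        sys.exit("surrogate key range exhausted -- aborting load")
    return current + 1
```

Each job would first select its max SK within its own range, then count up from there with `next_key`.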

Carter
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Okay, well, enough pizza for now.

Your problem is that you cannot select max(surrogate_key) from table because another job could be in the middle of processing data.

I had another client with a billion-row table that also had multiple independent jobs that needed to load the fact table. What we did was create a table to hold the last surrogate key for each table used. For example, the column TABLE_NAME was the primary key and the column LAST_SKEY contained the max value for that table. Any time a job stream needed to assign surrogates for its dataset, the seed value was retrieved from this table. You'd lock the row for the table you're interested in, read the value, and update back into that row the value plus the row count of the dataset you're about to process, then release the lock.

What this does is jump the sequencing over the row count of the set you are going to process. It allows you to assign none, small, or all of the values in that range. You now have multiple job streams that can simultaneously run, but the only serialized portion of the process is the 1 second at startup where the seed value is locked, fetched, updated, and released.
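A minimal sketch of this counter-table pattern, here in Python with an in-memory SQLite database (table and column names follow Ken's description; a real warehouse would use proper row locking such as SELECT ... FOR UPDATE rather than SQLite's whole-database transaction lock):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE key_counter (table_name TEXT PRIMARY KEY, last_skey INTEGER)"
)
conn.execute("INSERT INTO key_counter VALUES ('FACT_SALES', 1000000)")
conn.commit()

def reserve_block(conn, table_name, row_count):
    """Lock the counter row, read it, bump it by row_count, return the seed."""
    with conn:  # one transaction; serializes concurrent reservers
        (seed,) = conn.execute(
            "SELECT last_skey FROM key_counter WHERE table_name = ?",
            (table_name,),
        ).fetchone()
        conn.execute(
            "UPDATE key_counter SET last_skey = last_skey + ? WHERE table_name = ?",
            (row_count, table_name),
        )
    return seed  # this job assigns seed+1 .. seed+row_count

seed_a = reserve_block(conn, "FACT_SALES", 500)  # job A gets 1000001..1000500
seed_b = reserve_block(conn, "FACT_SALES", 200)  # job B gets 1000501..1000700
```

The only serialized moment is the short transaction inside `reserve_block`; after that, each job stream assigns keys from its reserved block with no further coordination.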

Now, you just have to get over having gaps in your sequencing. BUT THIS IS ABSOLUTELY ACCEPTABLE! Surrogate keys are meaningless, they have no inherent value to the end user. Do not attach meaning to them, especially from a gap perspective.

I like this table approach over sequences because the table is a more visible object and is easier to maintain.
Kenneth Bland

rabs
Charter Member
Charter Member
Posts: 21
Joined: Thu May 02, 2002 5:27 pm

Post by rabs »

kcbland wrote:You'd lock the row for the table you're interested in, read the value, and update back into that row the value plus the row count of the dataset you're about to process, then release the lock.
Ken, your solution seems quite good, and I think it would perform quickly. Just one question: when you say lock the row and read, are you talking about something like a SELECT FOR UPDATE, or something else (I guess the syntax/mechanism depends on the DB flavour)? Obviously, if you don't lock, there is a chance that another process may update in between the first process's read and update.
like a tiger
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You've got it correct.
Kenneth Bland

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How about a two-part identifier?
The first part could be simply "A" and "B" (or similar).
The second part can then be a simple sequence in each case.
Or am I missing something here?
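A quick sketch of this two-part idea in Python (the padding width is arbitrary): the source prefix guarantees the two feeds can never collide, so each feed only has to keep its own sequence unique.

```python
# Composite identifier: a per-source letter plus a zero-padded
# per-source sequence number.

def two_part_id(source_prefix, sequence):
    """Build an identifier like 'A000000001' from a prefix and a sequence."""
    return f"{source_prefix}{sequence:09d}"

a1 = two_part_id("A", 1)  # first row from feed A
b1 = two_part_id("B", 1)  # first row from feed B -- same sequence, no clash
assert a1 != b1
```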
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.