Generate random/hash identifier
I need to generate a random/hash identifier from the incoming key fields that can be used as a unique identifier in the database.
Does Ascential have any function or stage that will create unique hash values for the input?
Thanks in advance.
Re: Generate random/hash identifier
Use stage variables in a Transformer stage.
Create a stage variable with whatever default (seed) value you want.
Then set it to its own value + 1.
Every row will spit out an increasing number.
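For readers outside DataStage, the stage-variable counter described above can be sketched in Python (the `row_counter` helper is illustrative, not a DataStage API):

```python
def row_counter(seed=0):
    """Mimic a Transformer stage variable: seed once, then add 1 per row."""
    value = seed

    def next_key():
        nonlocal value
        value += 1          # stage variable = stage variable + 1
        return value

    return next_key

next_key = row_counter(1000)           # seed with any default value
keys = [next_key() for _ in range(3)]  # -> [1001, 1002, 1003]
```

This only yields unique values within a single sequential process, which is exactly the limitation discussed below.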
-T.J.
Last edited by Teej on Mon Dec 01, 2003 11:39 am, edited 1 time in total.
Developer of DataStage Parallel Engine (Orchestrate).
One more reason I want a unique hash generator is that I need to run more than one job inserting into the table simultaneously (data from two different sources). With a hash generator this would not be a problem, but with a single ascending sequence there would be duplicates. Can anybody shed some light on the options available in DataStage for this?
Thanks much!
Ken Bland has spoken about this very thing via this forum (and given examples) many times so you should search the forum for posts by Ken. His solutions are simple and elegant for handling exactly what you want to do.
Regards,
Michael Hester
Mike Hester
mhester@petra-ps.com
Anytime you have multiple independent processes creating output for the same target table, you will have to deal with surrogate key assignment.
First of all, you have to define more clearly the type of loads that are taking place. Do you have data in two different processes that could generate the same row, so the first in is the insert and the second is the update? Or, is your data fully discrete and you simply need a unique identifier?
Database performance sucks if you are using a database to assign a surrogate key, using something like a serial column or a sequence generator. This is because you have the maintenance of creating the sequence, and this also puts a congestion point around the sequence as multiple parallel processes within PX will be fighting over it.
Your preferred choice should be to select the maximum surrogate key in use, and then have your jobstream start assignments from there. Since each node has an identifier, you have to figure out how, in PX, to pass the starting surrogate value to each partitioned process. In Server jobs this is quite easy, as you use job instances and have job parameters at that level. But in PX, each transformer will need to know the seed value and its own node number. For example, if you selected MAX on your table and got the number 1000000, you could pass that to your transformer. Each transformer could assign the next available number from there. But if you have 16 simultaneous nodes doing parallel assignment, you have to use the node number in the assignment expression to make sure each node assigns unique numbers. You will have some unused numbers, but this is the general theory.
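The seed-plus-node-number scheme can be sketched as a simple interleaving formula (the names here are illustrative, not PX expression syntax): each of N nodes starts at seed + node + 1 and steps by N, so the nodes' key sets never overlap.

```python
def node_key(seed, node, num_nodes, row_index):
    """Key for the row_index-th row (0-based) on a given node; all keys > seed."""
    return seed + node + 1 + row_index * num_nodes

seed, num_nodes = 1000000, 2
keys = sorted(node_key(seed, node, num_nodes, row)
              for node in range(num_nodes)
              for row in range(3))
# node 0 assigns 1000001, 1000003, 1000005; node 1 assigns 1000002, 1000004, 1000006
```

If one node processes fewer rows than another, its unassigned slots simply become the "unused numbers" mentioned above.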
However, your situation is two or more independent job streams doing surrogate key assignments. My pizza just arrived, so I'll continue after that. :lol:
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
I have loads from two separate companies, so they are totally independent. My requirement is to generate unique numbers for each record. But the target table is the same, and I have three separate jobs doing inserts into it simultaneously, so they must not generate the same sequence number. Each of them has to get a unique value.
I have used the technique of selecting the max value and incrementing it in a transformer in a parallel job before. There I set that particular transformer stage to run sequentially rather than in parallel, so the generated sequence numbers were unique and I did not have to worry about the nodes. But here the problem is that I have two separate jobs!
So this is solely for SK generation, and the numbers need only be unique?
Here's one way: for two jobs, assign one the range (0..maxint/2) and the other the range (maxint/2+1..maxint). Get the max SK in each range for each job before starting. For three or four jobs, split the ranges as needed.
Remember to detect the upper SK limit and barf with extreme prejudice.
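A sketch of this range-splitting idea, including the "barf at the upper limit" check (MAXINT here assumes a signed 32-bit key column; adjust for your column type, and note the function names are illustrative):

```python
MAXINT = 2**31 - 1  # assumed 32-bit signed key column

def job_range(job_index, num_jobs):
    """Disjoint key range [low, high] for one of num_jobs independent jobs."""
    width = (MAXINT + 1) // num_jobs
    low = job_index * width
    high = MAXINT if job_index == num_jobs - 1 else low + width - 1
    return low, high

def next_key(current_max_in_range, high):
    """Next key for a job; fail loudly when the job's range is exhausted."""
    key = current_max_in_range + 1
    if key > high:
        raise OverflowError("surrogate key range exhausted for this job")
    return key

low_a, high_a = job_range(0, 2)  # job A: 0 .. 1073741823
low_b, high_b = job_range(1, 2)  # job B: 1073741824 .. 2147483647
```

Each job runs `SELECT MAX(sk)` restricted to its own range at startup, then increments locally from there.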
Carter
Okay, well, enough pizza for now.
Your problem is that you cannot select max(surrogate_key) from table because another job could be in the middle of processing data.
I had another client with a billion-row table that also had multiple, independent jobs that needed to load the fact table. What we did was create a table to hold the last surrogate key for each table used. For example, the column TABLE_NAME was the primary key and the column LAST_SKEY contained the max value for that table. Any time a job stream needed to assign surrogates for its dataset, the seed value was retrieved from this table: you'd lock the row for the table you're interested in, read the value, update back into that row the value plus the row count of the dataset you're about to process, then release the lock.
What this does is jump the sequencing past the row count of the set you are going to process. It allows you to assign none, some, or all of the values in that range. You now have multiple job streams that can run simultaneously, but the only serialized portion of the process is the one second at startup where the seed value is locked, fetched, updated, and released.
Now, you just have to get over having gaps in your sequencing. BUT THIS IS ABSOLUTELY ACCEPTABLE! Surrogate keys are meaningless, they have no inherent value to the end user. Do not attach meaning to them, especially from a gap perspective.
I like this table approach over sequences because the table is a more visible object and is easier to maintain.
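A sketch of the seed table and block reservation described above. SQLite is used here only so the example is self-contained and runnable; on a real database you would use SELECT ... FOR UPDATE (or the equivalent row lock) so the read-and-bump is atomic across connections. The table and column names follow the post; the function name is illustrative.

```python
import sqlite3

def reserve_block(conn, table_name, row_count):
    """Reserve row_count keys: read LAST_SKEY, bump it, return the old value.

    The `with conn:` block wraps the read and update in one transaction,
    standing in for the lock/fetch/update/release sequence described above.
    """
    with conn:
        cur = conn.execute(
            "SELECT LAST_SKEY FROM KEY_SEED WHERE TABLE_NAME = ?", (table_name,))
        seed = cur.fetchone()[0]
        conn.execute(
            "UPDATE KEY_SEED SET LAST_SKEY = ? WHERE TABLE_NAME = ?",
            (seed + row_count, table_name))
    return seed  # this job now owns seed+1 .. seed+row_count

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE KEY_SEED (TABLE_NAME TEXT PRIMARY KEY, LAST_SKEY INTEGER)")
conn.execute("INSERT INTO KEY_SEED VALUES ('FACT_SALES', 1000000)")
seed = reserve_block(conn, "FACT_SALES", 500)  # this run owns 1000001 .. 1000500
```

If a job reserves 500 keys but only loads 300 rows, the remaining 200 are simply never used, which produces the gaps the post says are absolutely acceptable.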
Kenneth Bland
kcbland wrote: You'd lock the row for the table you're interested in, read the value, and update back into that row the value plus the row count of the dataset you're about to process, then release the lock.
Ken, your solution seems quite good and I think it would perform quickly. Just one question: when you say lock the row and read, are you talking about something like a SELECT ... FOR UPDATE, or something else? (I guess the syntax/mechanism depends on the database flavour.) Obviously, if you don't lock, there is a chance that another process may update in between the first process's read and update.
like a tiger
How about a two-part identifier?
The first part could be simply "A" and "B" (or similar).
The second part can then be a simple sequence in each case.
Or am I missing something here?
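This two-part scheme can be sketched as follows (the `make_key` helper and the zero-padding width are illustrative choices, not part of the suggestion):

```python
def make_key(job_prefix, sequence):
    """Compose a unique identifier from a per-job prefix and a local sequence."""
    return f"{job_prefix}{sequence:09d}"

# Jobs "A" and "B" keep fully independent sequences yet can never collide:
key_a = make_key("A", 1)  # 'A000000001'
key_b = make_key("B", 1)  # 'B000000001'
```

The trade-off is a character key rather than a purely numeric surrogate, which some dimensional modellers avoid, but uniqueness across jobs falls out for free.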
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.