Generate seq number in transformer only when the key changes

Nsg · Post by **Nsg** » Wed Mar 20, 2013 12:44 pm

Hello everyone,

Please help me achieve in datastage a looping logic to increment a counter when the values in the loop changes.
This can be very easily done in PL Sql or cobol but I need to do in datastage and donot know if datastage has good looping capabilities.

Here is an exaple of my input -

key1 Key2 Value
001 001 A
001 001 B
001 002 A
001 002 B
001 002 A

expected output -

key1 Key2 Value Generated_Key
001 001 A Key1 (as key1+Key2 has a new value)
001 001 B Key1 (as key1+Key2 doesnot have a new value)
001 002 A Key2 (as key1+Key2 has a new value)
001 002 B Key2 (as key1+Key2 doesnot have a new value)
001 002 A Key2 (as key1+Key2 doesnot have a new value)

Thanks,
nsg

s

prasson_ibm · Post by **prasson_ibm** » Wed Mar 20, 2013 1:25 pm

Hi,
You need to first sort based on key1 and create key change column say "keychane1" then again sort on key2 and create key change column say "changekey2",then apply below logic in transformer stage variable stage:-

Counter= if((keychange1=1 and keychange2=1) or (keychange1=0 and keychange2=1)) then counter+1 else counter

priyadarshikunal · Post by **priyadarshikunal** » Wed Mar 20, 2013 1:36 pm

or simply sort on both keys and apply the logic using stage variables. Make sure your data is correctly partitioned and sorted.

chulett · Post by **chulett** » Wed Mar 20, 2013 1:36 pm

You should be able to create the key change column over both keys at once since the number should be incremented if either of them change. The counter derivation becomes simpler then:

Code: Select all

Counter: If KeyChange then Counter+1 else Counter

'KeyChange' is the name of the key change column generated by the Sort stage and can be treated as a Boolean. Make sure your initial value of the Counter stage variable is set to zero.

And, of course, handle any partitioning if running on multiple nodes.

Nsg · Post by **Nsg** » Wed Mar 20, 2013 2:38 pm

Thanks for the response.
What you are saying will work if teh job is run on single node.
But with multiple partitions, it is not working correcttly -

Here are some tests that I had done -
svkeyNew=If Keychange=1 Then (OldKey+PartNum+NumPart) Else svOldKey
svGeneratedKey = svKeyNew
svOldKey = svKeyNew
This works for PartNum = 0 but not the other PartNum

In the above -
svKeyNew=If Keychange=1 Then (PartNum+(NumPart*(svOldKey-1)) +1) Else svOldKey
This works for all partitions but PartNum = 0

I have tried many such combinations as well, but in vain.

I can achieve what I want by using the If statement to condition the 1st row i.e. when svOldKey is null .. but then that means running that additional check for all the records that pass through the transformer. ( I am dealign with 70+ million rows and want to keep the logic as minimum as possible for effecient use of the transformer).

Please help find an effecient way of doing this.

Thansk,
nsg

ray.wurlod · Post by **ray.wurlod** » Wed Mar 20, 2013 3:12 pm

It will work on multiple nodes, provided that the data are partitioned using a key-based algorithm (hash or modulus) on the key being used in change detection, or at least the first column thereof.

jwiles · Post by **jwiles** » Wed Mar 20, 2013 3:58 pm

svKey Initial value: @PARTITIONNUM
svKey derivation: if keyChange=1 then svKey+@NUMPARTITIONS else svKey

Assuming you have properly partitioned and sorted your data, each combination of Key1 and Key2 values will be given a unique generated key value.

Optionally, you could just use the surrogate key functionality built into the transformer since v8.0

Regards,

jwiles · Post by **jwiles** » Wed Mar 20, 2013 5:53 pm

prasson_ibm wrote:To avoid any repartition.

No need.

The engine will only insert a partition in front of a stage which would require it (sorts, joins, merges, lookups and other key-requiring stages), if it feels that it's required. Won't happen with a transformer except in certain situations (such as if it's following a seqfile stage running in sequential mode--then you will typically see a RoundRobin partitioner inserted).

In this situation, the data wouldn't be repartitioned unless the developer explicitly coded it to do so.

Regards,

Nsg · Post by **Nsg** » Thu Mar 21, 2013 1:43 am

Thanks everyone for the response.

I am currently coding with sort using the key fields and hash partitioned. Keychange column is also generated.
The transformer is set to same partition as well.

The logic for genrating a sequence number only when the key value changes is coded with a If then else statement to fisrt row separately and all the following rows differently.
However, I was hoping to find a way to avaoid the use of If Then Else statment. Since this condition will be correct for 1st row but false for all the following 70 + million rows.

It is always fun to know if there are other better ways of getting to the desired result. So thanks for all your responses.

Thanks,
Neha

chulett · Post by **chulett** » Thu Mar 21, 2013 7:40 am

What makes you think you need to handle the first row differently?

Nsg · Post by **Nsg** » Thu Mar 21, 2013 8:34 am

I have tried many versions of partnum, numpart, counter etc but have not been able to generate a rule that will handle first row and all teh subsequent rows for all the partitions.

Just try the two examples I have provided in my earlier email considerign there are 2partitions - num - 0 and num - 1 and generate 4 keys and you will kno wwhat I mean.

Thanks,
nsg

jwiles · Post by **jwiles** » Thu Mar 21, 2013 9:10 am

Explain what you mean by "handle first row". What do you think you need to do differently with it? Is your requirement to start with a particular value for the generated key, or just to generate unique values?

If the former, you could modify the Initial Value logic in my earlier post:

Code: Select all

svKey Initial Value:  @PARTITIONNUM * -1

and you will start of with the value 0 for the generated key, without additional per-row logic to see if it's the first row.

Regards,

chulett · Post by **chulett** » Thu Mar 21, 2013 9:10 am

Two things. One is I can't try squat as I haven't had DataStage access for 3 years now, having moved over to the Dark Side. Second, doesn't the dang product allow you to set the initial value of a stage variable yet on the Parallel side? Been there since Day One for Server and being able to do so would solve your problem, hence my question.

jwiles · Post by **jwiles** » Thu Mar 21, 2013 9:13 am

Been able to do that as long as I've been doing parallel (7.0)

chulett · Post by **chulett** » Thu Mar 21, 2013 9:14 am

Good... then I don't see the need to play first row games and I see that you've posted to that effect. Thanks.