Reset Counter based on Types

sohasaid · Post by **sohasaid** » Tue Dec 04, 2012 5:44 am

I have a requirement to create a counter that resets whenever the type changes, as follows:

Type, Counter
A, 1
A, 2
A, 3
B, 1
B, 2
C, 1
C, 2

I ran into some difficulties at first because I need to keep the parallel execution mode and the job runs on 12 nodes.

But the problem seemed to be solved once I simply defined an auto-increment stage variable 'StgVar' with default value '0' and derivation 'StgVar+1'.

What I don't understand, and need your help to explain, is how this worked without any other logic. How did the counter get reset after each type?

Job Design:
DataSet --> transformer --> DataSet
Notes: Input data is sorted based on type and all jobs have 'parallel' execution modes.

I've attached the dataset part of the job score:


main_program: This step has 3 datasets:
ds0: {/tmp/test1.ds
      eAny=>eCollectAny
      op2[12p] (parallel APT_CombinedOperatorController:Data_Set_26)}
ds1: {op0[12p] (parallel delete data files in delete /tmp/teeeta.ds)
      >>eCollectAny
      op1[1p] (sequential delete descriptor file in delete /tmp/teeeta.ds)}
ds2: {op2[12p] (parallel APT_CombinedOperatorController:Data_Set_29)
      [pp] =>
      /tmp/teeeta.ds}

Regards.

Mike · Post by **Mike** » Tue Dec 04, 2012 7:33 am

Contrary to what you believe, it didn't work. Try a sufficiently large test dataset.

Mike

nagarjuna · Post by **nagarjuna** » Tue Dec 04, 2012 2:30 pm

Mike,

I think the source (input dataset) is already partitioned and sorted on the key column, so it is generating the sequence number per type correctly.

Mike · Post by **Mike** » Tue Dec 04, 2012 2:46 pm

Try it with 13 types or run it on a 2-node configuration and it'll be quite obvious what the problem is...

Mike
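
To illustrate Mike's hint, here is a small Python simulation (not DataStage code) of a 'StgVar + 1' stage variable that is never reset. It assumes each partition runs its own copy of the counter and that all rows of one type land in the same partition; the `toy_partitioner` (A=0, B=1, C=2) is purely hypothetical and says nothing about how the original dataset was actually partitioned.

```python
from collections import defaultdict

def naive_counter(rows, num_partitions, partition_of):
    """Simulate a 'StgVar + 1' stage variable with no reset logic.

    Each partition keeps its own counter. `partition_of` is a
    hypothetical stand-in for the real partitioner.
    """
    partitions = defaultdict(list)
    for type_ in rows:
        partitions[partition_of(type_) % num_partitions].append(type_)

    out = []
    for p in sorted(partitions):
        stg_var = 0                            # default value '0'
        for type_ in sorted(partitions[p]):    # sorted by type within the partition
            stg_var += 1                       # derivation 'StgVar + 1'
            out.append((type_, stg_var))
    return out

rows = ["A", "A", "A", "B", "B", "C", "C"]

def toy_partitioner(type_):
    # Assumption for illustration only: A -> 0, B -> 1, C -> 2.
    return ord(type_) - ord("A")

twelve_nodes = naive_counter(rows, 12, toy_partitioner)
two_nodes = naive_counter(rows, 2, toy_partitioner)

print(twelve_nodes)  # every type has a partition to itself, so it looks correct
print(two_nodes)     # A and C share a partition, so C continues from A's count
```

Under these assumptions the counter only appears to reset when no two types share a partition. As soon as two types share one (fewer nodes than types, or an unlucky hash), the second type carries on counting from the first.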

sohasaid · Post by **sohasaid** » Wed Dec 05, 2012 9:28 am

Mike wrote: Try a sufficiently large test dataset.

Thank you, Mike and Nag, for the replies.

You're right, Mike. I've tried it with 1 million records into a database table and it didn't work.

So how do you think I could achieve the requirement (i.e. reset a counter at every new type while keeping the parallel execution mode)?

Mike · Post by **Mike** » Wed Dec 05, 2012 9:49 am

This is a very common design pattern and one that has been discussed a whole lot on DSXchange.

One option is to use stage variables to detect a key change and reset a counter.

Another option is to use the sort stage to add a key change indicator to the row and use that to reset a counter.

And of course, since you are performing a key-based operation, you must ensure that your data are partitioned and sorted by that key.

Mike
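
The first option (stage variables that detect a key change) can be sketched in plain Python for a single partition. The names `prev_type` and `counter` are hypothetical stand-ins for two stage variables. The sketch assumes stage variables are evaluated in the order they are listed, so the counter must be derived before the "previous key" variable is overwritten with the current row's key.

```python
def counter_with_prev_key(sorted_types):
    """Reset a counter whenever the key differs from the previous row's key.

    Expects the rows of ONE partition, already sorted by type.
    """
    prev_type = None   # stage variable holding the previous row's key
    counter = 0        # stage variable holding the running count
    out = []
    for type_ in sorted_types:
        # Evaluated first: compare against the key of the previous row.
        counter = counter + 1 if type_ == prev_type else 1
        # Evaluated second: remember this row's key for the next row.
        prev_type = type_
        out.append((type_, counter))
    return out

result = counter_with_prev_key(["A", "A", "A", "C", "C"])
print(result)
```

This only gives correct results if every row of a given type is in the same partition and the partition is sorted on that type, which is why partitioning and sorting by the key is a precondition.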

sohasaid · Post by **sohasaid** » Sun Dec 09, 2012 5:01 am

Mike wrote: Another option is to use the sort stage to add a key change indicator to the row and use that to reset a counter.

Thanks, Mike. I've used this approach and it worked.

1. Sort and partition the input data on the type column.
2. Generate a keyChange column from the sort stage.
3. Create an integer stage variable with default value '0' using this derivation:
If DSLink35.keyChange = 0 Then StageVar + 1 Else 1

Thanks again.
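
For readers who want to trace the three steps above, here is a rough Python simulation of one partition. It assumes the key change column is 1 on the first row of each group and 0 on the following rows, which is what the derivation above relies on. `add_key_change` and `apply_derivation` are hypothetical helper names, not DataStage functions.

```python
def add_key_change(sorted_types):
    """Mimic the sort stage's key change column for one sorted partition."""
    out = []
    prev = object()  # sentinel that never equals a real key
    for type_ in sorted_types:
        out.append((type_, 0 if type_ == prev else 1))
        prev = type_
    return out

def apply_derivation(rows_with_flag):
    """If keyChange = 0 Then StageVar + 1 Else 1"""
    stage_var = 0  # default value '0'
    out = []
    for type_, key_change in rows_with_flag:
        stage_var = stage_var + 1 if key_change == 0 else 1
        out.append((type_, stage_var))
    return out

# One partition that happens to hold two types, sorted by type.
flagged = add_key_change(["A", "A", "A", "C", "C"])
counted = apply_derivation(flagged)
print(flagged)
print(counted)
```

Because the reset is driven by the flag rather than by which partition a row lands in, the counter restarts at 1 for each type even when several types share a partition.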