Timestamp sort issue

prasson_ibm · Post by **prasson_ibm** » Fri Dec 02, 2011 6:26 am

Hi All,

I have a job design where I need to sort the data on a timestamp column (microsecond precision) and, after the sort, apply transformation logic in a Transformer that picks the max timestamp and populates it for the rest of the records.
The stage variable derivation is:

Code:

If srt_to_tfm.LastChgDateTime >= svEFFDT Then srt_to_tfm.LastChgDateTime Else svEFFDT

where svEFFDT is initialized with the value '2009-07-11 13:17:39.810'.

Instead of getting the max timestamp across all input records, I am getting the wrong output.

It seems the timestamp column is sorted within each partition, so the stage variable picks up that partition's max timestamp, and then a different value for the next partition. What I want is the max timestamp across all input records.
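The behaviour described above can be reproduced with a small simulation. This is a sketch in plain Python, not DataStage code: the sample rows and the two-partition layout are made up, the data is assumed to be sorted descending within each partition, and timestamps are assumed to be fixed-format strings that compare correctly as text.

```python
# Simulation only: plain Python, not DataStage code.
INITIAL = "2009-07-11 13:17:39.810"  # initial stage variable value from the post


def running_max(rows, initial=INITIAL):
    """Mimic the stage variable: keep the larger of the current row and the value so far."""
    sv = initial
    out = []
    for ts in rows:
        sv = ts if ts >= sv else sv
        out.append(sv)
    return out


# Hypothetical rows, already sorted descending inside each partition.
partition_0 = ["2011-12-01 10:00:00.000300", "2011-12-01 09:00:00.000100"]
partition_1 = ["2011-11-30 23:59:59.999999", "2011-11-30 08:00:00.000000"]

# Parallel run: each partition has its own copy of the stage variable.
per_partition = [running_max(p) for p in (partition_0, partition_1)]

# Single node: one sorted stream, one stage variable.
single_node = running_max(sorted(partition_0 + partition_1, reverse=True))
```

In `per_partition` every row carries only its own partition's maximum, while `single_node` carries the same overall maximum on every row, which matches the difference seen between the multi-node and single-node runs.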

This job works fine on a single node.
Kindly help me with a solution to this issue.

Thanks

jwiles · Post by **jwiles** » Fri Dec 02, 2011 9:17 am

The results you are seeing are exactly how the product works. To obtain the maximum value across ALL records, you have two options:

1) Run the transformer itself in sequential mode

2) Split out the timestamps into a separate stream, process them sequentially to get the maximum value (for instance, use an aggregator stage running in sequential mode) and then join the result back to the main data.
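As a rough illustration of option 2, the sketch below uses plain Python to stand in for the fork, the sequential aggregation and the join. It is not DataStage code; the function name, the `id` field, the output column `MaxChgDateTime` and the sample rows are all hypothetical.

```python
# Simulation only: plain Python standing in for a fork, a sequential Aggregator and a join.
def attach_global_max(partitions, column="LastChgDateTime", out_column="MaxChgDateTime"):
    """Return the same partitions with the overall max of `column` added to every row."""
    # Separate stream: only the timestamp values, taken from every partition.
    timestamps = [row[column] for part in partitions for row in part]
    # Sequential aggregation: one process sees every value, so max() is global.
    # (max() raises ValueError on empty input; a real job would need a rule for that.)
    global_max = max(timestamps)
    # Join back: every row in every partition gets the same value.
    return [[{**row, out_column: global_max} for row in part] for part in partitions]


# Hypothetical input spread over two partitions.
partitions = [
    [{"id": 1, "LastChgDateTime": "2011-12-01 10:00:00.000300"}],
    [{"id": 2, "LastChgDateTime": "2011-11-30 23:59:59.999999"},
     {"id": 3, "LastChgDateTime": "2011-11-30 08:00:00.000000"}],
]
result = attach_global_max(partitions)
```

The main data keeps its partitioning; only the small timestamp stream is funnelled through a single process, which is why this option can scale better than running the whole Transformer sequentially.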

Depending upon your data quantity, either option can noticeably impact performance. However, you have already experienced the fact that to get the correct result, the data you're capturing must be processed in a single partition.

Regards,

prasson_ibm · Post by **prasson_ibm** » Fri Dec 02, 2011 1:28 pm

Hi,
Thanks for the reply.

I am trying to implement solution 2 and will keep you updated.

Thanks

prasson_ibm · Post by **prasson_ibm** » Sun Dec 04, 2011 2:58 pm

Hi,

I am planning to use the Aggregator stage.

I want to take a separate stream, create a dummy column with the constant value 1, and aggregate the data to get the max timestamp value.
In this case I will hash partition on the dummy column, so do I still need to run the Aggregator in sequential mode?

jwiles · Post by **jwiles** » Sun Dec 04, 2011 11:43 pm

You would essentially be performing the same processing as if you just ran the aggregator in sequential mode, except that you are adding additional, unnecessary overhead to your job by adding the dummy column and re-partitioning. Also, this wastes system resources by having multiple instances of the aggregator (it's running in parallel) while only one of them does any work.
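The point about idle instances can be seen in a small simulation. This is a sketch in plain Python, not DataStage code: `zlib.crc32` is only a stand-in for whatever hash function the engine actually uses, and the four-partition layout and sample rows are assumptions.

```python
# Simulation only: crc32 stands in for the engine's real hash function.
import zlib


def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition based on a hash of its key value."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode()) % num_partitions
        parts[idx].append(row)
    return parts


# Hypothetical rows: every row carries the same dummy key value.
rows = [{"dummy": 1, "ts": f"2011-12-0{d} 00:00:00.000000"} for d in range(1, 6)]
parts = hash_partition(rows, "dummy", 4)
rows_per_partition = [len(p) for p in parts]
```

Because the key is constant, every row hashes to the same partition: one entry in `rows_per_partition` holds all five rows and the other three are empty, so the extra aggregator instances have nothing to do.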

Regards,

prasson_ibm · Post by **prasson_ibm** » Mon Dec 05, 2011 4:12 am

The data volume in this job will be small, so I will go with option 1.

As suggested, I am setting the Transformer to run sequentially, while the Sort stage will keep running in parallel mode.

But do you think that, because the sort is partitioned, the Transformer (running sequentially) could receive the rows in the wrong sorted order?
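One way to reason about this question is to simulate how sorted partitions are collected into a single stream. This is a sketch in plain Python, not DataStage code, and the sample rows are made up. It assumes a descending sort within each partition and compares two ways of collecting: reading the partitions one after another, and merging them on the sort key (the idea behind a sort-merge style collector).

```python
# Simulation only: plain Python, not DataStage code.
import heapq

# Hypothetical partitions, each sorted descending on its own.
p0 = ["2011-12-01 09:00:00.000100", "2011-11-29 12:00:00.000000"]
p1 = ["2011-12-01 10:00:00.000300", "2011-11-30 08:00:00.000000"]

# Reading one partition after the other: each run is sorted, the whole stream is not.
concatenated = p0 + p1

# Merging on the sort key: the collected stream keeps the global order.
merged = list(heapq.merge(p0, p1, reverse=True))

fully_sorted = sorted(p0 + p1, reverse=True)
concatenated_is_sorted = concatenated == fully_sorted
merged_is_sorted = merged == fully_sorted
```

In this simulation only the merged stream arrives in global order, which suggests that how the rows are collected on the Transformer's input matters as much as the sort itself. Whether a sort-merge collection on the same key is the right setting for this particular job is not confirmed anywhere in the thread.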