Partition of data and remove duplicates

snt_ds · Post by **snt_ds** » Sun Jun 24, 2007 3:47 pm

Hi,

I have a job which needs to pick, for each value_Column, the record with the latest timestamp.
value_Column, Timestamp
1, 2007-03-21 00:00:00
1, 2007-01-01 00:00:00
1, 2007-02-05 00:00:00
2, 2007-03-03 00:00:00
2, 2007-05-05 00:00:00
2, 2007-05-05 00:00:00
2, 2007-01-11 00:00:00
4, 2007-05-21 00:00:00
4, 2007-02-22 00:00:00
4, 2007-01-01 00:00:00

While reading the source records above, I hash partition on Timestamp and value_column, sort on the same keys in the same order, and then in the Remove Duplicates stage I keep the same partitioning and retain the first record.

I'm expecting the result below:
1, 2007-03-21 00:00:00
2, 2007-05-05 00:00:00
4, 2007-05-21 00:00:00

In my actual job I get 2944 source records.
After the Remove Duplicates stage I get 123 records, but I should get only the unique records, of which there are 25.
Somehow I'm getting duplicates.

Can someone please help me capture only the unique records?

Thanks
Suresh

JoshGeorge · Post by **JoshGeorge** » Sun Jun 24, 2007 6:07 pm

You shouldn't be hash partitioning on both Timestamp and value_column. If you do, every distinct (value_column, Timestamp) combination is treated as its own key group, so rows that share a value_column can end up in different partitions, which is not what you need. Partition on value_column only. In the Sort stage, specify value_column as 'Don't sort, previously sorted' and sort descending on the Timestamp column only. Then, in the Remove Duplicates stage, you can retain the first record.
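To make the idea concrete, here is a rough simulation in plain Python, not DataStage itself. It uses the sample rows from the first post; the function name `latest_per_key` and the `num_partitions` parameter are made up for illustration, and it assumes the duplicate check is keyed on value_column alone.

```python
from collections import defaultdict

# Sample rows from the original post: (value_column, Timestamp).
ROWS = [
    (1, "2007-03-21 00:00:00"),
    (1, "2007-01-01 00:00:00"),
    (1, "2007-02-05 00:00:00"),
    (2, "2007-03-03 00:00:00"),
    (2, "2007-05-05 00:00:00"),
    (2, "2007-05-05 00:00:00"),
    (2, "2007-01-11 00:00:00"),
    (4, "2007-05-21 00:00:00"),
    (4, "2007-02-22 00:00:00"),
    (4, "2007-01-01 00:00:00"),
]


def latest_per_key(rows, num_partitions=4):
    """Hypothetical helper: partition on value_column, sort, keep first."""
    # Hash partition on value_column ONLY, so all rows for one value
    # land in the same partition.
    partitions = defaultdict(list)
    for value, ts in rows:
        partitions[hash(value) % num_partitions].append((value, ts))

    result = []
    for part in partitions.values():
        # Timestamp descending within each value_column group.
        # ISO-style timestamp strings sort correctly as text.
        part.sort(key=lambda r: r[1], reverse=True)
        part.sort(key=lambda r: r[0])  # stable sort keeps the timestamp order
        # Remove duplicates on value_column, retaining the first record.
        seen = set()
        for value, ts in part:
            if value not in seen:
                seen.add(value)
                result.append((value, ts))
    return sorted(result)


print(latest_per_key(ROWS))
# Deduplicating on BOTH columns only collapses exact repeats:
print(len(set(ROWS)))  # 9 of the 10 sample rows survive
```

With the sample data this returns one row per value_column (1, 2 and 4), each with its latest timestamp. The last line shows what happens if both columns are used as the duplicate key instead: only the one exact repeat is dropped, which may be the kind of effect behind the extra records in the real job.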

keshav0307 · Post by **keshav0307** » Sun Jun 24, 2007 7:39 pm

You need to partition on the key, not on the values.

It's exactly the same as

select value_Column, max(Timestamp) from <source>
group by value_Column

So you have to group rows with the same value_Column into the same partition, and then get the latest timestamp, using a sort or any other stage.

You can try the Aggregator stage too.
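As a quick way to check the expected output outside DataStage, the group-by above can be run against the sample rows with Python's built-in sqlite3 module. This is only a sketch: the table name `source_rows` and the function `latest_by_sql` are placeholders chosen for the example, and SQLite stands in for whatever database the real source is.

```python
import sqlite3

# Sample rows from the original post: (value_Column, Timestamp).
ROWS = [
    (1, "2007-03-21 00:00:00"),
    (1, "2007-01-01 00:00:00"),
    (1, "2007-02-05 00:00:00"),
    (2, "2007-03-03 00:00:00"),
    (2, "2007-05-05 00:00:00"),
    (2, "2007-05-05 00:00:00"),
    (2, "2007-01-11 00:00:00"),
    (4, "2007-05-21 00:00:00"),
    (4, "2007-02-22 00:00:00"),
    (4, "2007-01-01 00:00:00"),
]


def latest_by_sql(rows):
    """Hypothetical helper: max(Timestamp) per value_Column via SQL."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute(
            "CREATE TABLE source_rows (value_Column INTEGER, Timestamp TEXT)"
        )
        conn.executemany("INSERT INTO source_rows VALUES (?, ?)", rows)
        # Same query as in the post, with an ORDER BY for stable output.
        cursor = conn.execute(
            "SELECT value_Column, MAX(Timestamp) FROM source_rows "
            "GROUP BY value_Column ORDER BY value_Column"
        )
        return cursor.fetchall()
    finally:
        conn.close()


for value, latest in latest_by_sql(ROWS):
    print(value, latest)
```

Grouping on value_Column alone is the point here: the Aggregator stage (or the sort plus Remove Duplicates approach) should likewise be keyed on value_Column only, with Timestamp as the column being maximised.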