Hi,
I have a job which need to pick the value of the latest time stamp.
value_Column Timestamp
1, 2007-03-21 00:00:00
1, 2007-01-01 00:00:00
1, 2007-02-05 00:00:00
2, 2007-03-03 00:00:00
2, 2007-05-05 00:00:00
2, 2007-05-05 00:00:00
2, 2007-01-11 00:00:00
4, 2007-05-21 00:00:00
4, 2007-02-22 00:00:00
4, 2007-01-01 00:00:00
while reading the above source records I'm doing the hash partion on Timestamp and value_column and doing sort on the same keys and same order and then in remove duplicate stage maintaining the same partition and retaining the first record.
I'm expectind the below result :
1, 2007-03-21 00:00:00
2, 2007-05-05 00:00:00
4, 2007-05-21 00:00:00
In my actual job I'm getting 2944 source records.
After remove duplicate stage I'm getting 123 records but I shoul get only unique records which are 25.
Some how I'm getting duplicates.
Can some please help how to capture only unique records.
Thanks
Suresh
Partition of data and remove duplicates
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 612
- Joined: Thu May 03, 2007 4:59 am
- Location: Melbourne
You shouldn't be hash partitioning on both Timestamp and value_column. If you do so, according to your data for n distinct records you will have n partitions which is not your requirement. You just have to partition only on value_column. In the sort stage you can specify value_column as 'Don't sort, previously sorted' and sort descending only on Timestamp column. Now in remove duplicate stage you can retain the first record.
Joshy George
<a href="http://www.linkedin.com/in/joshygeorge1" ><img src="http://www.linkedin.com/img/webpromo/bt ... _80x15.gif" width="80" height="15" border="0"></a>
<a href="http://www.linkedin.com/in/joshygeorge1" ><img src="http://www.linkedin.com/img/webpromo/bt ... _80x15.gif" width="80" height="15" border="0"></a>
-
- Premium Member
- Posts: 783
- Joined: Mon Jan 16, 2006 10:17 pm
- Location: Sydney, Australia
you need to partition on the keys not the values.
its exactly same as
select value_Column, max(Timestamp) for <source>
group by value_Column.
so you have to group same value_Column data in same partition, and then get the letest timestamp, using sort or any other stage
you can try aggregator stage too.
its exactly same as
select value_Column, max(Timestamp) for <source>
group by value_Column.
so you have to group same value_Column data in same partition, and then get the letest timestamp, using sort or any other stage
you can try aggregator stage too.