Partition of data and remove duplicates

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
snt_ds
Premium Member
Premium Member
Posts: 280
Joined: Wed Oct 18, 2006 11:53 am
Location: Pune, India

Partition of data and remove duplicates

Post by snt_ds »

Hi,

I have a job which need to pick the value of the latest time stamp.
value_Column Timestamp
1, 2007-03-21 00:00:00
1, 2007-01-01 00:00:00
1, 2007-02-05 00:00:00
2, 2007-03-03 00:00:00
2, 2007-05-05 00:00:00
2, 2007-05-05 00:00:00
2, 2007-01-11 00:00:00
4, 2007-05-21 00:00:00
4, 2007-02-22 00:00:00
4, 2007-01-01 00:00:00

while reading the above source records I'm doing the hash partion on Timestamp and value_column and doing sort on the same keys and same order and then in remove duplicate stage maintaining the same partition and retaining the first record.

I'm expectind the below result :
1, 2007-03-21 00:00:00
2, 2007-05-05 00:00:00
4, 2007-05-21 00:00:00

In my actual job I'm getting 2944 source records.
After remove duplicate stage I'm getting 123 records but I shoul get only unique records which are 25.
Some how I'm getting duplicates.

Can some please help how to capture only unique records.

Thanks
Suresh
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Post by JoshGeorge »

You shouldn't be hash partitioning on both Timestamp and value_column. If you do so, according to your data for n distinct records you will have n partitions which is not your requirement. You just have to partition only on value_column. In the sort stage you can specify value_column as 'Don't sort, previously sorted' and sort descending only on Timestamp column. Now in remove duplicate stage you can retain the first record.
Joshy George
<a href="http://www.linkedin.com/in/joshygeorge1" ><img src="http://www.linkedin.com/img/webpromo/bt ... _80x15.gif" width="80" height="15" border="0"></a>
keshav0307
Premium Member
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

you need to partition on the keys not the values.

its exactly same as

select value_Column, max(Timestamp) for <source>
group by value_Column.

so you have to group same value_Column data in same partition, and then get the letest timestamp, using sort or any other stage

you can try aggregator stage too.
Post Reply