Parallel Sort and RemoveDuplicates

ag_ram
Premium Member
Posts: 524
Joined: Wed Feb 28, 2007 3:51 am

Post by ag_ram »

Hello Folks

The following are the stages in a Job

DataSet--------->Sort ---------->RemoveDuplicates------------>DataSet

There are 6 key columns and 1 value column. For each unique combination of the 6 keys I need the least value of the value column. The input to the Sort stage is therefore hash partitioned on the 6 keys plus the value column, and the input to the Remove Duplicates stage is repartitioned on just the 6 keys.

The volume of records going into the Sort can be huge (up to 6 million). Given this, is repartitioning advisable? Is there an alternative?
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Post by JoshGeorge »

Along with the keys, sort the value column ascending and remove the duplicates in the Sort stage itself. Whatever you were trying to achieve by repartitioning and the Remove Duplicates stage can be done in the Sort stage itself. Why not try it that way?
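
To make the intended result concrete, here is a minimal sketch in plain Python (not DataStage code) of "sort on the keys plus the value ascending, then keep the first row of each key group". The function name `least_value_per_key` and the sample rows are made up for illustration, and rows are assumed to be tuples of 6 key fields followed by 1 value field.

```python
from itertools import groupby
from operator import itemgetter

def least_value_per_key(rows, n_keys=6):
    """Return one row per key combination, carrying the smallest value.

    Assumes each row is a tuple: n_keys key fields followed by one value field.
    """
    # Tuples compare field by field, so this orders by the keys, then the value.
    ordered = sorted(rows)
    key_of = itemgetter(*range(n_keys))
    # After sorting, equal keys are adjacent and the first row of each
    # group holds the least value for that key combination.
    return [next(group) for _, group in groupby(ordered, key=key_of)]

# Hypothetical sample data: two distinct key combinations.
rows = [
    ("a", 1, 1, 1, 1, 1, 30),
    ("a", 1, 1, 1, 1, 1, 10),
    ("b", 2, 2, 2, 2, 2, 5),
    ("a", 1, 1, 1, 1, 1, 20),
]
result = least_value_per_key(rows)
```

With this sample input, `result` holds one row for key "a" with value 10 and one row for key "b" with value 5.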
Joshy George
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Repartitioning is totally irrelevant to data volume. Who suggested it?

Repartitioning is added cost that should be avoided unless absolutely necessary, for example to achieve key or group adjacency or to match database (DB2) partitioning. Repartitioning is particularly costly on cluster or grid configurations, since the records need to be transferred across the network (rather than through shared memory) to their new partitions.
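
The key adjacency point can be illustrated with a small model in plain Python. This is only a sketch under stated assumptions: it uses Python's built-in `hash`, not the hash function DataStage actually uses, and the helper names `hash_partition` and `partitions_per_key` and the sample rows are hypothetical. It compares hashing on the 6 keys alone with hashing on the 6 keys plus the value column.

```python
from collections import defaultdict

N_KEYS = 6  # the six key columns; the seventh field is the value column

def hash_partition(rows, n_partitions, n_fields):
    """Assign each row to a partition by hashing its first n_fields fields."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[:n_fields]) % n_partitions].append(row)
    return partitions

def partitions_per_key(partitions):
    """Map each 6-key combination to the set of partitions that hold it."""
    seen = defaultdict(set)
    for part_id, part_rows in partitions.items():
        for row in part_rows:
            seen[row[:N_KEYS]].add(part_id)
    return seen

# Hypothetical data: two key combinations, 100 different values each.
rows = [("a", 1, 1, 1, 1, 1, v) for v in range(100)]
rows += [("b", 2, 2, 2, 2, 2, v) for v in range(100)]

# Hashing on the 6 keys only: every key combination stays in one partition.
by_keys = partitions_per_key(hash_partition(rows, 4, N_KEYS))

# Hashing on keys plus value: rows sharing a key will usually be spread
# over several partitions, which is what forces a later repartition.
by_keys_and_value = partitions_per_key(hash_partition(rows, 4, N_KEYS + 1))
```

In this model, partitioning once on the 6 keys already places all rows for a key combination together, so the sort and the duplicate removal can both run within each partition without moving records again.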
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Luciana
Participant
Posts: 60
Joined: Fri Jun 10, 2005 7:22 am
Location: Brasil

Post by Luciana »

Set the Allow Duplicates option in the Sort stage to False to remove duplicates.

DataSet--------->Sort ---------->DataSet