sort stage followed by remove duplicates stage

pillip · Post by **pillip** » Wed Oct 01, 2014 8:19 pm

Hi,

we are using sort stage followed by remove duplicates stage in a datastage job.
Hash partioning done on col1,col2,col3 and sorting done on col1,col2 col4 in sort stage. Now in remove duplicates stage removing duplicates on col1,col2,col3. Retaining the first row in remove duplicates stage.

Remove duplicates stage is not working fine. Its once selecting the first row or the last row.

The query is, is it mandatory that the rows having duplicates be side by side for remove duplicates to retain the correct row.

Thanks.

ray.wurlod · Post by **ray.wurlod** » Wed Oct 01, 2014 9:14 pm

Think about what this combination of partitioning and sorting is doing.

By including col3 as a partitioning key, you are placing combinations of (col1,col2) on different partitions, which is causing the duplicates.

Try partitioning by col1 alone or, if it has fewer distinct values than your configuration has nodes, then col1 and col2 only.

pillip · Post by **pillip** » Wed Oct 08, 2014 8:17 pm

Hi Ray,

Current job : Sort stage followed by remove dup stage. In sort stage partition by col1,col2,col3 and then sort by col1,col2 col4. col1 and col2 asc order sort, whereas col4 desc sort. Now in remove dups there is remove dup on col1,col2,col3.

Currently wt is missing in current job is the partioning in remove dup stage is auto. We are guessing that this is causing duplicates.

We plan to change the patitioning to Same and then sort to stable sort in remove duplicate stage and test it out.

Can you let me know your thoughts on this.

Thanks you

ray.wurlod · Post by **ray.wurlod** » Wed Oct 08, 2014 8:29 pm

I would partition only by col1, and sort by col1, col2, col3. Sorting by col4 doesn't achieve anything, unless to govern the meaning of First or Last in Remove Duplicates stage.

For any given value of col1, all values of col1,col2,col3 will occur on the same partition.

Only if there are too few distinct values of col1 would I consider partitioning by col1,col2. In this case, some col1 values may occur on one partition and others on others.

Don't sort in Remove Duplicates stage at all. Provided that the Sort stage is immediately upstream of the Remove Duplicates stage, the framework will not insert any tsort operator.