Merge Sort, Unique. Replace RemoveDuplicates. Doesn't work?

JonJon · Post by **JonJon** » Thu Dec 06, 2007 10:52 pm

Hi there

Hope your day is going well.

I've got a Sequential File with a column that I'm wanting to remove, and then get a Distinct across the remaining columns.

i.e. Data looks like (4 columns, space is delimiter)
ColA ColB ColC Prod
1001 A 1 Prod1
1001 A 1 Prod2
1001 A 1 Prod3
1001 A 1 Prod4

I'm wanting to remove the Prod column, and get a distinct listing of ColA, ColB, and ColC, order ascending on all columns.

i.e. result should look like
1001 A 1

Just so there's no confusion, the SQL used to get the output is:
SELECT distinct ColA, ColB, ColC
FROM <table>
ORDER BY ColA, ColB, ColC

The DS job that I built is structured:

SeqFile_Source -> Link1 -> Transformation (remove Prod) -> Link2 -> SeqFile_Target

SeqFile_Source contains the listing as shown.

The Transformation will only output the first columns from the list.

SeqFile_Target will have Partition Type set to Sort Merge, Perform Sort, Unique

And yet the output that I'm getting looks like
1001 A 1
1001 A 1

I'm running the job across two Partitions on the Server, which would explain why I get two rows (one per partition).

What I can't understand is why the Unique option isn't working.

The SeqFile_Target is set to output Seqentially (Execution Mode), Perform Sort has been set with all three Columns (ColA, ColB, ColC), all set to sort Ascending, and all columns set to both Sort and Collect.

I can solve this issue by having a RemoveDuplicates Stage before the SeqFile_Target.

Howeve, I would like to know why the Unique Option doesn't work across Partitions.

For my own understanding, a better question would be what kind of operations are possible when moving data from a Transformation to a Sequential file (no Stages in between) with the data in Partitions?

Any feedback would be appreciated.

ray.wurlod · Post by **ray.wurlod** » Fri Dec 07, 2007 1:00 am

Partitions are independent. Uniqueness across partitions is not possible. That's why you need to partition your data on the same keys that you need to group, sort or remove duplicates.

JonJon · Post by **JonJon** » Sun Dec 09, 2007 11:06 pm

But isn't the effect of applying a "Sort Merge" on the Sequential file taking multiple partitions and merging them into one single list before ordering that list?

At the end of the "Sort Merge", I'd expect a sorted list (with possibly repeated rows).

Does the Unique option on the same Sequential file not make the above list unique?

The Job isn't working as it should, so it's obvious that my perception of "how it should" work isn't right.

The "Sort Merge" appears to work, but the Unique options doesn't.

Any ideas?

balajisr · Post by **balajisr** » Mon Dec 10, 2007 12:53 am

Can you have a peek stage executing in sequential mode instead of the target sequential file and check the results?

thompsonp · Post by **thompsonp** » Fri Dec 14, 2007 5:39 am

I believed your description of the expected behaviour to be correct and can confirm, having tried it, that it is.

I have a Row Generator -> Peek -> Seq File

The Sort Merge collector will produce an output sorted on the keys you specify in the Sequential File (All 3 need to have usage Sorting, Collecting).
There are 10 rows with 3 different combinations of key values. When Unique is selected I get 3 rows in the output, regardless of how many nodes I run on.

What fields have you specified for the Sort Merge?