Merge Sort, Unique. Replace RemoveDuplicates. Doesn't work?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
JonJon
Participant
Posts: 3
Joined: Mon Dec 03, 2007 7:49 pm

Merge Sort, Unique. Replace RemoveDuplicates. Doesn't work?

Post by JonJon »

Hi there

Hope your day is going well.

I've got a Sequential File with a column that I'm wanting to remove, and then get a Distinct across the remaining columns.

i.e. Data looks like (4 columns, space is delimiter)
ColA ColB ColC Prod
1001 A 1 Prod1
1001 A 1 Prod2
1001 A 1 Prod3
1001 A 1 Prod4

I'm wanting to remove the Prod column, and get a distinct listing of ColA, ColB, and ColC, order ascending on all columns.

i.e. result should look like
1001 A 1

Just so there's no confusion, the SQL used to get the output is:
SELECT distinct ColA, ColB, ColC
FROM <table>
ORDER BY ColA, ColB, ColC

The DS job that I built is structured:

SeqFile_Source -> Link1 -> Transformation (remove Prod) -> Link2 -> SeqFile_Target

SeqFile_Source contains the listing as shown.

The Transformation will only output the first columns from the list.

SeqFile_Target will have Partition Type set to Sort Merge, Perform Sort, Unique

And yet the output that I'm getting looks like
1001 A 1
1001 A 1

I'm running the job across two Partitions on the Server, which would explain why I get two rows (one per partition).

What I can't understand is why the Unique option isn't working.

The SeqFile_Target is set to output Seqentially (Execution Mode), Perform Sort has been set with all three Columns (ColA, ColB, ColC), all set to sort Ascending, and all columns set to both Sort and Collect.

I can solve this issue by having a RemoveDuplicates Stage before the SeqFile_Target.

Howeve, I would like to know why the Unique Option doesn't work across Partitions.

For my own understanding, a better question would be what kind of operations are possible when moving data from a Transformation to a Sequential file (no Stages in between) with the data in Partitions?

Any feedback would be appreciated.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Partitions are independent. Uniqueness across partitions is not possible. That's why you need to partition your data on the same keys that you need to group, sort or remove duplicates.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
JonJon
Participant
Posts: 3
Joined: Mon Dec 03, 2007 7:49 pm

Post by JonJon »

But isn't the effect of applying a "Sort Merge" on the Sequential file taking multiple partitions and merging them into one single list before ordering that list?

At the end of the "Sort Merge", I'd expect a sorted list (with possibly repeated rows).

Does the Unique option on the same Sequential file not make the above list unique?

The Job isn't working as it should, so it's obvious that my perception of "how it should" work isn't right.

The "Sort Merge" appears to work, but the Unique options doesn't.

Any ideas?
balajisr
Charter Member
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Can you have a peek stage executing in sequential mode instead of the target sequential file and check the results?
thompsonp
Premium Member
Premium Member
Posts: 205
Joined: Tue Mar 01, 2005 8:41 am

Post by thompsonp »

I believed your description of the expected behaviour to be correct and can confirm, having tried it, that it is.

I have a Row Generator -> Peek -> Seq File

The Sort Merge collector will produce an output sorted on the keys you specify in the Sequential File (All 3 need to have usage Sorting, Collecting).
There are 10 rows with 3 different combinations of key values. When Unique is selected I get 3 rows in the output, regardless of how many nodes I run on.

What fields have you specified for the Sort Merge?
Post Reply