Hi there
Hope your day is going well.
I've got a Sequential File with a column that I'm wanting to remove, and then get a Distinct across the remaining columns.
i.e. Data looks like (4 columns, space is delimiter)
ColA ColB ColC Prod
1001 A 1 Prod1
1001 A 1 Prod2
1001 A 1 Prod3
1001 A 1 Prod4
I'm wanting to remove the Prod column, and get a distinct listing of ColA, ColB, and ColC, order ascending on all columns.
i.e. result should look like
1001 A 1
Just so there's no confusion, the SQL used to get the output is:
SELECT distinct ColA, ColB, ColC
FROM <table>
ORDER BY ColA, ColB, ColC
The DS job that I built is structured:
SeqFile_Source -> Link1 -> Transformation (remove Prod) -> Link2 -> SeqFile_Target
SeqFile_Source contains the listing as shown.
The Transformation will only output the first columns from the list.
SeqFile_Target will have Partition Type set to Sort Merge, Perform Sort, Unique
And yet the output that I'm getting looks like
1001 A 1
1001 A 1
I'm running the job across two Partitions on the Server, which would explain why I get two rows (one per partition).
What I can't understand is why the Unique option isn't working.
The SeqFile_Target is set to output Seqentially (Execution Mode), Perform Sort has been set with all three Columns (ColA, ColB, ColC), all set to sort Ascending, and all columns set to both Sort and Collect.
I can solve this issue by having a RemoveDuplicates Stage before the SeqFile_Target.
Howeve, I would like to know why the Unique Option doesn't work across Partitions.
For my own understanding, a better question would be what kind of operations are possible when moving data from a Transformation to a Sequential file (no Stages in between) with the data in Partitions?
Any feedback would be appreciated.
Merge Sort, Unique. Replace RemoveDuplicates. Doesn't work?
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Partitions are independent. Uniqueness across partitions is not possible. That's why you need to partition your data on the same keys that you need to group, sort or remove duplicates.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
But isn't the effect of applying a "Sort Merge" on the Sequential file taking multiple partitions and merging them into one single list before ordering that list?
At the end of the "Sort Merge", I'd expect a sorted list (with possibly repeated rows).
Does the Unique option on the same Sequential file not make the above list unique?
The Job isn't working as it should, so it's obvious that my perception of "how it should" work isn't right.
The "Sort Merge" appears to work, but the Unique options doesn't.
Any ideas?
At the end of the "Sort Merge", I'd expect a sorted list (with possibly repeated rows).
Does the Unique option on the same Sequential file not make the above list unique?
The Job isn't working as it should, so it's obvious that my perception of "how it should" work isn't right.
The "Sort Merge" appears to work, but the Unique options doesn't.
Any ideas?
I believed your description of the expected behaviour to be correct and can confirm, having tried it, that it is.
I have a Row Generator -> Peek -> Seq File
The Sort Merge collector will produce an output sorted on the keys you specify in the Sequential File (All 3 need to have usage Sorting, Collecting).
There are 10 rows with 3 different combinations of key values. When Unique is selected I get 3 rows in the output, regardless of how many nodes I run on.
What fields have you specified for the Sort Merge?
I have a Row Generator -> Peek -> Seq File
The Sort Merge collector will produce an output sorted on the keys you specify in the Sequential File (All 3 need to have usage Sorting, Collecting).
There are 10 rows with 3 different combinations of key values. When Unique is selected I get 3 rows in the output, regardless of how many nodes I run on.
What fields have you specified for the Sort Merge?