Use of Sorting and Downstream operators (Remove Duplicate)

ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Hi All

I have a question regarding the use of sort functionality, be it in a Sort stage or on the input of another stage, and the resulting error:

User inserted sort "rmv_duplicate_data.lnk_rmv_duplicate_data_Sort" does not fulfill the sort requirements of the downstream operator "rmv_duplicate_data"

All the posts about this topic on this forum suggest that the sort key has to be the same as the key used downstream. My question is WHY?

To me this requirement seems flawed. I should be able to sort my data on those key fields AND any other fields, so that when I retain the first or last record I get the correct one in cases where a secondary key determines which record should be kept.

E.g. if my key were an account number followed by a sequence number, and I wished to keep only the record with the highest sequence number, then I would have to sort by both account number and sequence but deduplicate only on the account number.

Account          Seq
000001            01
000002            01
000003            01
000001            02
000002            02

Output:
000001            02
000002            02
000003            01
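
For illustration only, the intended logic can be sketched outside DataStage in plain Python. This is not how the Sort or Remove Duplicates stages are implemented, and the name `keep_last_per_account` is made up for the example. It sorts on both account and sequence, then deduplicates on account alone, keeping the last row of each group:

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the example above: (account, seq).
rows = [
    ("000001", 1),
    ("000002", 1),
    ("000003", 1),
    ("000001", 2),
    ("000002", 2),
]


def keep_last_per_account(rows):
    """Sort on (account, seq), then keep the last row of each account group."""
    # Sort key has two columns: account first, then sequence.
    ordered = sorted(rows, key=itemgetter(0, 1))
    # Dedup key has one column: account. groupby relies on equal accounts
    # being adjacent, which the sort above provides.
    return [list(group)[-1] for _, group in groupby(ordered, key=itemgetter(0))]


result = keep_last_per_account(rows)
print(result)
```

The sort key (account, seq) is wider than the dedup key (account), yet the grouping still works because account is the leading sort column.
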

The weirder thing is that I am sure this logic worked until recently; only now am I getting errors.

Any suggestions?
John Smith
Charter Member
Posts: 193
Joined: Tue Sep 05, 2006 8:01 pm
Location: Australia

Post by John Smith »

ShaneMuir wrote:
All the posts about this topic on this forum suggest that the sort key has to be the same as the key used downstream. My question is WHY?

I disagree with those posts. What you are doing is fine; you can ignore this warning. Bear in mind it's a warning only.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

John Smith wrote:
I disagree with those posts. What you are doing is fine; you can ignore this warning. Bear in mind it's a warning only.

I agree with you, as I am sure that I have had it working previously. I realise it's only a warning and that I could downgrade it to an informational message, but I would like to know why I get the warning in the first place, considering you'd think it was legitimate to sort by more columns than you deduplicate by.
John Smith
Charter Member
Posts: 193
Joined: Tue Sep 05, 2006 8:01 pm
Location: Australia

Post by John Smith »

You need to specify what errors you're getting. In terms of the output, do you get the correct results? The warning message just means that you're deduping on fewer keys than you were sorting on, which in some cases is legit. But if you are getting incorrect results, then it's most likely to do with partitioning or something in the data.
Hope this helps.
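
As a rough illustration of why deduping on fewer keys than the sort keys can be legitimate, here is a small Python sketch. It only demonstrates a property of the data, not what DataStage actually checks when it raises the warning, and the helper `is_grouped_by` is a hypothetical name. Rows sorted on (account, seq) always have equal accounts adjacent to each other, which is what a first/last-per-key dedup needs:

```python
from operator import itemgetter


def is_grouped_by(rows, key):
    """Return True if all rows sharing a key value are adjacent."""
    seen = set()
    previous = object()  # sentinel that equals no real key value
    for row in rows:
        k = key(row)
        if k != previous:
            if k in seen:
                # This key value appeared earlier in a separate run.
                return False
            seen.add(k)
            previous = k
    return True


unsorted_rows = [
    ("000001", 1),
    ("000002", 1),
    ("000003", 1),
    ("000001", 2),
    ("000002", 2),
]

# Sort on two columns, then check grouping on the first column only.
sorted_rows = sorted(unsorted_rows, key=itemgetter(0, 1))

print(is_grouped_by(sorted_rows, itemgetter(0)))
print(is_grouped_by(unsorted_rows, itemgetter(0)))
```

The reverse does not hold: data sorted only on seq would not be grouped by account, so the dedup key has to lead the sort key list.
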
keshav0307
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

ShaneMuir wrote:
All the posts about this topic on this forum suggest that the sort key has to be the same as the key used downstream. My question is WHY?

The Parallel Job Developer's Guide says:

"You should hash partition the data using the sort keys as hash keys in order to guarantee that duplicate rows are in the same partition"
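
To show what that advice guards against, here is a minimal Python sketch of hash partitioning followed by a per-partition sort and dedup. It is a simulation under assumptions, not DataStage code: `hash_partition` and `dedupe_partition` are hypothetical helpers, `crc32` stands in for whatever hash function the engine uses, and two partitions are chosen arbitrarily. Note that the hash uses the account column only; the sequence column is used for sorting inside each partition.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
from zlib import crc32


def hash_partition(rows, num_partitions):
    """Assign each (account, seq) row to a partition by hashing the account only."""
    partitions = defaultdict(list)
    for account, seq in rows:
        index = crc32(account.encode()) % num_partitions
        partitions[index].append((account, seq))
    return partitions


def dedupe_partition(part):
    """Within one partition: sort on (account, seq), keep the last row per account."""
    ordered = sorted(part, key=itemgetter(0, 1))
    return [list(group)[-1] for _, group in groupby(ordered, key=itemgetter(0))]


rows = [
    ("000001", 1),
    ("000002", 1),
    ("000003", 1),
    ("000001", 2),
    ("000002", 2),
]

parts = hash_partition(rows, num_partitions=2)
# Each partition is deduplicated independently, as parallel nodes would do.
survivors = sorted(row for part in parts.values() for row in dedupe_partition(part))
print(survivors)
```

If the sequence number were included in the hash, two rows for the same account could land in different partitions and both would survive the dedup. That is one way to get incorrect results even when the sort itself is fine.
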
OddJob
Participant
Posts: 163
Joined: Tue Feb 28, 2006 5:00 am
Location: Sheffield, UK

Post by OddJob »

I received the exact same warning yesterday. The solution was that...

Although the keys were correct for the Remove Duplicates stage, the sort order was not! It seems the Remove Duplicates stage wants the sort order to be Ascending, not Descending!

Does this match your situation?
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

It could have something to do with Ascending vs Descending, but even if I set the sort order to ascending and select retain last instead of first I still get the error.

I am thinking that it may have something to do with a recently applied patch from IBM, as other jobs are now starting to get the same error when previously they ran without warnings.
mavrick4321
Participant
Posts: 2
Joined: Wed Apr 02, 2008 7:46 pm

Re: Use of Sorting and Downstream operators (Remove Duplicat

Post by mavrick4321 »

The order of the sort keys in the Partitioning tab should be the same as in the Columns tab.