Remove Duplicates not removing duplicates

pillip
Premium Member
Posts: 50
Joined: Thu Dec 10, 2009 10:43 am

Remove Duplicates not removing duplicates

Post by pillip »

hi,

I have an issue with the Remove Duplicates stage in my job. It is not removing duplicates based on the key, even though identical key values are arriving at the stage.
Could you please let me know why this could happen?



Thanks
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Is your data sorted properly?
-craig

"You can never have too many knives" -- Logan Nine Fingers
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Re: Remove Duplicates not removing duplicates

Post by kwwilliams »

Two possible causes I can think of:

1. The data is not hash partitioned on the key, so records with the same key value do not land on the same partition.
2. The data is not sorted properly. The Remove Duplicates stage expects its input to be sorted by the key; it essentially removes duplicates that are located one after another.

So is the data sorted and hash partitioned correctly?
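
To illustrate the second point, here is a minimal Python sketch of "remove duplicates that sit one after another". It is only a model of the behavior described above, not how DataStage implements the stage; the function name `remove_adjacent_duplicates` and the sample rows are made up for the example.

```python
from itertools import groupby


def remove_adjacent_duplicates(rows, key):
    """Keep the first row of each run of consecutive rows that share a key."""
    return [next(group) for _, group in groupby(rows, key=key)]


def by_id(row):
    return row["id"]


# Hypothetical input: id 1 appears twice, but the two rows are not adjacent.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 1, "name": "c"},
]

# Unsorted input: the two id=1 rows are never compared, so both survive.
unsorted_result = remove_adjacent_duplicates(rows, by_id)

# Sorted input: equal keys become neighbors, so only the first one is kept.
sorted_result = remove_adjacent_duplicates(sorted(rows, key=by_id), by_id)

print([r["id"] for r in unsorted_result])
print([r["id"] for r in sorted_result])
```

Without the sort, the second id=1 row passes straight through, which matches the symptom in the original question.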
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Keith has it. Your data are not partitioned correctly or not sorted correctly or both.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Figured I'd start with sorted and go from there. :wink:
-craig
pillip
Premium Member
Posts: 50
Joined: Thu Dec 10, 2009 10:43 am

Post by pillip »

chulett wrote: Figured I'd start with sorted and go from there. :wink:
Should the data be sorted and hash partitioned or just sorted?


Thanks
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Unless you are running on a single node, you need both.
-craig
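
A rough Python sketch of why a sort alone is not enough on more than one node. It assumes each node sorts and de-duplicates only its own partition; the helper names (`run_on_nodes`, `round_robin`, `key_based`) are hypothetical, and `key % node_count` is just a stand-in for hash partitioning on an integer key.

```python
from itertools import groupby


def dedupe_sorted(keys):
    """Sort one partition, then keep one value per run of equal keys."""
    return [key for key, _ in groupby(sorted(keys))]


def run_on_nodes(keys, node_count, assign):
    """Split keys into partitions, de-duplicate each one independently."""
    partitions = [[] for _ in range(node_count)]
    for position, key in enumerate(keys):
        partitions[assign(position, key, node_count)].append(key)
    output = []
    for partition in partitions:
        output.extend(dedupe_sorted(partition))
    return output


def round_robin(position, key, node_count):
    # Ignores the key, so equal keys can end up on different nodes.
    return position % node_count


def key_based(position, key, node_count):
    # Stand-in for hash partitioning: equal keys always share a node.
    return key % node_count


keys = [7, 4, 7, 5, 4]

round_robin_out = run_on_nodes(keys, 2, round_robin)  # 4 survives on both nodes
key_based_out = run_on_nodes(keys, 2, key_based)      # each key appears once
single_node_out = run_on_nodes(keys, 1, round_robin)  # one partition, sort suffices

print(round_robin_out, key_based_out, single_node_out)
```

With one node there is only one partition, so sorting is sufficient; with two nodes, only the key-based split guarantees that matching records meet on the same node.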