Remove duplicates problem

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
boris_karloff
Premium Member
Premium Member
Posts: 18
Joined: Wed Feb 14, 2007 10:04 am

Remove duplicates problem

Post by boris_karloff »

Hi,

I am having problems with removing duplicates stage in PX. I have a job as follows

Dataset ---- Sort ----- Removed Duplicates -------- Dataset

Here is a sample of the data
UIDCONTRACT COMMISSION_TYPE PRIORITY
694430 LS Commission 1
694430 LS Percentage 2
694430 LS Fixed Fee 3

I sort the data by contract and priority both ascending nulls first.
I then remove duplicates based on the UIDCONTRACT in order to retain the first which I believe should be the LS Commission Priority 1 as I have sorted the data in this way.

Why is the single row that is outputted in the Dataset LS Percentage Prioriy 2 ? I would of thought it should be either 1 or 3 depending on either keeping first or last rows?

I am very confused by this and am having a terrible time trying to work out what is going wrong.
balajisr
Charter Member
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Sort key column should be same as remove duplicates key columns.

Your initial sort order with contract and priority would have been reordered by the remove duplicates stage.
boris_karloff
Premium Member
Premium Member
Posts: 18
Joined: Wed Feb 14, 2007 10:04 am

Post by boris_karloff »

So how would I go about this ? I need to remove the lower tow of them ?
balajisr
Charter Member
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

You need to remove duplicates on what?

If you need to remove duplicates on UID and Contract then all you need is:

Dataset ---------> Remove Duplicates--------->Dataset

Remove duplicates key being UID and Contract.

Need more details about your expected output.
boris_karloff
Premium Member
Premium Member
Posts: 18
Joined: Wed Feb 14, 2007 10:04 am

Post by boris_karloff »

What I want to do is to remove the duplicates based on UIDCONTRACT, and obtain the first row i.e. the row with priority 1.

I thought about using an aggregator stage bu that just brings back all the rows as I cant leave out the description.
vijayrc
Participant
Posts: 197
Joined: Sun Apr 02, 2006 10:31 am
Location: NJ

Post by vijayrc »

boris_karloff wrote:What I want to do is to remove the duplicates based on UIDCONTRACT, and obtain the first row i.e. the row with priority 1.

I thought about using an aggregator stage bu that just brings back all the rows as I cant leave out the description.
You can achieve the same with just
Dataset-->Sort-->Dataset
Sort can also be used as a remove duplicate stage, by selecting the option to either retain first or last, with remove duplicate option.
Just my thought
sachin1
Participant
Posts: 325
Joined: Wed May 30, 2007 7:42 am
Location: india

Re: Remove duplicates problem

Post by sachin1 »

looking at UIDCONTRACT COMMISSION_TYPE, combination of both seems to be unique, please provide more data for input and what exactly output you want.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Is the partitioning correct?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
boris_karloff
Premium Member
Premium Member
Posts: 18
Joined: Wed Feb 14, 2007 10:04 am

Post by boris_karloff »

Yes ray I have checked the partitioning and it is sorted and partitioned on UIDCONTRACT.
thompsonp
Premium Member
Premium Member
Posts: 205
Joined: Tue Mar 01, 2005 8:41 am

Post by thompsonp »

What settings do you have for sorting and partitioning in the remove duplicates stage?

Partition by UIDCONTRACT and sort by UIDCONTRACT and PRIORITY in the sort stage.

Make sure that the data is not then repartitioned or sorted in the remove duplicates stage (use SAME partitioning if it is being), and the row you specify to keep will depend on whether you sorted PRIORITY ascending or descending in the previous sort stage. The remove duplicate stage should only specify UIDCONTRACT as the key.

You can, if you like, remove the sort stage and do it on the input link to the remove duplicates stage. Partition by UIDCONTRACT and sort by UIDCONTRACT and PRIORITY.
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Post by JoshGeorge »

What partition are you using? Hash ?
Joshy George
<a href="http://www.linkedin.com/in/joshygeorge1" ><img src="http://www.linkedin.com/img/webpromo/bt ... _80x15.gif" width="80" height="15" border="0"></a>
boris_karloff
Premium Member
Premium Member
Posts: 18
Joined: Wed Feb 14, 2007 10:04 am

Post by boris_karloff »

Josh yes I am using a hash partition, is this the best way ?
Post Reply