Page 1 of 1

Remove duplicates problem

Posted: Fri Jun 22, 2007 2:20 am
by boris_karloff
Hi,

I am having problems with removing duplicates stage in PX. I have a job as follows

Dataset ---- Sort ----- Removed Duplicates -------- Dataset

Here is a sample of the data
UIDCONTRACT COMMISSION_TYPE PRIORITY
694430 LS Commission 1
694430 LS Percentage 2
694430 LS Fixed Fee 3

I sort the data by contract and priority both ascending nulls first.
I then remove duplicates based on the UIDCONTRACT in order to retain the first which I believe should be the LS Commission Priority 1 as I have sorted the data in this way.

Why is the single row that is outputted in the Dataset LS Percentage Prioriy 2 ? I would of thought it should be either 1 or 3 depending on either keeping first or last rows?

I am very confused by this and am having a terrible time trying to work out what is going wrong.

Posted: Fri Jun 22, 2007 3:19 am
by balajisr
Sort key column should be same as remove duplicates key columns.

Your initial sort order with contract and priority would have been reordered by the remove duplicates stage.

Posted: Fri Jun 22, 2007 3:52 am
by boris_karloff
So how would I go about this ? I need to remove the lower tow of them ?

Posted: Fri Jun 22, 2007 4:14 am
by balajisr
You need to remove duplicates on what?

If you need to remove duplicates on UID and Contract then all you need is:

Dataset ---------> Remove Duplicates--------->Dataset

Remove duplicates key being UID and Contract.

Need more details about your expected output.

Posted: Fri Jun 22, 2007 4:39 am
by boris_karloff
What I want to do is to remove the duplicates based on UIDCONTRACT, and obtain the first row i.e. the row with priority 1.

I thought about using an aggregator stage bu that just brings back all the rows as I cant leave out the description.

Posted: Fri Jun 22, 2007 6:06 am
by vijayrc
boris_karloff wrote:What I want to do is to remove the duplicates based on UIDCONTRACT, and obtain the first row i.e. the row with priority 1.

I thought about using an aggregator stage bu that just brings back all the rows as I cant leave out the description.
You can achieve the same with just
Dataset-->Sort-->Dataset
Sort can also be used as a remove duplicate stage, by selecting the option to either retain first or last, with remove duplicate option.
Just my thought

Re: Remove duplicates problem

Posted: Fri Jun 22, 2007 6:52 am
by sachin1
looking at UIDCONTRACT COMMISSION_TYPE, combination of both seems to be unique, please provide more data for input and what exactly output you want.

Posted: Fri Jun 22, 2007 8:34 am
by ray.wurlod
Is the partitioning correct?

Posted: Mon Jun 25, 2007 4:51 am
by boris_karloff
Yes ray I have checked the partitioning and it is sorted and partitioned on UIDCONTRACT.

Posted: Mon Jun 25, 2007 6:09 am
by thompsonp
What settings do you have for sorting and partitioning in the remove duplicates stage?

Partition by UIDCONTRACT and sort by UIDCONTRACT and PRIORITY in the sort stage.

Make sure that the data is not then repartitioned or sorted in the remove duplicates stage (use SAME partitioning if it is being), and the row you specify to keep will depend on whether you sorted PRIORITY ascending or descending in the previous sort stage. The remove duplicate stage should only specify UIDCONTRACT as the key.

You can, if you like, remove the sort stage and do it on the input link to the remove duplicates stage. Partition by UIDCONTRACT and sort by UIDCONTRACT and PRIORITY.

Posted: Mon Jun 25, 2007 6:34 am
by JoshGeorge
What partition are you using? Hash ?

Posted: Wed Jun 27, 2007 3:57 am
by boris_karloff
Josh yes I am using a hash partition, is this the best way ?