Remove duplicates problem

boris_karloff · Post by **boris_karloff** » Fri Jun 22, 2007 2:20 am

Hi,

I am having problems with removing duplicates stage in PX. I have a job as follows

Dataset ---- Sort ----- Removed Duplicates -------- Dataset

Here is a sample of the data
UIDCONTRACT COMMISSION_TYPE PRIORITY
694430 LS Commission 1
694430 LS Percentage 2
694430 LS Fixed Fee 3

I sort the data by contract and priority both ascending nulls first.
I then remove duplicates based on the UIDCONTRACT in order to retain the first which I believe should be the LS Commission Priority 1 as I have sorted the data in this way.

Why is the single row that is outputted in the Dataset LS Percentage Prioriy 2 ? I would of thought it should be either 1 or 3 depending on either keeping first or last rows?

I am very confused by this and am having a terrible time trying to work out what is going wrong.

balajisr · Post by **balajisr** » Fri Jun 22, 2007 3:19 am

Sort key column should be same as remove duplicates key columns.

Your initial sort order with contract and priority would have been reordered by the remove duplicates stage.

boris_karloff · Post by **boris_karloff** » Fri Jun 22, 2007 3:52 am

So how would I go about this ? I need to remove the lower tow of them ?

balajisr · Post by **balajisr** » Fri Jun 22, 2007 4:14 am

You need to remove duplicates on what?

If you need to remove duplicates on UID and Contract then all you need is:

Dataset ---------> Remove Duplicates--------->Dataset

Remove duplicates key being UID and Contract.

Need more details about your expected output.

boris_karloff · Post by **boris_karloff** » Fri Jun 22, 2007 4:39 am

What I want to do is to remove the duplicates based on UIDCONTRACT, and obtain the first row i.e. the row with priority 1.

I thought about using an aggregator stage bu that just brings back all the rows as I cant leave out the description.

vijayrc · Post by **vijayrc** » Fri Jun 22, 2007 6:06 am

boris_karloff wrote:What I want to do is to remove the duplicates based on UIDCONTRACT, and obtain the first row i.e. the row with priority 1.

I thought about using an aggregator stage bu that just brings back all the rows as I cant leave out the description.

You can achieve the same with just
Dataset-->Sort-->Dataset
Sort can also be used as a remove duplicate stage, by selecting the option to either retain first or last, with remove duplicate option.
Just my thought

sachin1 · Post by **sachin1** » Fri Jun 22, 2007 6:52 am

looking at UIDCONTRACT COMMISSION_TYPE, combination of both seems to be unique, please provide more data for input and what exactly output you want.

ray.wurlod · Post by **ray.wurlod** » Fri Jun 22, 2007 8:34 am

Is the partitioning correct?

boris_karloff · Post by **boris_karloff** » Mon Jun 25, 2007 4:51 am

Yes ray I have checked the partitioning and it is sorted and partitioned on UIDCONTRACT.

thompsonp · Post by **thompsonp** » Mon Jun 25, 2007 6:09 am

What settings do you have for sorting and partitioning in the remove duplicates stage?

Partition by UIDCONTRACT and sort by UIDCONTRACT and PRIORITY in the sort stage.

Make sure that the data is not then repartitioned or sorted in the remove duplicates stage (use SAME partitioning if it is being), and the row you specify to keep will depend on whether you sorted PRIORITY ascending or descending in the previous sort stage. The remove duplicate stage should only specify UIDCONTRACT as the key.

You can, if you like, remove the sort stage and do it on the input link to the remove duplicates stage. Partition by UIDCONTRACT and sort by UIDCONTRACT and PRIORITY.

JoshGeorge · Post by **JoshGeorge** » Mon Jun 25, 2007 6:34 am

What partition are you using? Hash ?

boris_karloff · Post by **boris_karloff** » Wed Jun 27, 2007 3:57 am

Josh yes I am using a hash partition, is this the best way ?

DSXchange

Remove duplicates problem

Remove duplicates problem

Re: Remove duplicates problem