
Removing Duplicates: unique situation

Posted: Tue Apr 26, 2011 12:29 am
by abc123
I have gone through all relevant posts. I have data as follows:

Key: Col2, Col3, Col4

Col1 Col2 Col3 Col4 Col_Date
1,10,A,100,02/10/2011
2,20,A,200,03/10/2011
3,20,A,200,04/10/2011
4,30,B,300,05/10/2011
5,20,A,200,06/10/2011

As you can see, lines 2, 3 and 5 are duplicates on Col2, Col3 and Col4. I would like to remove line 3, which is a duplicate of line 2, but I would like to keep line 5 since it appears later (after line 4, which has a different key), even though its three key values are the same as lines 2 and 3.

1) I tried the Sort stage with the KeyChange column set to True. However, this sorts the data first, which defeats my purpose.

2) I also tried 'Don't Sort, Previously Sorted' for these 3 columns in the Sort stage, but that gives an error saying 'Record 4 not sorted'.

So what I need is the KeyChange feature of the Sort stage without doing the sort. Any ideas would be appreciated.

Posted: Tue Apr 26, 2011 1:18 am
by abc123
Thank you, tbharathkumar, for your response.

Unfortunately, this is more complex. Both line 2 and line 5 need to be retained even though they are duplicates. Please read my first post again.

Thanks.

Posted: Tue Apr 26, 2011 1:43 am
by zulfi123786
The KeyChange column of the Sort stage can be implemented very easily in the Transformer stage, using a combination of stage variables that hold the key values of the current and previous records.

Posted: Tue Apr 26, 2011 2:34 am
by tbharathkumar
The design mentioned will retain both line 2 and line 5. As mentioned by another user, you can also implement this in a Transformer (DS ver. 8.x).

Re: Removing Duplicates: unique situation

Posted: Tue Apr 26, 2011 3:37 am
by VijayDS
Hi ABC,

You can do this with the Remove Duplicates stage.
In one link use 'retain first record' and in the other link use 'retain last record'. Now you will get records 2 and 5.

Re: Removing Duplicates: unique situation

Posted: Tue Apr 26, 2011 4:05 am
by blewip
From your requirements, I don't think any suggestion given so far will work.

I believe you want duplicates that appear together to be removed, yet if they come up later on (after a new key) they should be kept.

First of all, the processing should all be done in sequential mode, or else partitioning could scramble the order of the records. The order appears to be important here.

Then use stage variables in a Transformer to hold the old values; if they all match, exclude the record, else pass it through.

Posted: Tue Apr 26, 2011 5:45 am
by abc123
blewip, you are right. You seem to understand the requirement. I know about the sequential mode part.

Can you please give a little more detail about your solution?

Thanks.

Posted: Tue Apr 26, 2011 6:02 am
by samyamkrishna
If the data is coming from a sequential file in the order that you have mentioned, then in the Transformer check whether the current record is the same as the previous record (on the key columns).

If they are the same, drop the current record.

This way you will be able to retain records 2 and 5.

Re: Removing Duplicates: unique situation

Posted: Tue Apr 26, 2011 8:50 am
by paultechm
Sort the input data based on column B (the key) and enable the KeyChange column.

Use a Filter to split the input data based on the KeyChange value, '1' or '0'.

Left-join the '0' stream with the '1' stream based on column B and retrieve column A (the row number) from the '1' stream.

Filter the data based on Stream0.ColumnA - Stream1.ColumnA <> 1 and pass it into a Funnel.

Funnel the '1' stream in with this as well; this will give you the required output.

Note: in case column A is not a numeric value, create a dummy column with a numeric value.

Re: Removing Duplicates: unique situation

Posted: Tue Apr 26, 2011 9:26 am
by blewip
paultechm wrote: Sort the input data based on column B and enable the KeyChange column
This is far too complex, and we cannot sort the data for starters.

samyamkrishna is correct; you will need svOldCol1, svOldCol2, etc.

Should be fairly straightforward.

Posted: Tue Apr 26, 2011 11:25 pm
by VijayDS
A very simple solution to get only records 2 and 5:

You can do this with the Remove Duplicates stage. In one link use a Remove Duplicates stage to retain the first record, in the second link use a Remove Duplicates stage with the 'retain last record' option, and funnel the data from both links. Now you will get records 2 and 5.

I hope you understand the logic.

Posted: Wed Apr 27, 2011 3:50 am
by blewip
VijayDS wrote: I hope you understand the logic.
I understand your logic but it's flawed.

If a copy of record 5 appeared at record 10, you would want records 2, 5 and 10 to be kept. Your logic would keep only 2 and 10.
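
To make that concrete, here is a small stand-alone sketch in Python rather than DataStage (rows 6 through 9 and the repeated key at row 10 are invented for illustration):

# Keys (Col2|Col3|Col4) for rows 1..10; key 20|A|200 occurs at rows 2, 3, 5 and 10.
keys = {1: "10|A|100", 2: "20|A|200", 3: "20|A|200", 4: "30|B|300",
        5: "20|A|200", 6: "40|C|400", 7: "50|D|500", 8: "60|E|600",
        9: "70|F|700", 10: "20|A|200"}

# 'Retain first' plus 'retain last' per key, funnelled together.
first, last = {}, {}
for row, key in keys.items():
    first.setdefault(key, row)
    last[key] = row
print(sorted(set(first.values()) | set(last.values())))
# -> [1, 2, 4, 6, 7, 8, 9, 10]: row 5 is lost even though it must be kept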

Posted: Wed Apr 27, 2011 1:51 pm
by jwiles
blewip and krishna are correct in their suggestions. Do not sort the file. Read it with the Sequential File stage and follow that with a Transformer running in sequential mode. Use stage variables to detect a change in the key column values: if any of them changes, keep the current record; if none changes, drop the record. (For the comparison to work, the svPrevKey variables should be defined after svKeyChange in the stage variable list, so they still hold the previous record's values when svKeyChange is evaluated, and then be set to the current record's key columns.)

svKeyChange: if svInKey1 = svPrevKey1 and svInKey2 = svPrevKey2 and svInKey3 = svPrevKey3 then 0 else 1

In your output link constraint: svKeyChange = 1

You need to keep the data in one partition (sequential) until you have dropped the records, or you will not get the results you need.

Regards,
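
For illustration, here is a rough stand-alone sketch of this logic in Python rather than DataStage (the function and variable names are invented); it also shows why the svPrevKey variables must be updated only after svKeyChange has been evaluated:

# key_change() plays the role of svKeyChange; state["prev"] stands in for
# svPrevKey1..svPrevKey3. The comparison happens before the previous-key
# holder is overwritten, mirroring the stage variable evaluation order.
def key_change(record, state):
    key = (record["Col2"], record["Col3"], record["Col4"])
    change = 0 if key == state.get("prev") else 1  # the svKeyChange expression
    state["prev"] = key                            # update svPrevKey* last
    return change

records = [
    {"Col1": 1, "Col2": 10, "Col3": "A", "Col4": 100},
    {"Col1": 2, "Col2": 20, "Col3": "A", "Col4": 200},
    {"Col1": 3, "Col2": 20, "Col3": "A", "Col4": 200},
    {"Col1": 4, "Col2": 30, "Col3": "B", "Col4": 300},
    {"Col1": 5, "Col2": 20, "Col3": "A", "Col4": 200},
]

state = {}
# Output link constraint: svKeyChange = 1
kept = [r["Col1"] for r in records if key_change(r, state) == 1]
print(kept)  # [1, 2, 4, 5] -- row 3 is dropped, row 5 survives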

Posted: Wed Apr 27, 2011 1:56 pm
by gateleys
Assuming that your sequential file contains the rows in that order, in the Transformer create a stage variable that compares the previous value of the concatenated keys with the current value.
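
A brief sketch of that variant in Python rather than DataStage (the '|' separator is an assumption, added so that distinct keys cannot concatenate to the same string):

prev_key = None  # stage variable holding the previous concatenated key
for cols in [("1", "10", "A", "100"), ("2", "20", "A", "200"),
             ("3", "20", "A", "200"), ("4", "30", "B", "300"),
             ("5", "20", "A", "200")]:
    key = "|".join(cols[1:4])   # Col2, Col3 and Col4 concatenated
    if key != prev_key:
        print(cols)             # rows 1, 2, 4 and 5 pass the constraint
    prev_key = key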