Removing Duplicates : unique situation
I have gone through all relevant posts. I have data as follows:
Key: Col2, Col3, Col4
Col1,Col2,Col3,Col4,Col_Date
1,10,A,100,02/10/2011
2,20,A,200,03/10/2011
3,20,A,200,04/10/2011
4,30,B,300,05/10/2011
5,20,A,200,06/10/2011
As you can see, lines 2, 3 and 5 are duplicates on Col2, Col3 and Col4. I would like to remove line 3, which is a duplicate of line 2, but I would like to keep line 5 since it appears later (after line 4, which has a different key), even though its three key values are the same as those of lines 2 and 3.
1) I tried the Sort stage with the KeyChange column set to True. However, this sorts the data first, which defeats my purpose.
2) I also tried 'Don't Sort, Previously Sorted' on these 3 columns in the Sort stage, but that gives an error saying 'Record 4 not sorted'.
So what I need is the KeyChange feature of the Sort stage without actually doing the sort. Any ideas would be appreciated.
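For clarity, the behaviour being asked for is consecutive-duplicate removal (the semantics of the Unix uniq command) rather than global deduplication. A minimal Python sketch of the expected result, assuming the rows arrive in Col1 order (illustration only, not a DataStage solution):

    from itertools import groupby

    rows = [
        (1, 10, 'A', 100, '02/10/2011'),
        (2, 20, 'A', 200, '03/10/2011'),
        (3, 20, 'A', 200, '04/10/2011'),
        (4, 30, 'B', 300, '05/10/2011'),
        (5, 20, 'A', 200, '06/10/2011'),
    ]

    # Key is (Col2, Col3, Col4); keep only the first row of each consecutive run.
    kept = [next(group) for _, group in groupby(rows, key=lambda r: r[1:4])]
    print([r[0] for r in kept])  # [1, 2, 4, 5]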
Re: Removing Duplicates : unique situation
Hi ABC,
You can do this with the Remove Duplicates stage.
On one link use the 'retain first record' option and on the other link use 'retain last record'. Now you will get records 2 and 5.
Thanks
Vijay
Re: Removing Duplicates : unique situation
From your requirements I don't think any suggestion given so far will work.
I believe you want any duplicates that sit together to be removed, yet if the same values come along later (after a new key) the record should be kept.
First of all, the processing should all be done in sequential mode, or else partitioning could screw up the order of the records. This appears to be important.
Then use stage variables in a Transformer to hold the old values: if they all match, exclude the record; otherwise pass it through.
Modern Life is Rubbish - Blur
Re: Removing Duplicates : unique situation
Sort the input data based on column B and enable the key change column.
Use a Filter to split the input data based on the change value, '1' or '0'.
Left join the '0' stream with the '1' stream based on column B and retrieve column A from the '1' stream.
Filter the data on Stream0.ColumnA - Stream1.ColumnA <> 1 and pass it into a funnel.
Funnel the '1' stream in as well; this will give you the required output.
Note: in case column A is not a numeric value, create a dummy column with a numeric value.
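For anyone trying to follow the arithmetic, here is a rough Python translation of the pipeline above (hypothetical names; it assumes column A is a gap-free row number and "column B" stands for the composite key Col2/Col3/Col4). Note that it still sorts the data, which the original poster has ruled out:

    rows = [
        (1, 10, 'A', 100, '02/10/2011'),
        (2, 20, 'A', 200, '03/10/2011'),
        (3, 20, 'A', 200, '04/10/2011'),
        (4, 30, 'B', 300, '05/10/2011'),
        (5, 20, 'A', 200, '06/10/2011'),
    ]
    key = lambda r: r[1:4]                      # "column B": the composite key

    srt = sorted(rows, key=key)                 # the Sort stage (stable sort)
    firsts, rest = [], []                       # the '1' and '0' streams from the Filter
    prev = None
    for r in srt:                               # key change marks the first row per key
        (firsts if key(r) != prev else rest).append(r)
        prev = key(r)

    first_by_key = {key(r): r for r in firsts}  # the left join on the key
    # keep a '0'-stream row only if it is NOT the row right after its group's first row
    kept = [r for r in rest if r[0] - first_by_key[key(r)][0] != 1]

    out = sorted(firsts + kept)                 # the Funnel, restoring Col1 order
    print([r[0] for r in out])                  # [1, 2, 4, 5]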
Re: Removing Duplicates : unique situation
paultechm wrote: Sort the input data based on the column B and enable the key change column

This is far too complex, and we cannot sort the data for starters.
SamYamKrishna is correct; you will need svOldCol1, svOldCol2 etc.
Should be fairly straightforward.
Modern Life is Rubbish - Blur
A very simple solution for this to get only records 2 and 5.
You can do this with the Remove Duplicates stage. On one link use a Remove Duplicates stage to retain the first record, on the second link use a Remove Duplicates stage with the retain last record option, and funnel the data from both links. Now you will get records 2 and 5.
I hope you understand the logic.
Thanks
Vijay
blewip and krishna are correct in their suggestions. Do not sort the file. Read it with the sequential file operator and follow that with a Transformer running in sequential mode. Use stage variables to detect a change in the key column values: if any of them change, keep the current record; if none change, drop the record.
svKeyChange: if svInKey1 = svPrevKey1 and svInKey2 = svPrevKey2 and svInKey3 = svPrevKey3 then 0 else 1
In your output link constraint: svKeyChange = 1
You need to keep the data in one partition (sequential) until you have dropped the records or you will not get the results you need.
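Outside DataStage, the same logic is a single-pass loop. A small Python sketch for illustration: prev_key starts empty so the first record always passes, and, mirroring top-to-bottom stage variable evaluation, the comparison happens before the previous-key values are updated:

    def key_change_filter(records):
        """Yield records whose (Col2, Col3, Col4) key differs from the previous record's."""
        prev_key = None                         # plays the role of svPrevKey1..svPrevKey3
        for rec in records:
            cur_key = (rec[1], rec[2], rec[3])  # Col2, Col3, Col4
            sv_key_change = 0 if cur_key == prev_key else 1
            if sv_key_change == 1:              # the output link constraint
                yield rec
            prev_key = cur_key                  # updated after the comparison

    # Over the sample data, list(key_change_filter(rows)) keeps records 1, 2, 4 and 5.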
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.