Hi,
I have a file where I need to retain only 1 record out of 2 records where 'Col1/Party1 of record1' is equal to 'Col2/Party2 of other record'. These records may not come in sequence.
Sample:
-----------
Column Party1 Party2
--------------------------------------
Record1 --> 100 200
Record2 --> 200 100
I need to retain only 1 record (either 'Record1' or 'Record2'). There is no other common key in between these records and these records are not ordered in the file.
Thanks,
How to retain ONE record after comparing 2 different columns
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 41
- Joined: Wed Oct 08, 2008 9:19 am
is the data really big? If not, can you copy the input and then do a join stage? I think that might work, what you wold do is something like
read file
copy stage
modify left, add party1 as new field 'key'
modify right, add party2 as new field 'key'
join stage, on key
if the data is too big for this, there may be a way to do it with less brute force.
don't forget to hash partition on 'key'.
read file
copy stage
modify left, add party1 as new field 'key'
modify right, add party2 as new field 'key'
join stage, on key
if the data is too big for this, there may be a way to do it with less brute force.
don't forget to hash partition on 'key'.
-
- Participant
- Posts: 41
- Joined: Wed Oct 08, 2008 9:19 am
-
- Participant
- Posts: 41
- Joined: Wed Oct 08, 2008 9:19 am
After certain tests, got the solution.
I compared the 'Col1' and 'Col2' (either it is number or string) and created a new KEY column.
If Col1<= Col2 then Key=Col1:Col2
else Key=Col2:Col1
This will give same key for both of the records and then de-duplicate the record based on this 'Key' column.
Thanks for support.
I compared the 'Col1' and 'Col2' (either it is number or string) and created a new KEY column.
If Col1<= Col2 then Key=Col1:Col2
else Key=Col2:Col1
This will give same key for both of the records and then de-duplicate the record based on this 'Key' column.
Thanks for support.
Rohit
Your solution should work based on your sample data.
In case your data comes in a different way, like below, where 200=200 but the other values differ, then it would not remove the duplicates.
Column Party1 Party2
--------------------------------------
Record1 --> 100 200
Record2 --> 200 300
Have you tested this scenario as well, or is there no way the data would be like this?
In case your data comes in a different way, like below, where 200=200 but the other values differ, then it would not remove the duplicates.
Column Party1 Party2
--------------------------------------
Record1 --> 100 200
Record2 --> 200 300
Have you tested this scenario as well, or is there no way the data would be like this?
Choose a job you love, and you will never have to work a day in your life. - Confucius
you can do something snarky like make the key A*B (200*300 is the same as 300*200) or sort the 2 keys (stage variables can do this for 2 values really simple) and put them in the same order every time (so that 200,300 is 200,300 no matter how they came into the functions). Multiply has a risk of other ways to get the same product (1, 60000 here) but there is probably some nifty math way that will work for your data.