How to retain ONE record after comparing 2 different columns

rohit_mca2003 · Post by **rohit_mca2003** » Wed Jan 30, 2019 7:55 pm

Hi,
I have a file where I need to retain only 1 record out of 2 records where 'Col1/Party1 of record1' is equal to 'Col2/Party2 of other record'. These records may not come in sequence.

Sample:
-----------
Column Party1 Party2
--------------------------------------
Record1 --> 100 200
Record2 --> 200 100

I need to retain only 1 record (either 'Record1' or 'Record2'). There is no other common key in between these records and these records are not ordered in the file.

Thanks,

UCDI · Post by **UCDI** » Thu Jan 31, 2019 3:52 am

is the data really big? If not, can you copy the input and then do a join stage? I think that might work, what you wold do is something like

read file
copy stage
modify left, add party1 as new field 'key'
modify right, add party2 as new field 'key'
join stage, on key

if the data is too big for this, there may be a way to do it with less brute force.

don't forget to hash partition on 'key'.

rohit_mca2003 · Post by **rohit_mca2003** » Thu Jan 31, 2019 3:34 pm

Thanks for the reply. I thought about this approach but it may not work. As what ever join will happen with Record1 (party1) & Record2 (party2), same join will happen with 'Record2 (party1) and Record1 (Party2).
So we will end up with 2 records again.

Thanks,

rohit_mca2003 · Post by **rohit_mca2003** » Thu Jan 31, 2019 7:58 pm

After certain tests, got the solution.
I compared the 'Col1' and 'Col2' (either it is number or string) and created a new KEY column.

If Col1<= Col2 then Key=Col1:Col2
else Key=Col2:Col1

This will give same key for both of the records and then de-duplicate the record based on this 'Key' column.

Thanks for support.

qt_ky · Post by **qt_ky** » Fri Feb 01, 2019 1:46 am

Your solution should work based on your sample data.

In case your data comes in a different way, like below, where 200=200 but the other values differ, then it would not remove the duplicates.

Column Party1 Party2
--------------------------------------
Record1 --> 100 200
Record2 --> 200 300

Have you tested this scenario as well, or is there no way the data would be like this?

UCDI · Post by **UCDI** » Wed Feb 06, 2019 8:15 am

you can do something snarky like make the key A*B (200*300 is the same as 300*200) or sort the 2 keys (stage variables can do this for 2 values really simple) and put them in the same order every time (so that 200,300 is 200,300 no matter how they came into the functions). Multiply has a risk of other ways to get the same product (1, 60000 here) but there is probably some nifty math way that will work for your data.