Need to Delete duplicated records

Posted: Fri May 20, 2011 11:30 am
by rajudx
Hi,

We need to remove duplicate data based on one key column and a date value.

eid,date
23,2011-03-21,james
12,2011-04-10,mat
13,1982,03-03,Karth
23,2011-04-21,Maek
12,2011-04-23,Ojem
13-2011-05-13,Kim

We need to create a file containing, for each eid, the record with the max date.

Output.
--------
23,2011-04-21,Maek
12,2011-04-23,Ojem
13-2011-05-13,Kim

Could someone please help with how to get the records based on the max date?

Thanks.

Re: Need to Delete duplicated records

Posted: Fri May 20, 2011 11:35 am
by soumya5891
Perform the following process:
1. Sort on eid (ascending), then date (descending).
2. Remove duplicates on eid (keeping the first row of each group).

Hope this works.
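
For illustration only, the sort-then-deduplicate idea can be sketched outside DataStage in plain Python. This is not DataStage code: the function name latest_per_eid is made up for the example, and the sample rows assume the two malformed lines in the question were meant to read 13,1982-03-03,Karth and 13,2011-05-13,Kim.

```python
from datetime import date

# Sample rows from the question, assuming a clean (eid, date, name) layout.
rows = [
    (23, date(2011, 3, 21), "james"),
    (12, date(2011, 4, 10), "mat"),
    (13, date(1982, 3, 3), "Karth"),
    (23, date(2011, 4, 21), "Maek"),
    (12, date(2011, 4, 23), "Ojem"),
    (13, date(2011, 5, 13), "Kim"),
]


def latest_per_eid(records):
    """Sort by eid ascending and date descending, then keep the first row per eid."""
    # Python's sort is stable, so sort on the secondary key first (date, descending)
    # and then on the primary key (eid, ascending).
    ordered = sorted(records, key=lambda r: r[1], reverse=True)
    ordered = sorted(ordered, key=lambda r: r[0])

    result = []
    previous_eid = None
    for record in ordered:
        # The first row seen for each eid carries that eid's max date.
        if record[0] != previous_eid:
            result.append(record)
            previous_eid = record[0]
    return result


for eid, day, name in latest_per_eid(rows):
    print(f"{eid},{day.isoformat()},{name}")
```

The result comes out ordered by eid rather than in the order shown in the question, which is a side effect of sorting.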

Posted: Fri May 20, 2011 11:46 am
by rajudx
No, it's not working. The duplicate records are not being removed with this approach.

Posted: Fri May 20, 2011 11:52 am
by soumya5891
Did you partition the data properly?

Posted: Fri May 20, 2011 11:53 am
by DSguru2B
Is that the complete data you are working with? Three columns? If yes then group by on eid and name and take the max date.
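
As a rough sketch of the group-by-and-max idea in plain Python (not DataStage): the function name max_date_per_eid is hypothetical, the rows assume a clean eid,date,name layout, and the grouping here is on eid alone, because in the sample every row carries a different name.

```python
def max_date_per_eid(lines):
    """Return one (eid, date, name) tuple per eid: the row with the greatest date."""
    best = {}
    for line in lines:
        eid, day, name = line.strip().split(",")
        # ISO dates (YYYY-MM-DD) order correctly when compared as plain strings.
        if eid not in best or day > best[eid][1]:
            best[eid] = (eid, day, name)
    # Dicts keep insertion order, so eids appear in order of first occurrence.
    return list(best.values())


sample = [
    "23,2011-03-21,james",
    "12,2011-04-10,mat",
    "13,1982-03-03,Karth",
    "23,2011-04-21,Maek",
    "12,2011-04-23,Ojem",
    "13,2011-05-13,Kim",
]

for record in max_date_per_eid(sample):
    print(",".join(record))
```

Unlike a plain aggregate, this keeps the whole winning row, so the name that belongs to the max date comes along with it.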

Posted: Fri May 20, 2011 1:54 pm
by mobashshar
Do this:
1. Use the Remove Duplicates stage.
2. On the input to the Remove Duplicates stage, sort on eid and date, both ascending. Make sure you partition and sort on eid, and only sort on date.
3. Keep the last row.

You will get the desired result.
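
Why partitioning on eid matters can be shown with a small Python simulation. This is only a sketch, not how DataStage is implemented: hash_partition, keep_last_per_key and dedupe_partitioned are made-up names, and the rows assume a clean eid,date,name layout. Each partition is de-duplicated independently, so all rows for one eid must land in the same partition; otherwise every partition would emit its own "last" row for that eid.

```python
from collections import defaultdict

# (eid, ISO date, name); ISO date strings sort chronologically.
rows = [
    (23, "2011-03-21", "james"),
    (12, "2011-04-10", "mat"),
    (13, "1982-03-03", "Karth"),
    (23, "2011-04-21", "Maek"),
    (12, "2011-04-23", "Ojem"),
    (13, "2011-05-13", "Kim"),
]


def hash_partition(records, num_partitions):
    """Send every row with the same eid to the same partition."""
    partitions = defaultdict(list)
    for record in records:
        partitions[hash(record[0]) % num_partitions].append(record)
    return partitions


def keep_last_per_key(records):
    """Sort on eid and date ascending, then keep the last row of each eid."""
    last = {}
    for record in sorted(records, key=lambda r: (r[0], r[1])):
        last[record[0]] = record  # later (newer) rows overwrite earlier ones
    return list(last.values())


def dedupe_partitioned(records, num_partitions=2):
    """De-duplicate each partition on its own and collect the survivors."""
    survivors = []
    partitions = hash_partition(records, num_partitions)
    for partition_id in sorted(partitions):
        survivors.extend(keep_last_per_key(partitions[partition_id]))
    return survivors


for eid, day, name in dedupe_partitioned(rows):
    print(f"{eid},{day},{name}")
```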

Posted: Fri May 20, 2011 3:00 pm
by chulett
Are you looking for a Server or a PX solution? You've posted in the PX forum but marked your post as Server, hence the question.

Re: Need to Delete duplicated records

Posted: Sun May 22, 2011 12:17 am
by ds_dwh
I think the source will be like this:

eid,date,name
23,2011-03-21,james
12,2011-04-10,mat
13,1982-03-03,Karth
23,2011-04-21,Maek
12,2011-04-23,Ojem
13,2011-05-13,Kim

in this case:
Seqfile---->Sort------->RemoveDup---->Dataset

sort on Eid (descending)
Remove duplicate on Eid, duplicate retain = last

This will give the required output.


Ram..................