Remove duplicate stage behaviour

Post questions here relating to DataStage Enterprise/PX Edition, covering areas such as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

dspxlearn
Premium Member
Posts: 291
Joined: Sat Sep 10, 2005 1:26 am

Remove duplicate stage behaviour

Post by dspxlearn »

Hi all,

My requirement is to remove duplicate records from the source (a sequential file), so my job is Sequential File --> Remove Duplicates stage --> Sequential File.
There are two columns, say col1 and col2. I set the 'Duplicate to retain' option to 'First'.
When I set the partition type to 'Same', the duplicates are removed and the output is also sorted on col1. But when I set the partition type to 'Random' or 'Range', the duplicate records are not removed and all the records come through...
Why is it happening like that... :(
Thanks and Regards!!
dspxlearn
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Clearly what is happening is that the different partitioning algorithm is able to direct duplicate rows to different processing nodes. When you specify "same", you use the (non-partitioned) sequential processing method - I'm making some assumptions about your Sequential File format here of course.

When you specify random or round robin you are probably sending any pair of adjacent lines from the file to different processing nodes. The Remove Duplicates stage on any one node processes only the rows that it receives on that node.
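
To see why, here is a minimal Python sketch (not DataStage itself; the two-node setup and the hash/random partitioners are illustrative assumptions) of per-node duplicate removal:

import random

rows = [(5, "gun1"), (2, "how"), (2, "how1"), (4, "you"), (4, "you1"),
        (1, "hi"), (3, "are"), (3, "are1"), (1, "hi1")]

def partition(rows, nodes, method):
    # Each row is routed to exactly one "processing node".
    parts = [[] for _ in range(nodes)]
    for row in rows:
        if method == "hash":      # same key value -> same node, always
            node = hash(row[0]) % nodes
        else:                     # "random": duplicates can be split up
            node = random.randrange(nodes)
        parts[node].append(row)
    return parts

def remove_duplicates(part):
    # Per-node dedup, 'Duplicate to retain' = First, after a key sort.
    seen, kept = set(), []
    for key, value in sorted(part):
        if key not in seen:
            seen.add(key)
            kept.append((key, value))
    return kept

for method in ("hash", "random"):
    merged = [row for part in partition(rows, 2, method)
              for row in remove_duplicates(part)]
    print(method, "->", sorted(merged))
# "hash" always yields one row per key; "random" can keep both copies of a
# key whenever its duplicates land on different nodes.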
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
In your post I couldn't find any indication that the data is sorted, and I suspect it is not, because Remove Duplicates requires sorted data.
The stage works fine if you use 'hash' or 'range' partitioning in the Sort stage and then 'same' partitioning in the Remove Duplicates stage, so that it follows the same partitioning as the sort.

As Ray explained, round robin or random definitely cannot work in this case.

regards
kumar
dspxlearn
Premium Member
Posts: 291
Joined: Sat Sep 10, 2005 1:26 am

Post by dspxlearn »

Hi,

"same'' type of partitioning means i believe..it performs no repartitioning.and since i am using a sequential file as my source i will not run parallely.
(i am using seq file-->remov duplicate-->seq file)
for Eg: my input data from seq file is

5,gun1
2,how
2,how1
4,you
4,you1
1,hi
3,are
3,are1
1,hi1
4,you2
3,cat
4,fan
2,pen
5,gun
1,lan
4,tin


The expected output would be:
5,gun1
2,how
4,you
1,hi
3,are

But the actual output (using the 'Same' partitioning method) is:
1,hi
2,how
3,are
4,you
5,gun1

It is also sorting the data... why is it working this way?
Thanks and Regards!!
dspxlearn
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There is a sort performed on the input link of the Remove Duplicates stage (unless the data are already sorted appropriately), because this means the stage consumes far less memory. As soon as any of the key columns changes value in sorted data, it is known that the previous value will not recur, and any memory it occupies can be flushed and freed.
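
A minimal Python sketch of this streaming behaviour (itertools.groupby stands in for the stage's internal logic here, which is an assumption):

from itertools import groupby

# Already sorted on the key column, as the input-link sort guarantees.
sorted_rows = [(1, "hi"), (1, "hi1"), (2, "how"), (2, "how1"), (3, "are")]

def dedup_sorted(rows, retain="first"):
    for key, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)   # only the current key's rows are ever in memory
        yield group[0] if retain == "first" else group[-1]
        # when the key changes, this group is dropped and its memory freed

print(list(dedup_sorted(sorted_rows)))
# [(1, 'hi'), (2, 'how'), (3, 'are')]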
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
I guess you misplaced the sort, or you might have enabled the collector method as 'sort' and are expecting the proper output :roll:

regards
kumar
dspxlearn
Premium Member
Posts: 291
Joined: Sat Sep 10, 2005 1:26 am

Post by dspxlearn »

hi ray.wurlod,

Your case works when you specify 'Same' as the partitioning method.
Does it mean that if two records with the same key column value come from the source, the Remove Duplicates stage will take the first one and simply ignore any further repetition of the same key value? This works fine with the 'Same' partitioning method, but it is not giving the expected result with the other partitioning methods...
If I am wrong, please clarify.

Kumar_s,
I haven't specified the collector method as sort, but I am still getting this behaviour. So, in a case where I have to sort the data before passing it to the next stage, instead of using a Sort stage I can use the Remove Duplicates stage with the partitioning method set to 'Same'. It will both remove the duplicates and sort the data...
Thanks and Regards!!
dspxlearn
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

By using a different partitioning method, you are re-partitioning the data. That is why you are having the problem.

This is what others are explaining in different words.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
Is the presort option checked in the Remove Duplicates stage???

regards
kumar
KoningB
Participant
Posts: 7
Joined: Tue Mar 06, 2007 1:33 pm

Possible solution for small sized files

Post by KoningB »

I realize this is an old thread, but I just came across this problem and have a solution for it.

If you need to remove duplicates from a file but want to keep the original record order intact, just with the duplicates missing, here's what you do:

Take your file and first put it through a Transformer stage before it gets to the Remove Duplicates stage. So the stages should go:

:!: Seq->Trans->RemoveDup->Seq :!:

In the Transformer stage, add a column of type Integer with the derivation '@INROWNUM'. This will populate the column with the record number: record 1 gets a 1, and so on.

*NOTE* You must set both the file stage and the Transformer stage to sequential execution to keep the order of the records correct. Therefore this is not recommended for very large files. :idea:

Next, put the file through the Remove Duplicates stage just as you would originally, removing duplicates on your original column(s). Then, in the target file stage, sort on the column that was added in the Transformer. The result will be your original record order, with any duplicates removed.
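
A rough Python sketch of the same idea (@INROWNUM is simulated with enumerate, and the simple comma split on col1 is an illustrative assumption):

records = ["5,gun1", "2,how", "2,how1", "4,you", "4,you1",
           "1,hi", "3,are", "3,are1", "1,hi1"]

# Transformer: tag each record with its row number (simulating @INROWNUM).
tagged = [(i + 1, line) for i, line in enumerate(records)]

# Remove Duplicates: sort on the key (col1) and retain the first occurrence.
# Python's sort is stable, so "first" means first in the original file.
deduped, seen = [], set()
for rownum, line in sorted(tagged, key=lambda t: t[1].split(",")[0]):
    key = line.split(",")[0]
    if key not in seen:
        seen.add(key)
        deduped.append((rownum, line))

# Target stage: sort on the added column to restore the original order.
for rownum, line in sorted(deduped):
    print(line)
# 5,gun1 / 2,how / 4,you / 1,hi / 3,are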

Hope this helps!
The most likely way for the world to be destroyed, most experts agree, is by accident. That's where we come in; we're computer professionals. We cause accidents.

--Nathaniel Borenstein