Remove duplicate stage behaviour

Post questions here relating to DataStage Enterprise/PX Edition, covering areas such as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

dspxlearn
Premium Member
Posts: 291
Joined: Sat Sep 10, 2005 1:26 am

Remove duplicate stage behaviour

Post by dspxlearn »

Hi all,

My requirement is to remove duplicate records from the source (a sequential file), so my job is Sequential File --> Remove Duplicates stage --> Sequential File.
There are two columns, say col1 and col2. I set the 'Duplicate to retain' option to 'First'.
When I set the partition type to 'Same', the duplicates are removed and the output is also sorted on col1. But when I set the partition type to 'Random' or 'Range', the duplicate records are not removed and all the records come through...
Why is it happening like that... :(
Thanks and Regards!!
dspxlearn
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Clearly what is happening is that the different partitioning algorithm is able to direct duplicate rows to different processing nodes. When you specify "same", you use the (non-partitioned) sequential processing method - I'm making some assumptions about your Sequential File format here of course.

When you specify random or round robin you are probably sending any pair of adjacent lines from the file to different processing nodes. The Remove Duplicates stage on any one node processes only the rows that it receives on that node.
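
To see why, here is a minimal Python sketch (not DataStage itself; the two-node setup and the hash/random partitioners are illustrative assumptions) of per-node duplicate removal:

import random

rows = [(5, "gun1"), (2, "how"), (2, "how1"), (4, "you"), (4, "you1"),
        (1, "hi"), (3, "are"), (3, "are1"), (1, "hi1")]

def partition(rows, nodes, method):
    # Each row is routed to exactly one "processing node".
    parts = [[] for _ in range(nodes)]
    for row in rows:
        if method == "hash":      # same key value -> same node, always
            node = hash(row[0]) % nodes
        else:                     # "random": duplicates can be split up
            node = random.randrange(nodes)
        parts[node].append(row)
    return parts

def remove_duplicates(part):
    # Per-node dedup, 'Duplicate to retain' = First, after a key sort.
    seen, kept = set(), []
    for key, value in sorted(part):
        if key not in seen:
            seen.add(key)
            kept.append((key, value))
    return kept

for method in ("hash", "random"):
    merged = [row for part in partition(rows, 2, method)
              for row in remove_duplicates(part)]
    print(method, "->", sorted(merged))
# "hash" always yields one row per key; "random" can keep both copies of a
# key whenever its duplicates land on different nodes.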
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
In your post I couldn't find any indication that the data is sorted, and I suspect it is not, because Remove Duplicates requires sorted data.
The stage works fine if you use 'hash' or 'range' partitioning in the Sort stage and then 'same' partitioning in the Remove Duplicates stage, so that it follows the same partitioning as the sort.

As Ray explained, round robin or random definitely cannot work in this case.

regards
kumar
dspxlearn
Premium Member
Posts: 291
Joined: Sat Sep 10, 2005 1:26 am

Post by dspxlearn »

Hi,

"same'' type of partitioning means i believe..it performs no repartitioning.and since i am using a sequential file as my source i will not run parallely.
(i am using seq file-->remov duplicate-->seq file)
for Eg: my input data from seq file is

5,gun1
2,how
2,how1
4,you
4,you1
1,hi
3,are
3,are1
1,hi1
4,you2
3,cat
4,fan
2,pen
5,gun
1,lan
4,tin


The expected output would be:
5,gun1
2,how
4,you
1,hi
3,are

But the actual output (using the 'Same' partitioning method) is:
1,hi
2,how
3,are
4,you
5,gun1

It is also sorting the data... why is it working this way?
Thanks and Regards!!
dspxlearn
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There is a sort performed on the input link of the Remove Duplicates stage (unless the data are already sorted appropriately), because this means the stage consumes far less memory. As soon as any of the key columns changes value in sorted data, it is known that the previous value will not recur, and any memory it occupies can be flushed and freed.
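
A minimal Python sketch of this streaming behaviour (itertools.groupby stands in for the stage's internal logic here, which is an assumption):

from itertools import groupby

# Already sorted on the key column, as the input-link sort guarantees.
sorted_rows = [(1, "hi"), (1, "hi1"), (2, "how"), (2, "how1"), (3, "are")]

def dedup_sorted(rows, retain="first"):
    for key, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)   # only the current key's rows are ever in memory
        yield group[0] if retain == "first" else group[-1]
        # when the key changes, this group is dropped and its memory freed

print(list(dedup_sorted(sorted_rows)))
# [(1, 'hi'), (2, 'how'), (3, 'are')]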
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
I guess you misplaced the sort, or you might have enabled the collector method as 'sort' and are expecting the proper output :roll:

regards
kumar
dspxlearn
Premium Member
Posts: 291
Joined: Sat Sep 10, 2005 1:26 am

Post by dspxlearn »

hi ray.wurlod,

Your case works when you specify 'Same' as the partitioning method.
Does it mean that if two records with the same key column value come from the source, the Remove Duplicates stage will take the first one and simply ignore any further repetition of the same key value? This works fine with the 'Same' partitioning method, but it is not giving the expected result with the other partitioning methods...
If I am wrong, please clarify.

Kumar_s,
I haven't specified the collector method as sort, but I am still getting this behaviour. So, in a case where I have to sort the data before passing it to the next stage, instead of using a Sort stage I can use the Remove Duplicates stage with the partitioning method set to 'Same'. It will both remove the duplicates and sort the data...
Thanks and Regards!!
dspxlearn
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

By using a different partitioning method, you are re-partitioning the data. That is why you are having the problem.

This is what others are explaining in different words.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,
Is the presort option checked in the Remove Duplicates stage???

regards
kumar
KoningB
Participant
Posts: 7
Joined: Tue Mar 06, 2007 1:33 pm

Possible solution for small sized files

Post by KoningB »

I realize this is an old thread, but I just came across this problem and have a solution for it.

If you need to remove duplicates from a file but want to keep the original record order intact, just with the duplicates missing, here's what you do:

Take your file and first put it through a Transformer stage before it gets to the Remove Duplicates stage. So the stages should go:

:!: Seq->Trans->RemoveDup->Seq :!:

In the Transformer stage, add a column of type Integer with the derivation '@INROWNUM'. This will populate the column with the record number: record 1 gets a 1, and so on.

*NOTE* You must set both the file stage and the Transformer stage to sequential execution to keep the order of the records correct. Therefore this is not recommended for very large files. :idea:

Next, put the file through the Remove Duplicates stage just as you would originally, removing duplicates on your original column(s). Then, in the target file stage, sort on the column that was added in the Transformer. The result will be your original record order, with any duplicates removed.
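
A rough Python sketch of the same idea (@INROWNUM is simulated with enumerate, and the simple comma split on col1 is an illustrative assumption):

records = ["5,gun1", "2,how", "2,how1", "4,you", "4,you1",
           "1,hi", "3,are", "3,are1", "1,hi1"]

# Transformer: tag each record with its row number (simulating @INROWNUM).
tagged = [(i + 1, line) for i, line in enumerate(records)]

# Remove Duplicates: sort on the key (col1) and retain the first occurrence.
# Python's sort is stable, so "first" means first in the original file.
deduped, seen = [], set()
for rownum, line in sorted(tagged, key=lambda t: t[1].split(",")[0]):
    key = line.split(",")[0]
    if key not in seen:
        seen.add(key)
        deduped.append((rownum, line))

# Target stage: sort on the added column to restore the original order.
for rownum, line in sorted(deduped):
    print(line)
# 5,gun1 / 2,how / 4,you / 1,hi / 3,are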

Hope this helps!
The most likely way for the world to be destroyed, most experts agree, is by accident. That's where we come in; we're computer professionals. We cause accidents.

--Nathaniel Borenstein