Efficient Way to remove Duplicates and store them

gsym
Charter Member
Posts: 118
Joined: Thu Feb 02, 2006 3:05 pm

Efficient Way to remove Duplicates and store them

Post by gsym »

Hi,

Can anyone let me know the most efficient way to remove duplicates and store them?

I can achieve this by using a Transformer with a stage variable, or by using a Sort stage with the key change column.

The Transformer has to run sequentially, and sorting takes a long time.

Is there any other way?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

There are only two basic methods of finding duplicates: either sort the data on the column in question and compare adjacent rows, or keep everything read so far in memory and compare each incoming row against it (i.e. an array or table scan). Either method may be quicker and more efficient depending on how well the data is already sorted, the amount of duplication, and the number of records.
Sorting algorithms are generally efficient, and sorted data has other benefits, so I tend to prefer sorting and using a transform stage variable.
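For what it's worth, here is a rough sketch of the two approaches in plain Python (not DataStage; the row structure and key name are invented just for illustration), with the unique rows and the duplicates kept on separate outputs:

def dedup_by_sort(rows, key):
    # Sort on the key, then compare each row with the previous one.
    uniques, duplicates = [], []
    prev = object()                               # sentinel that matches nothing
    for row in sorted(rows, key=lambda r: r[key]):
        (duplicates if row[key] == prev else uniques).append(row)
        prev = row[key]
    return uniques, duplicates

def dedup_in_memory(rows, key):
    # Keep every key seen so far in memory and test each incoming row against it.
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        (duplicates if row[key] in seen else uniques).append(row)
        seen.add(row[key])
    return uniques, duplicates

The first is what the Sort plus stage variable design does; the second trades memory for avoiding the sort.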
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I prefer a fork-join design, where one leg of the fork counts the number of rows associated with each key value; this count is joined back to the original rows and used to filter on count = 1 or otherwise.

Possibly easier is to generate a Key Change column in the Sort stage; duplicates have this set to 0.
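Purely to show the logic (a plain-Python sketch, not DataStage; the key name is made up), the two ideas look roughly like this:

from collections import Counter

def fork_join_split(rows, key):
    # One leg of the fork counts rows per key value; the count is then
    # "joined" back to each original row and used as the filter condition.
    counts = Counter(row[key] for row in rows)
    uniques = [r for r in rows if counts[r[key]] == 1]
    duplicates = [r for r in rows if counts[r[key]] > 1]
    return uniques, duplicates

def add_key_change(sorted_rows, key):
    # Mimics the Sort stage's key change column: 1 on the first row of
    # each key group, 0 on the duplicates that follow.
    out, prev = [], object()
    for row in sorted_rows:
        out.append({**row, "keyChange": 0 if row[key] == prev else 1})
        prev = row[key]
    return out

Note that the counting approach puts every row of a duplicated key on the duplicates output, whereas the key change approach keeps the first row of each group and marks only the rows that follow it.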
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
gsym
Charter Member
Posts: 118
Joined: Thu Feb 02, 2006 3:05 pm

Post by gsym »

Hey Ray,

Thanks for the response. Sometimes when I use a fork-join, the job just runs for a long time and never finishes. But I think I can implement it as a two-step process with temporary storage in a Dataset.

Even so, I would still need an Aggregator and a Filter/Transformer to implement this. When the volume is huge, won't the Aggregator be time consuming?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

When the volume is huge ANYTHING will be time consuming.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sima79
Premium Member
Posts: 38
Joined: Mon Jul 16, 2007 8:12 am
Location: Melbourne, Australia

Post by sima79 »

I would avoid using a fork join; it is possible that your job will wait forever. See the Advanced Developer's Guide, Chapter 4 (Link Buffering):

Deadlock situations can occur where you have a fork-join in your job. This is where a stage has two output links whose data paths are joined together later in the job. The situation can arise where all the stages in the flow are waiting for each other to read or write, so none of them can proceed

I have seen smaller jobs finish OK with a fork join, as DataStage inserts buffers when a fork join is used. If you want to go with the fork-join design, you may need to alter the buffer handling for the input links on the Join stage.
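To make the deadlock concrete, here is a plain-Python sketch (threads and small bounded queues standing in for stages and link buffers; everything here is invented for illustration) of how a fork-join can wedge itself once the buffers fill:

import threading
import queue

BUF = 4      # small bounded buffer, standing in for a link buffer
ROWS = 100   # more rows than the buffers can absorb

q1 = queue.Queue(maxsize=BUF)   # fork output link 1
q2 = queue.Queue(maxsize=BUF)   # fork output link 2

def fork_stage():
    # The forking stage writes every row down both links.
    for i in range(ROWS):
        q1.put(i)               # blocks once link 1's buffer is full
        q2.put(i)               # blocks once link 2's buffer is full

def join_stage():
    # A join that insists on draining link 1 before reading link 2.
    left = [q1.get() for _ in range(ROWS)]   # stalls: upstream is stuck on link 2
    right = [q2.get() for _ in range(ROWS)]
    print("joined", len(left), "and", len(right), "rows")

producer = threading.Thread(target=fork_stage, daemon=True)
consumer = threading.Thread(target=join_stage, daemon=True)
producer.start(); consumer.start()
producer.join(timeout=2); consumer.join(timeout=2)
if producer.is_alive() or consumer.is_alive():
    print("deadlock: the fork is blocked writing link 2, the join is blocked reading link 1")

Bigger buffers only move the threshold, which is why the buffer settings on the Join stage's input links matter for this design.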