Efficient Way to remove Duplicates and store them

gsym
Charter Member
Posts: 118
Joined: Thu Feb 02, 2006 3:05 pm

Efficient Way to remove Duplicates and store them

Post by gsym »

Hi,

Can anyone let me know the most efficient way to remove duplicates and store them?

I can achieve this by using a Transformer with a stage variable, or by using a Sort stage with the key change column.

The Transformer has to run sequentially, and sorting takes a long time.

Is there any other way?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

There are only two basic methods of finding duplicates: either sort the data on the column in question and compare adjacent rows, or keep everything read so far in memory and compare each incoming row against it (i.e. an array or table scan). Either method may be quicker and more efficient depending on how well the data is already sorted, the amount of duplication, and the number of records.
Sorting algorithms are generally efficient, and sorted data has other benefits, so I tend to prefer sorting and using a transform stage variable.
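For what it's worth, here is a rough sketch of the two approaches in plain Python (not DataStage; the row structure and key name are invented just for illustration), with the unique rows and the duplicates kept on separate outputs:

def dedup_by_sort(rows, key):
    # Sort on the key, then compare each row with the previous one.
    uniques, duplicates = [], []
    prev = object()                               # sentinel that matches nothing
    for row in sorted(rows, key=lambda r: r[key]):
        (duplicates if row[key] == prev else uniques).append(row)
        prev = row[key]
    return uniques, duplicates

def dedup_in_memory(rows, key):
    # Keep every key seen so far in memory and test each incoming row against it.
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        (duplicates if row[key] in seen else uniques).append(row)
        seen.add(row[key])
    return uniques, duplicates

The first is what the Sort plus stage variable design does; the second trades memory for avoiding the sort.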
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I prefer a fork-join design, where one leg of the fork counts the number of rows associated with each key value; this count is joined back to the original rows and used to filter on count = 1 or otherwise.

Possibly easier is to generate a Key Change column in the Sort stage; duplicates have this set to 0.
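Purely to show the logic (a plain-Python sketch, not DataStage; the key name is made up), the two ideas look roughly like this:

from collections import Counter

def fork_join_split(rows, key):
    # One leg of the fork counts rows per key value; the count is then
    # "joined" back to each original row and used as the filter condition.
    counts = Counter(row[key] for row in rows)
    uniques = [r for r in rows if counts[r[key]] == 1]
    duplicates = [r for r in rows if counts[r[key]] > 1]
    return uniques, duplicates

def add_key_change(sorted_rows, key):
    # Mimics the Sort stage's key change column: 1 on the first row of
    # each key group, 0 on the duplicates that follow.
    out, prev = [], object()
    for row in sorted_rows:
        out.append({**row, "keyChange": 0 if row[key] == prev else 1})
        prev = row[key]
    return out

Note that the counting approach puts every row of a duplicated key on the duplicates output, whereas the key change approach keeps the first row of each group and marks only the rows that follow it.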
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
gsym
Charter Member
Posts: 118
Joined: Thu Feb 02, 2006 3:05 pm

Post by gsym »

Hey Ray,

Thanks for the response. Sometimes when I use a fork-join, the job just runs for a long time and never finishes. But I think I can implement it as a two-step process with temporary storage in a Dataset.

Even so, I would still need an Aggregator and a Filter/Transformer to implement this. When the volume is huge, won't the Aggregator be time consuming?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

When the volume is huge ANYTHING will be time consuming.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sima79
Premium Member
Posts: 38
Joined: Mon Jul 16, 2007 8:12 am
Location: Melbourne, Australia

Post by sima79 »

I would avoid using a fork join; it is possible that your job will wait forever. See the Advanced Developer's Guide, Chapter 4 (Link Buffering):

Deadlock situations can occur where you have a fork-join in your job. This is where a stage has two output links whose data paths are joined together later in the job. The situation can arise where all the stages in the flow are waiting for each other to read or write, so none of them can proceed

I have seen smaller jobs finish OK with a fork join, as DataStage inserts buffers when a fork join is used. If you want to go with the fork-join design, you may need to alter the buffer handling for the input links on the Join stage.
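To make the deadlock concrete, here is a plain-Python sketch (threads and small bounded queues standing in for stages and link buffers; everything here is invented for illustration) of how a fork-join can wedge itself once the buffers fill:

import threading
import queue

BUF = 4      # small bounded buffer, standing in for a link buffer
ROWS = 100   # more rows than the buffers can absorb

q1 = queue.Queue(maxsize=BUF)   # fork output link 1
q2 = queue.Queue(maxsize=BUF)   # fork output link 2

def fork_stage():
    # The forking stage writes every row down both links.
    for i in range(ROWS):
        q1.put(i)               # blocks once link 1's buffer is full
        q2.put(i)               # blocks once link 2's buffer is full

def join_stage():
    # A join that insists on draining link 1 before reading link 2.
    left = [q1.get() for _ in range(ROWS)]   # stalls: upstream is stuck on link 2
    right = [q2.get() for _ in range(ROWS)]
    print("joined", len(left), "and", len(right), "rows")

producer = threading.Thread(target=fork_stage, daemon=True)
consumer = threading.Thread(target=join_stage, daemon=True)
producer.start(); consumer.start()
producer.join(timeout=2); consumer.join(timeout=2)
if producer.is_alive() or consumer.is_alive():
    print("deadlock: the fork is blocked writing link 2, the join is blocked reading link 1")

Bigger buffers only move the threshold, which is why the buffer settings on the Join stage's input links matter for this design.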