Way to handle duplicate records

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Create an extra column (for example, called HowMany) on the output link from the Aggregator that performs the Count function on any non-grouped input column. Feed the output through a Transformer stage with at least two output links: one passing through all rows from the Aggregator, the other carrying only the duplicates.
The constraint expression on the duplicate-handling link is HowMany > 1.
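Not DataStage code, but the count-and-split logic described above can be sketched in Python (the column names and sample data here are hypothetical, just to illustrate the Aggregator/Transformer flow):

```python
from collections import Counter

# Hypothetical input rows; cust_id plays the role of the grouping column.
rows = [
    {"cust_id": 1, "amount": 10},
    {"cust_id": 2, "amount": 20},
    {"cust_id": 1, "amount": 15},
]

# Aggregator: count rows per group, like the HowMany output column.
how_many = Counter(r["cust_id"] for r in rows)

# Transformer: one link passes all rows (with HowMany attached),
# the other keeps only rows satisfying the constraint HowMany > 1.
all_rows = [dict(r, HowMany=how_many[r["cust_id"]]) for r in rows]
duplicates = [r for r in all_rows if r["HowMany"] > 1]
```

Note that with this approach every row of a duplicated key is flagged (both copies of cust_id 1 here), since HowMany > 1 holds for all of them.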

Note that if you pre-sort your text file on the grouping columns, the DataStage job will use far less memory and run faster.

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
raviyn
Participant
Posts: 57
Joined: Mon Dec 16, 2002 6:03 am

Post by raviyn »

Just to add one more way: presort the file and use a stage variable to compare the current row's key with the previous one. If they match, catch the row as a duplicate; if not, pass it on for further processing on the link.
But I wonder which one will be faster....
Any comments from anybody on best practices? [:D]
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The original poster seemed to want to use an Aggregator stage.

I believe the stage variable method would be faster; it would certainly be less memory-hungry. On the other hand, it does require that the input data be sorted, and we tend not to count the cost when the sort is performed before DataStage is started. [}:)]