Way to handle duplicate records

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Create an extra column (for example, called HowMany) on the output link from the Aggregator that performs the Count function on any non-grouped input column. Feed the output through a Transformer stage with at least two output links: one passing through all rows from the Aggregator, the other carrying only the duplicates.
The constraint expression on the duplicate-handling link is HowMany > 1.
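Not DataStage code, but the count-and-split logic described above can be sketched in Python (the column names and sample data here are hypothetical, just to illustrate the Aggregator/Transformer flow):

```python
from collections import Counter

# Hypothetical input rows; cust_id plays the role of the grouping column.
rows = [
    {"cust_id": 1, "amount": 10},
    {"cust_id": 2, "amount": 20},
    {"cust_id": 1, "amount": 15},
]

# Aggregator: count rows per group, like the HowMany output column.
how_many = Counter(r["cust_id"] for r in rows)

# Transformer: one link passes all rows (with HowMany attached),
# the other keeps only rows satisfying the constraint HowMany > 1.
all_rows = [dict(r, HowMany=how_many[r["cust_id"]]) for r in rows]
duplicates = [r for r in all_rows if r["HowMany"] > 1]
```

Note that with this approach every row of a duplicated key is flagged (both copies of cust_id 1 here), since HowMany > 1 holds for all of them.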

Note that if you pre-sort your text file on the grouping columns, the DataStage job will use far less memory and run faster.

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
raviyn
Participant
Posts: 57
Joined: Mon Dec 16, 2002 6:03 am

Post by raviyn »

Just to add one more way: presort the file and use a stage variable to compare the current row's key with the previous one. If they match, catch the row as a duplicate; if not, pass it on for further processing on the link.
But I wonder which one will be faster....
Any comments from anybody on best practices? [:D]
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The original poster seemed to want to use an Aggregator stage.

I believe the stage variable method would be faster; it would certainly be less memory-hungry. On the other hand, it does require that the input data be sorted, and we tend not to count the cost when the sort is performed before DataStage is started. [}:)]