To Get the First Duplicate Record from HashFile Output

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
tombastian
Premium Member
Posts: 41
Joined: Fri Jun 04, 2004 5:52 am
Location: Bangalore

To Get the First Duplicate Record from HashFile Output

Post by tombastian »

Hi All,
I have a Hash File stage with a few duplicate-key records going in, and because of the way the HashFile stage works, the last duplicate written is the record I get as output. Is there a way to get the first record among the duplicates instead? At the moment I am using a Sort stage and a surrogate key to pick the first one in the output, but I would like to know whether there is a better option using some functionality of the Hashfile stage itself.


Input to Hash File

Col1 (Key Field in HF)  Col2  Col3
100                     ABC   C99
100                     RXZ   G77
100                     JKL   G77
115                     XYZ   R33

Normal Output

Col1 (Key Field in HF)  Col2  Col3
100                     JKL   G77
115                     XYZ   R33

Required Output

Col1 (Key Field in HF)  Col2  Col3
100                     ABC   C99
115                     XYZ   R33
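The behaviour Tom describes can be sketched with an ordinary Python dictionary (an analogy only, not DataStage code; the rows are the sample data above):

```python
# Sample input rows: (Col1, Col2, Col3); Col1 is the hash key.
rows = [
    ("100", "ABC", "C99"),
    ("100", "RXZ", "G77"),
    ("100", "JKL", "G77"),
    ("115", "XYZ", "R33"),
]

hashed = {}
for col1, col2, col3 in rows:
    # Like the Hashed File stage, each write destructively overwrites
    # any existing record with the same key, so the last duplicate wins.
    hashed[col1] = (col2, col3)

print(hashed)  # {'100': ('JKL', 'G77'), '115': ('XYZ', 'R33')}
```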

Thanks in Advance,
Tom.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If you want the 'first' rather than the 'last', sort the input on your key fields in descending rather than ascending order. Then all you'll have left in the hash when you are done are the first (lowest) values for any duplicate keys.
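A sketch of that idea in Python (an analogy, not DataStage code; it assumes an input-order sequence number as the tiebreaker, since the duplicates share the same key value):

```python
rows = [
    ("100", "ABC", "C99"),
    ("100", "RXZ", "G77"),
    ("100", "JKL", "G77"),
    ("115", "XYZ", "R33"),
]

# Number the rows in input order (the surrogate key Tom mentioned),
# then sort descending: the *first* input row for each key is now
# written *last*, so the destructive overwrite leaves it standing.
hashed = {}
for seq, (col1, col2, col3) in sorted(enumerate(rows), reverse=True):
    hashed[col1] = (col2, col3)

print(hashed)  # {'115': ('XYZ', 'R33'), '100': ('ABC', 'C99')}
```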
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

The Hash file key must be different from what you've stated, but the general command for the hash file SELECT would read

SELECT HF BY Col1 BREAK.ON Col1 DET.SUP
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Either sort data to be loaded into the hashed file in reverse order, as Craig suggested, or de-duplicate the data by other means before loading them into the hashed file.
All writes to hashed files via the Hashed File stage are destructive overwrites.
If you use a UV stage to insert rows you will achieve what you want, but generate warnings (row already exists) for each duplicate key value. The UV stage uses SQL.
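The insert-only behaviour Ray describes can be sketched as follows (a Python analogy, not the UV stage itself): an insert is rejected when the key already exists, so the first duplicate survives.

```python
rows = [
    ("100", "ABC", "C99"),
    ("100", "RXZ", "G77"),
    ("100", "JKL", "G77"),
    ("115", "XYZ", "R33"),
]

hashed = {}
for col1, col2, col3 in rows:
    if col1 in hashed:
        # Analogous to the SQL INSERT failing with "row already exists":
        # the duplicate is rejected and the first row is kept.
        continue
    hashed[col1] = (col2, col3)

print(hashed)  # {'100': ('ABC', 'C99'), '115': ('XYZ', 'R33')}
```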
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Pass the data through an Aggregator stage with the 'First' derivation for the non-key fields.
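In Python terms, taking 'First' per key group could be sketched with itertools.groupby (an analogy only; it assumes the input is already grouped on the key, as in the sample):

```python
from itertools import groupby

rows = [
    ("100", "ABC", "C99"),
    ("100", "RXZ", "G77"),
    ("100", "JKL", "G77"),
    ("115", "XYZ", "R33"),
]

# For each run of rows sharing Col1, keep only the first row --
# the equivalent of the Aggregator stage's 'First' derivation.
firsts = [next(group) for _, group in groupby(rows, key=lambda r: r[0])]

print(firsts)  # [('100', 'ABC', 'C99'), ('115', 'XYZ', 'R33')]
```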
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

Use "first" in the Aggregator stage to get the first duplicate record,
or "last" to get the last duplicate record.

Regards
Sreenivasulu
Post Reply