To get unique records from a CSV file

suja.somu
Participant
Posts: 79
Joined: Thu Feb 07, 2013 10:51 pm

To get unique records from a CSV file

Post by suja.somu »

I have a requirement where a CSV file is the source and I need to get unique records based on a key column, then load them into a fixed-width target file. I have two approaches. Which of them works better in terms of logic and performance? Can anyone help me list the pros and cons of the two approaches below?

Approach 1:

On the target Sequential File stage, I set the collection method to Sort Merge on the key column, with Perform Sort enabled and the Unique option selected.

SEQ File stage -> TRANSFORMER -> SEQ File stage (fixed-width file)

Here I use the Transformer to implement a few business rules such as concatenation and hard-coded constants.
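For anyone trying to picture the record flow, here is a minimal sketch in plain Python (not anything DataStage generates) of what Approach 1 boils down to: gather the rows at the collector, sort them on the key column, keep only the first row per key, and write a fixed-width record with a Transformer-style derivation. The column names, field widths and the ACTIVE constant are made up for illustration.

Code:

import csv

def approach_1(csv_path, out_path):
    # Read the whole CSV (the Sort Merge collector sees every partition's rows).
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Perform Sort on the key column at the collector.
    rows.sort(key=lambda r: r["key"])

    seen = None
    with open(out_path, "w") as out:
        for r in rows:
            # The Unique option keeps only the first row for each key value.
            if r["key"] == seen:
                continue
            seen = r["key"]
            # Transformer-style derivation: concatenation plus a hard-coded constant.
            name = r["first_name"] + " " + r["last_name"]
            out.write(f"{r['key']:<10}{name:<30}{'ACTIVE':<8}\n")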


Approach 2:

I hash partition on the key column in the Remove Duplicates stage, with Perform Sort enabled on the Input tab.

SEQ File stage -> REMOVE DUPS -> TRANSFORMER -> SEQ File stage (fixed-width file)
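Same idea for Approach 2, again only a plain Python sketch rather than anything DataStage produces: hash partition the rows on the key, sort within each partition on the input link, drop duplicates keeping the first row per key, then apply the Transformer derivations. The partition count and column names are illustrative only.

Code:

import csv
from collections import defaultdict

def approach_2(csv_path, out_path, partitions=4):
    # Hash partition on the key column: equal keys land in the same partition.
    buckets = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            buckets[hash(row["key"]) % partitions].append(row)

    with open(out_path, "w") as out:
        for p in range(partitions):
            # Perform Sort on the input link, within each partition.
            part = sorted(buckets[p], key=lambda r: r["key"])
            prev = object()  # sentinel that never matches a real key
            for r in part:
                # Remove Duplicates: retain the first row per key.
                if r["key"] == prev:
                    continue
                prev = r["key"]
                # Transformer-style derivation, then the fixed-width write.
                name = r["first_name"] + " " + r["last_name"]
                out.write(f"{r['key']:<10}{name:<30}{'ACTIVE':<8}\n")

Because hash partitioning sends all rows with the same key to the same partition, the duplicates can be removed with only a per-partition sort instead of funnelling everything through a single collector, which is the usual performance argument for the second design.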
suja.somu
Participant
Posts: 79
Joined: Thu Feb 07, 2013 10:51 pm

Post by suja.somu »

Gurus,

Can anyone help me decide between these two approaches?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Not sure about "pros and cons", whatever works is a pro in a manner of speaking. Me, I'd prefer the second approach as it makes what you are doing more obvious. I'd probably add an explicit Sort stage as well.
-craig

"You can never have too many knives" -- Logan Nine Fingers