To get unique records from a CSV file

Posted: Fri Feb 15, 2013 1:17 pm
by suja.somu
I have a requirement where a CSV file is the source, and I need to get unique records based on a key column and load them into a fixed-width target file. I have two approaches. Which of them is better in terms of logic and performance? Can anyone help me list the pros and cons of each?

Approach 1:

In the target Sequential File stage, I have set the collection method to sort-merge (on the key column), with perform sort enabled and the unique property set.

SEQ File stage -> TRANSFORMER -> SEQ file (Fixed width file)

Here I use the Transformer to implement a few business rules, such as concatenation and hardcoding of constants.
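For readers who want to see the logic of Approach 1 outside DataStage, here is a minimal Python sketch: apply the rules first, then sort on the key and keep one record per key, then write fixed-width lines. It is only an illustration, not how the stages are implemented. The column names (`id`, `first`, `last`), the concatenation rule, the constant `"CSV"` and the field widths are all made-up assumptions.

```python
import csv
import io

def apply_rules(row):
    # Hypothetical business rules: concatenate two columns, hardcode a constant.
    return {
        "id": row["id"],
        "full_name": row["first"] + " " + row["last"],
        "source": "CSV",
    }

def sort_unique(rows, key):
    # Sort on the key, then keep only the first record seen for each key value.
    kept = []
    for row in sorted(rows, key=lambda r: r[key]):
        if not kept or kept[-1][key] != row[key]:
            kept.append(row)
    return kept

def to_fixed_width(row, layout):
    # layout is a list of (column, width); values are truncated or space-padded.
    return "".join(str(row[col])[:width].ljust(width) for col, width in layout)

# Assumed sample input; a real job would read the source CSV file instead.
SAMPLE = "id,first,last\n2,Bob,Ray\n1,Ann,Lee\n2,Rob,Ray\n"
LAYOUT = [("id", 4), ("full_name", 10), ("source", 3)]

rows = [apply_rules(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
lines = [to_fixed_width(r, LAYOUT) for r in sort_unique(rows, "id")]
```

Python's sort is stable, so the first duplicate in input order is the one kept here. In a real job, which duplicate survives depends on how the sort is configured.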


Approach 2:

I hash partition on the key column in the Remove Duplicates stage, with perform sort enabled on its Input tab.

SEQ File stage -> REMOVE DUPS -> TRANSFORMER -> SEQ file (Fixed width file)
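The reason hash partitioning on the key matters for Approach 2 can be sketched in a few lines of Python: all records with the same key land in the same partition, so removing duplicates inside each partition is enough to remove them overall. This is a rough illustration under assumed sample data, not the engine's actual partitioner. The `crc32` hash, the partition count and the function names are hypothetical choices.

```python
import zlib

def hash_partition(rows, key, num_partitions):
    # Records with equal key values always map to the same partition index.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions

def remove_duplicates(partition, key):
    # Sort within the partition, then keep the first record for each key.
    kept = []
    for row in sorted(partition, key=lambda r: r[key]):
        if not kept or kept[-1][key] != row[key]:
            kept.append(row)
    return kept

# Assumed sample records standing in for the parsed CSV rows.
rows = [
    {"id": "2", "name": "bob"},
    {"id": "1", "name": "alice"},
    {"id": "2", "name": "bobby"},
    {"id": "3", "name": "carol"},
    {"id": "1", "name": "al"},
]

partitions = hash_partition(rows, "id", 3)
unique_rows = [row for part in partitions for row in remove_duplicates(part, "id")]
```

If the data were partitioned some other way (round robin, for example), duplicates of the same key could sit in different partitions and survive the per-partition step.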

Posted: Mon Feb 18, 2013 1:57 pm
by suja.somu
Gurus,

Can anyone help me decide on one of these approaches?

Posted: Mon Feb 18, 2013 2:06 pm
by chulett
Not sure about "pros and cons"; whatever works is a pro, in a manner of speaking. Me, I'd prefer the second approach, as it makes what you are doing more obvious. I'd probably add an explicit Sort stage as well.