DSXchange

Posted: **Fri Feb 15, 2013 1:17 pm**

I have a requirement in which a CSV file is the source: I need to get unique records based on a key column and load them into a fixed-width target file. I have two approaches. Which of them works better in terms of logic and performance? Can anyone help me list the pros and cons of the two approaches below?

Approach 1:

In the target Sequential File stage, I have set the collection method to sort-merge (on the key column), with Perform sort enabled and the Unique property set.

SEQ File stage -> TRANSFORMER -> SEQ file (Fixed width file)

Here I use the Transformer to implement a few business rules, such as concatenation and hardcoding of constants.
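To make the logic of Approach 1 concrete, here is a rough Python sketch of what a sort-merge collection with a unique sort amounts to conceptually: each partition is sorted on the key, the sorted partitions are merged into one ordered stream, and only the first row per key is kept. This is only an illustration, not DataStage code; the function name `sort_merge_unique`, the sample rows, and the assumption that the key is the first column are all made up for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def sort_merge_unique(partitions, key_index=0):
    """Merge per-partition rows into one key-ordered stream, one row per key."""
    key = itemgetter(key_index)
    # Each partition must be sorted on the key before it can be merged.
    sorted_parts = [sorted(part, key=key) for part in partitions]
    # heapq.merge lazily combines already-sorted inputs into one sorted stream.
    merged = heapq.merge(*sorted_parts, key=key)
    # Rows with equal keys are now adjacent; keep the first row of each group.
    return [next(group) for _, group in groupby(merged, key=key)]


# Hypothetical data: two partitions, key in column 0, "A1" appears twice.
partitions = [
    [("B2", "beta"), ("A1", "alpha")],
    [("A1", "alpha-dup"), ("C3", "gamma")],
]
rows = sort_merge_unique(partitions)
```

Note that which of the duplicate rows survives depends on the order in which equal keys arrive at the merge, so if a particular record must be retained, the sort needs a secondary key.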

Approach 2:

I hash partition on the KEY column in the Remove Duplicates stage, with Perform sort enabled on the Input tab.

SEQ File stage -> REMOVE DUPS -> TRANSFORMER -> SEQ file (Fixed width file)
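For comparison, the flow in Approach 2 can be sketched in plain Python as follows: rows are hash partitioned on the key so that all duplicates of a key land in the same partition, each partition is sorted and deduplicated, and the surviving rows are then formatted as fixed-width lines. This is a minimal sketch under assumptions, not DataStage code: the column widths in `WIDTHS`, the sample CSV, the partition count, and the helper names `remove_duplicates` and `to_fixed_width` are hypothetical.

```python
import csv
import io
import zlib

WIDTHS = (6, 12)  # hypothetical fixed-width layout: key column, value column


def remove_duplicates(rows, key_index=0, n_parts=2):
    """Hash partition on the key, sort each partition, keep one row per key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        # Equal keys always hash to the same partition, so duplicates meet.
        parts[zlib.crc32(row[key_index].encode()) % n_parts].append(row)
    unique = []
    for part in parts:
        part.sort(key=lambda r: r[key_index])  # stable sort keeps input order per key
        last_key = object()  # sentinel that never equals a real key
        for row in part:
            if row[key_index] != last_key:
                unique.append(row)
                last_key = row[key_index]
    return unique


def to_fixed_width(row, widths=WIDTHS):
    """Pad (or truncate) each field to its column width."""
    return "".join(field[:w].ljust(w) for field, w in zip(row, widths))


# Hypothetical CSV source with a duplicate key "A1".
source = io.StringIO("A1,alpha\nB2,beta\nA1,alpha-dup\n")
rows = list(csv.reader(source))
lines = sorted(to_fixed_width(r) for r in remove_duplicates(rows))
```

The output is sorted at the end only to make the result deterministic; without a final ordered collection step, the order of rows across partitions is not guaranteed.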

Posted: **Mon Feb 18, 2013 1:57 pm**

Gurus,

Can anyone help me decide on one of these approaches?

Posted: **Mon Feb 18, 2013 2:06 pm**

Not sure about "pros and cons"; whatever works is a pro, in a manner of speaking. Me, I'd prefer the second approach, as it makes what you are doing more obvious. I'd probably add an explicit Sort stage as well.

To get unique records from a CSV file