To get unique records from a CSV file

suja.somu
Participant
Posts: 79
Joined: Thu Feb 07, 2013 10:51 pm

To get unique records from a CSV file

Post by suja.somu »

I have a requirement where a CSV file is the source and I need to get unique records based on a key column, then load them into a fixed-width target file. I have two approaches. Which of them works better in terms of logic and performance? Can anyone help me list the pros and cons of the two approaches below?

Approach 1:

On the target Sequential File stage, I set the collection method to Sort Merge on the key column, with Perform Sort enabled and the Unique option selected.

SEQ File stage -> TRANSFORMER -> SEQ File stage (fixed-width file)

Here I use the Transformer to implement a few business rules such as concatenation and hard-coded constants.
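For anyone trying to picture the record flow, here is a minimal sketch in plain Python (not anything DataStage generates) of what Approach 1 boils down to: gather the rows at the collector, sort them on the key column, keep only the first row per key, and write a fixed-width record with a Transformer-style derivation. The column names, field widths and the ACTIVE constant are made up for illustration.

Code:

import csv

def approach_1(csv_path, out_path):
    # Read the whole CSV (the Sort Merge collector sees every partition's rows).
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Perform Sort on the key column at the collector.
    rows.sort(key=lambda r: r["key"])

    seen = None
    with open(out_path, "w") as out:
        for r in rows:
            # The Unique option keeps only the first row for each key value.
            if r["key"] == seen:
                continue
            seen = r["key"]
            # Transformer-style derivation: concatenation plus a hard-coded constant.
            name = r["first_name"] + " " + r["last_name"]
            out.write(f"{r['key']:<10}{name:<30}{'ACTIVE':<8}\n")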


Approach 2:

I hash partition on the key column in the Remove Duplicates stage, with Perform Sort enabled on the Input tab.

SEQ File stage -> REMOVE DUPS -> TRANSFORMER -> SEQ File stage (fixed-width file)
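Same idea for Approach 2, again only a plain Python sketch rather than anything DataStage produces: hash partition the rows on the key, sort within each partition on the input link, drop duplicates keeping the first row per key, then apply the Transformer derivations. The partition count and column names are illustrative only.

Code:

import csv
from collections import defaultdict

def approach_2(csv_path, out_path, partitions=4):
    # Hash partition on the key column: equal keys land in the same partition.
    buckets = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            buckets[hash(row["key"]) % partitions].append(row)

    with open(out_path, "w") as out:
        for p in range(partitions):
            # Perform Sort on the input link, within each partition.
            part = sorted(buckets[p], key=lambda r: r["key"])
            prev = object()  # sentinel that never matches a real key
            for r in part:
                # Remove Duplicates: retain the first row per key.
                if r["key"] == prev:
                    continue
                prev = r["key"]
                # Transformer-style derivation, then the fixed-width write.
                name = r["first_name"] + " " + r["last_name"]
                out.write(f"{r['key']:<10}{name:<30}{'ACTIVE':<8}\n")

Because hash partitioning sends all rows with the same key to the same partition, the duplicates can be removed with only a per-partition sort instead of funnelling everything through a single collector, which is the usual performance argument for the second design.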
suja.somu
Participant
Posts: 79
Joined: Thu Feb 07, 2013 10:51 pm

Post by suja.somu »

Gurus,

Can anyone help me decide between these two approaches?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Not sure about "pros and cons", whatever works is a pro in a manner of speaking. Me, I'd prefer the second approach as it makes what you are doing more obvious. I'd probably add an explicit Sort stage as well.
-craig

"You can never have too many knives" -- Logan Nine Fingers