How to separate the duplicates and the rest into different files


hemanth12
Participant
Posts: 6
Joined: Tue Feb 01, 2011 11:45 pm

How to separate the duplicates and the rest into different files

Post by hemanth12 »

How can I separate the duplicate rows and the remaining rows into different files? This must be done using only a Transformer stage, without any partitioning and without a Sort or Remove Duplicates stage.

Thanks in advance.
DATASTAGE DEVELOPER
veerabusani185512
Participant
Posts: 11
Joined: Fri Jan 30, 2009 3:21 am

Post by veerabusani185512 »

If your source is a Sequential File stage, then on the Properties tab, in the Filter option, try using sort -u #FileNamePath#. That will pass through only one copy of each distinct record from the Sequential File stage.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I don't answer interview questions. The correct answer DOES include the stage types you insist on excluding.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Argh... why in the heck was this set up as a poll? Please don't do that as that's not something I can undo. :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

I, however, can! Poll deleted...
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

The correct answer is:

1) Sort the data being sent to the transformer on the input link (no separate sort stage required) by all the required keys. Hash partition on at least the major key to ensure that if there are duplicates, they will end up in the same partition. If you aren't allowed to partition (STUPID REQUIREMENT) then set the stage to operate in sequential mode instead of parallel mode.
2) Set up an integer stage variable called "svIsDuplicate" and initialize it to 0 (False).
3) Set up stage variables to hold each of your keys, initialized to "".
4) Stage variables are processed in order from top to bottom. So first determine whether the incoming row's keys all match the keys from the previous record that you are currently storing in the stage variables. If they all match, the row is a duplicate, so set svIsDuplicate to 1; otherwise set it to 0.
5) Then reset all the saved variables holding your keys to hold the keys for the record you just read.
6) Have two separate, identical output links, and add constraints to them so that one link gets output when svIsDuplicate is 0 (not a duplicate) and the other gets output when svIsDuplicate is 1 (duplicate).
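
For anyone who finds it easier to follow as code, here is a rough sketch of the logic in steps 1 to 6, written in Python rather than as Transformer derivations. It is not DataStage syntax. The function name split_duplicates, the key_of callback and the sample rows are made up for illustration, and the saved key starts as None instead of "" so that a first row with empty keys is not mistaken for a duplicate.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple  # a row is just a tuple of column values in this sketch


def split_duplicates(
    sorted_rows: Iterable[Row],
    key_of: Callable[[Row], Tuple],
) -> Tuple[List[Row], List[Row]]:
    """Split rows that are ALREADY sorted by key into (first seen, duplicates).

    Mirrors the stage-variable approach: compare each row's keys with the
    keys saved from the previous row, then overwrite the saved keys.
    """
    first_seen: List[Row] = []   # rows for the link with constraint svIsDuplicate = 0
    duplicates: List[Row] = []   # rows for the link with constraint svIsDuplicate = 1
    prev_key = None              # saved keys (assumption: None rather than "")

    for row in sorted_rows:
        key = key_of(row)
        # Step 4: evaluate the flag first, against the previous row's keys.
        sv_is_duplicate = 1 if key == prev_key else 0
        # Step 5: only then save the current row's keys for the next row.
        prev_key = key
        # Step 6: route the row according to the flag.
        if sv_is_duplicate == 0:
            first_seen.append(row)
        else:
            duplicates.append(row)

    return first_seen, duplicates


if __name__ == "__main__":
    rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]
    # Step 1: sort on the key column before the comparison logic sees the rows.
    rows.sort(key=lambda r: r[0])
    first_seen, duplicates = split_duplicates(rows, key_of=lambda r: (r[0],))
    print(first_seen)   # [('a', 1), ('b', 2), ('c', 4)]
    print(duplicates)   # [('a', 3), ('a', 6), ('b', 5)]
```

Note that with this approach the first row of each key group goes to the "not a duplicate" link and only the later repeats go to the duplicate link. The order of evaluation matters: if the saved keys were updated before the comparison, every row would match itself and be flagged as a duplicate.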
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

asorrell wrote: I, however can! Poll deleted...
You're not supposed to delete knucklehead polls: you're supposed to add amusing options to give this otherwise useless interview question thread a little value. :lol:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Agreed
:lol:
:twisted:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.