Removing Duplicates only using transformer stage

Posted: Mon Oct 03, 2011 9:29 am
by pandeesh
Hi,

I have the following data in my source sequential file:

1
3
2
1
2
4

I need to remove the duplicates using only a Transformer stage:

Seq.file--->xfm--->seqfile

My target sequential file should contain below data:

1
3
2
4

Please help me achieve this.
It doesn't matter whether the job is server or parallel,
but I need to use only a Transformer stage to remove the duplicates.


Thanks

Posted: Mon Oct 03, 2011 11:07 am
by chulett
Server job with a hashed file that just stores the key or record number or whatever uniquely identifies a record. Do a lookup to the hashed file and if the key does not exist, write it to the hashed file and pass it out the output link. If you get a hit on the hashed file, do nothing, as in do not pass the record through nor update the hashed file.

Note that either the hashed file must not be cached or the cache must be "locked for updates". I prefer the former approach. Also note that the write to the hashed file must be in the same transformer that does the lookup to ensure the locks (if you take that approach) are handled appropriately.
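
The lookup-then-write logic described above can be modelled outside DataStage. The following is only a minimal Python sketch, not DataStage code: the set `seen` is an assumed stand-in for the uncached hashed file, and the function name `dedupe_with_lookup` is hypothetical.

```python
def dedupe_with_lookup(rows):
    """Pass each key through only the first time it is seen."""
    seen = set()    # stand-in for the hashed file that stores the keys
    output = []     # stand-in for the output link
    for key in rows:
        if key in seen:
            # Lookup hit: do not pass the row, do not update the hashed file.
            continue
        # Lookup miss: write the key to the hashed file and pass the row on.
        seen.add(key)
        output.append(key)
    return output


source = [1, 3, 2, 1, 2, 4]
result = dedupe_with_lookup(source)
print(result)
```

Because every write is visible to the very next lookup (the reason the hashed file must not be cached), the first occurrence of each key goes through and the original input order is preserved.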

Posted: Mon Oct 03, 2011 11:25 am
by pandeesh
Thanks, Craig!
Is there any way to achieve the same in a parallel job?

Thanks

Posted: Mon Oct 03, 2011 12:46 pm
by ray.wurlod
pandeesh wrote: But i need to use only transformer for removing duplicates.
Why?

Posted: Mon Oct 03, 2011 3:52 pm
by pandeesh
I would just like to know whether it's possible.

Posted: Mon Oct 03, 2011 5:12 pm
by SURA
In a server job it is easy; there is no special work needed on our side. In PX you need to sort the data and use stages such as Remove Duplicates, Transformer, etc.

DS User

Posted: Mon Oct 03, 2011 5:26 pm
by ray.wurlod
pandeesh wrote: Just I would like to know whether it's possible .
The answer to that is "yes". But why?

The philosophy of parallel jobs is basically one task, one stage type. That's why there are so many more stage types than server jobs have.

Posted: Mon Oct 03, 2011 6:06 pm
by chulett
I'm assuming this would be complicated on the PX side by the (apparent) need to retain the original input order... which leads me to think the dreaded "fork join" design would be appropriate in that case. Somehow. :wink:

Posted: Mon Oct 03, 2011 10:08 pm
by ray.wurlod
Not at all. Set up a stable, unique sort on the input link to the Transformer stage and map the columns across the stage.
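
As a rough illustration of what a stable, unique sort does, here is a Python sketch (not DataStage code). The `(key, payload)` records and the function name `stable_unique_sort` are hypothetical; the payload letters exist only to show which row survives for each key.

```python
def stable_unique_sort(records, key):
    """Sort by key, keeping only the first-arriving record for each key."""
    # Python's sorted() is stable: rows with equal keys keep arrival order.
    ordered = sorted(records, key=key)
    output = []
    for record in ordered:
        # Keep a record only if its key differs from the last one kept.
        if not output or key(output[-1]) != key(record):
            output.append(record)
    return output


records = [(1, "a"), (3, "b"), (2, "c"), (1, "d"), (2, "e"), (4, "f")]
result = stable_unique_sort(records, key=lambda r: r[0])
print(result)
```

In this model the surviving row for each key is the one that arrived first, but the rows leave in key order rather than in their original input order.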

Posted: Tue Oct 04, 2011 7:50 am
by vishal_rastogi
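Hi,
For a parallel job you can use stage variables. They are evaluated in the order listed, so var3 still holds the previous row's value when var2 is computed:
var1 = link1
var2 = If var1 = var3 Then 1 Else 0
var3 = link1
A constraint of var2 = 0 then passes only the first row of each value.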
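
The compare-with-previous-row pattern can be sketched in Python as follows. This is an illustrative model only: `previous` plays the role of var3, `is_duplicate` the role of var2, and the function name `dedupe_sorted` is hypothetical. It assumes the input has already been sorted on the key.

```python
def dedupe_sorted(rows):
    """Drop a row when it equals the row immediately before it."""
    previous = None   # role of var3: the previous row's value
    output = []
    for current in rows:                                  # var1 = link1
        is_duplicate = 1 if current == previous else 0    # var2
        previous = current                                # var3 = link1
        if is_duplicate == 0:                             # constraint: var2 = 0
            output.append(current)
    return output


source = [1, 3, 2, 1, 2, 4]
result = dedupe_sorted(sorted(source))
print(result)
```

Without the sort, the comparison only catches duplicates that happen to be adjacent, so the sample input would pass through unchanged.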

Posted: Tue Oct 04, 2011 7:54 am
by chulett
In either version you can use stage variables... as long as the input is sorted in a usable fashion.