Removing Duplicates only using transformer stage

pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

Hi,

I have the following data in my source sequential file:

1
3
2
1
2
4

I need to remove the duplicates using only a Transformer stage:

Seq.file--->xfm--->seqfile

My target sequential file should contain the following data:

1
3
2
4

Please help me achieve this.
It doesn't matter whether the job is server or parallel,
but I need to use only a Transformer stage to remove the duplicates.


Thanks
pandeeswaran
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Server job with a hashed file that just stores the key or record number or whatever uniquely identifies a record. Do a lookup to the hashed file and if the key does not exist, write it to the hashed file and pass it out the output link. If you get a hit on the hashed file, do nothing, as in do not pass the record through nor update the hashed file.

Note that either the hashed file must not be cached or the cache must be "locked for updates". I prefer the former approach. Also note that the write to the hashed file must be in the same transformer that does the lookup to ensure the locks (if you take that approach) are handled appropriately.
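
For readers who want to see the lookup-then-write logic outside DataStage, here is a minimal sketch in Python. It is not DataStage code: the `set` only stands in for the uncached hashed file, the list stands in for the output link, and the name `dedupe_with_lookup` is made up for illustration.

```python
def dedupe_with_lookup(rows):
    """Keep the first occurrence of each key, preserving arrival order."""
    seen = set()    # stands in for the uncached hashed file of keys (assumption)
    output = []     # stands in for the Transformer's output link (assumption)
    for key in rows:
        if key in seen:
            # Lookup hit: do not pass the row on, do not touch the key store.
            continue
        # Lookup miss: record the key, then pass the row through.
        seen.add(key)
        output.append(key)
    return output


if __name__ == "__main__":
    print(dedupe_with_lookup([1, 3, 2, 1, 2, 4]))  # [1, 3, 2, 4]
```

Because each key is written to the store in the same pass that looks it up, a later duplicate always sees the earlier write, which is the role the "not cached" requirement plays in the server job.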
-craig

"You can never have too many knives" -- Logan Nine Fingers
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

Thanks, Craig!
Is there any way to achieve the same in a parallel job?

Thanks
pandeeswaran
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

pandeesh wrote: But i need to use only transformer for removing duplicates.
Why?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

I would just like to know whether it's possible.
pandeeswaran
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

In a server job this is easy, with no special work needed on our side, whereas in PX you need to sort the data and use stages such as Remove Duplicates, Transformer, etc.

DS User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

pandeesh wrote: Just I would like to know whether it's possible.
The answer to that is "yes". But why?

The philosophy of parallel jobs is basically one task, one stage type. That's why there are so many more stage types than server jobs have.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'm assuming this would be complicated on the PX side by the (apparent) need to retain the original input order... which leads me to think the dreaded "fork join" design would be appropriate in that case. Somehow. :wink:
-craig
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Not at all. Set up a stable, unique sort on the input link to the Transformer stage and map the columns across the stage.
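
As a rough illustration of what a stable, unique sort does to the rows, here is a small Python sketch. It is not DataStage code: the function name `stable_unique_sort` and the `(key, payload)` row layout are assumptions made for the example, and the Transformer itself is assumed to map the columns straight across.

```python
def stable_unique_sort(rows, key=lambda row: row):
    """Sort rows by key (stable), then keep only the first row of each key."""
    # Python's sorted() is stable: rows with equal keys keep their input order.
    ordered = sorted(rows, key=key)
    output = []
    for row in ordered:
        # Keep a row only when its key differs from the last row kept.
        if not output or key(output[-1]) != key(row):
            output.append(row)
    return output


if __name__ == "__main__":
    # Hypothetical rows: (key, payload). The payload shows which duplicate survives.
    rows = [(1, "a"), (3, "b"), (2, "c"), (1, "d"), (2, "e"), (4, "f")]
    print(stable_unique_sort(rows, key=lambda row: row[0]))
    # [(1, 'a'), (2, 'c'), (3, 'b'), (4, 'f')]
```

In this sketch the stability of the sort is what guarantees that the first-arriving row of each key is the one kept. The rows come out in key order, not in their original arrival order.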
vishal_rastogi
Participant
Posts: 47
Joined: Thu Dec 09, 2010 4:37 am

Post by vishal_rastogi »

Hi,
For a parallel job you can use stage variables:
var1 = link1
var2 = If var1 = var3 Then 1 Else 0
var3 = link1
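
The stage variables above can be sketched in Python to show why the order matters. Stage variables are evaluated top to bottom, so var3 still holds the previous row's value when var2 is computed. This is only an emulation under stated assumptions: the input is already sorted on the key, the output link has a constraint of var2 = 0 (not stated in the post), and the function name `transformer` is made up.

```python
def transformer(sorted_rows):
    """Emulate the three stage variables, one pass per input row."""
    var3 = None   # previous row's key; assumes no real key equals None
    output = []
    for link1 in sorted_rows:
        var1 = link1                       # var1 = link1
        var2 = 1 if var1 == var3 else 0    # compared BEFORE var3 is refreshed
        var3 = link1                       # var3 = link1, now the "previous" key
        if var2 == 0:                      # assumed output-link constraint
            output.append(link1)
    return output


if __name__ == "__main__":
    print(transformer(sorted([1, 3, 2, 1, 2, 4])))  # [1, 2, 3, 4]
```

Only adjacent rows are compared, so on unsorted input such as 1, 3, 2, 1, 2, 4 every row would pass the constraint and no duplicates would be removed.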
Vish
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

In either version you can use stage variables... as long as the input is sorted in a usable fashion.
-craig