how we can remove duplicates in transformer stage

vamsipx
Participant
Posts: 3
Joined: Wed Aug 22, 2007 11:31 pm

how we can remove duplicates in transformer stage

Post by vamsipx »

hi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

There are no duplicates in a Transformer stage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vikasjawa
Participant
Posts: 13
Joined: Tue Aug 29, 2006 3:20 am
Location: Gurgaon

Re: how we can remove duplicates in transformer stage

Post by vikasjawa »

Hi,
You can remove duplicates in a Transformer stage. On the input of the Transformer stage, select Hash partitioning on the key you want to deduplicate on. For the comparison to catch every duplicate, the rows also need to be sorted on that key within each partition. Then write simple logic in the stage variables to check whether the previous row had the same key value as the current one, and use the result as a constraint.
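
To illustrate why the hash partitioning matters, here is a minimal Python sketch (not DataStage code) of the same idea. The function name `dedupe_partitioned`, the `partitions` parameter and the in-partition sort are assumptions made for this illustration: rows with equal keys land in the same partition, so a "compare with the previous row" check inside each partition sees every duplicate.

```python
from collections import defaultdict

def dedupe_partitioned(rows, key, partitions=2):
    """Hash-partition rows on `key`, sort each partition, keep the first row per key."""
    buckets = defaultdict(list)
    for row in rows:
        # Equal keys always hash to the same partition.
        buckets[hash(row[key]) % partitions].append(row)

    kept = []
    for part in buckets.values():
        part.sort(key=lambda r: r[key])  # stable sort: duplicates become adjacent
        prev = object()                  # sentinel that equals no real key
        for row in part:
            if row[key] != prev:         # the "constraint": key changed since last row
                kept.append(row)
            prev = row[key]
    return kept
```

Without the partitioning step, two rows with the same key could be processed on different nodes and both would pass the constraint.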
Vikas Jawa
sri.cbv
Participant
Posts: 6
Joined: Tue Nov 06, 2007 3:01 am
Location: chennai

Post by sri.cbv »

ray.wurlod wrote:There are no duplicates in a Transformer stage. ...
Hi,

You can remove duplicates in the Transformer stage using stage variables.
Define three stage variables called x, y and z. Assign x and z to zero, then use an If-Then-Else condition for B and compare and c.

Thanks
srinivas
SRINIVAS
saikir
Participant
Posts: 92
Joined: Wed Nov 08, 2006 12:25 am
Location: Minneapolis
Contact:

Post by saikir »

Hi,

You can remove duplicates in the Transformer using the routine RowProcCompareWithPreviousValue. Sort the data on the key column and pass the key column as the argument to the routine. It returns zero if the previous row has the same value as the current row.

However, this may be slower than other techniques for finding duplicates.

Sai
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

RowProcCompareWithPreviousValue in a parallel Transformer? Doubt it.

RowProcCompareWithPreviousValue is written in DataStage BASIC and relies on COMMON, which immediately invalidates it from use in a parallel job. See Chapter 2 of Parallel Job Developer's Guide for more information on the rules.
saikir
Participant
Posts: 92
Joined: Wed Nov 08, 2006 12:25 am
Location: Minneapolis
Contact:

Post by saikir »

Hi Ray,

Thanks for the correction. I missed that this is a parallel job, not a server job.

One small clarification: the documentation states that there is a BASIC Transformer stage in which you can use BASIC functions. Can I use the routine in that stage?

Sai
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I expect, from reading the rules, that the routine's use of COMMON variables would preclude it. Why not try it and let us know?
harsha_blm
Participant
Posts: 10
Joined: Tue Jun 19, 2007 1:16 am
Location: Bangalore

Re: how we can remove duplicates in transformer stage

Post by harsha_blm »

FYI..

Declare two stage variables in the Transformer, in this order:

Derivation       Stage Variable
-------------    --------------
svCurr           svPrev
Input.colname    svCurr

Then set the constraint svCurr <> svPrev on the output link.
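
The order of the two stage variables is what makes this work, and a small Python sketch (not DataStage code) may make that clearer. The function name `unique_rows` and the `_UNSET` initial value are assumptions for illustration; in a real job the initial values of the stage variables need to be something that cannot match a real key.

```python
_UNSET = object()  # stands in for an initial value that matches no real key

def unique_rows(values):
    """Emulate two stage variables evaluated top to bottom for every input row."""
    sv_curr = _UNSET
    sv_prev = _UNSET
    out = []
    for value in values:
        sv_prev = sv_curr      # evaluated first: still holds the previous row's key
        sv_curr = value        # evaluated second: now holds the current row's key
        if sv_curr != sv_prev: # output link constraint
            out.append(value)
    return out
```

For example, `unique_rows(["a", "a", "b"])` keeps `"a"` and `"b"`, while `unique_rows(["a", "b", "a"])` keeps all three rows, which is why the input has to be sorted on the key first.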
srimitta
Premium Member
Posts: 187
Joined: Sun Apr 04, 2004 7:50 pm

Post by srimitta »

Sort on the column or columns on which you want to identify duplicates.
Call the transform routine RowProcCompareWithPreviousValue.

Or

Create three stage variables; they need to be in this order:
StageVariable1 --> svCurr --> <Column Names>
StageVariable2 --> svUni --> If svPrev = svCurr Then 'D' Else 'U'
StageVariable3 --> svPrev --> <Column Names>

svCurr and svPrev need to have the SAME columns.
If you want to identify duplicates on more than one column, concatenate all the columns:
svCurr --> COL1 : COL2 : COL3
svUni --> If svPrev = svCurr Then 'D' Else 'U'
svPrev --> COL1 : COL2 : COL3

'D' = Duplicate
'U' = Unique

In your constraint, test svUni to define which rows you want on your output link.
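
As a rough illustration of the three-variable version, here is a Python sketch (not DataStage code) that flags each row of sorted input as 'D' or 'U'. The function name `flag_rows` and the `sep` separator are assumptions; the separator is added here only so that column values such as "ab" + "c" and "a" + "bc" do not produce the same concatenated key.

```python
def flag_rows(rows, key_cols, sep="|"):
    """Flag each row 'U' (unique) or 'D' (duplicate) against the previous row's key."""
    sv_prev = None  # no previous key before the first row
    flagged = []
    for row in rows:
        # svCurr: concatenation of the key columns for the current row
        sv_curr = sep.join(str(row[c]) for c in key_cols)
        # svUni: compared before svPrev is overwritten, so it sees the previous key
        sv_uni = "D" if sv_prev == sv_curr else "U"
        # svPrev: remembered for the next row
        sv_prev = sv_curr
        flagged.append((sv_uni, row))
    return flagged
```

A constraint equivalent to `svUni = 'U'` would then keep only the first row of each key group, and `svUni = 'D'` would route the duplicates to a separate link.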
Quality is never an accident; it is always the result of high intention, sincere effort, intelligent direction and skillful execution; it represents the wise choice of many alternatives.
By William A.Foster
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

On the transformer stage, on the Input tab, on the Partitioning tab, select Hash Partitioning method. Check Sort and Unique. It'll give you what you want.
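
Conceptually, the Sort plus Unique option amounts to sorting on the key and keeping one row per key. A minimal Python sketch of that effect follows (not DataStage code; the function name `sort_unique` is made up, and keeping the first row of each group is an assumption about which duplicate survives).

```python
from itertools import groupby
from operator import itemgetter

def sort_unique(rows, key):
    """Sort rows on `key` and keep the first row of each run of equal keys."""
    get_key = itemgetter(key)
    ordered = sorted(rows, key=get_key)  # stable sort keeps input order within a key
    return [next(group) for _, group in groupby(ordered, key=get_key)]
```

With this option no stage variables or constraints are needed, because the duplicates are already gone by the time rows reach the Transformer logic.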