Removing Duplicates

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
haryedru
Participant
Posts: 4
Joined: Sat Nov 19, 2011 2:25 am
Location: India

Removing Duplicates

Post by haryedru »

Hi this is my input
ID,Term
1,CONTRACT_NUMBER
1,CONTRACT_NUMBER
1,CONTRACT_NUMBER
2,Account
3,Account_Currency_Code
3,Account_Currency_Code
4,Account_Analyst

This should be my output

ID,TERM
1,CONTRACT_NUMBER
1,CONTRACT_NUMBER
1,CONTRACT_NUMBER
3,Account_Currency_Code
3,Account_Currency_Code

If a record is repeating i need that in the output.
Can anyone help me on this. I tried using stage variables in transformer but i am getting 1 record less like this

ID,TERM
1,CONTRACT_NUMBER
1,CONTRACT_NUMBER
3,Account_Currency_Code

thanks in advance
Hari
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Show us what you tried. This is readily do-able; particularly easily if the data have been processed through a Sort stage that generates a Key Change column (assuming you're using a parallel job, which is not certain). You can, of course, do the key change test using stage variables.

Use more stage variables to detect that the key change value is 0 and the other field is the same as that in the previous row.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
uegodawa
Participant
Posts: 71
Joined: Thu Apr 27, 2006 12:46 pm

Post by uegodawa »

You can sort the data and passed it to a Aggregator stage by grouping ID and TERM then defined a calculated field count. Next use a transformer stage [Constraint] to pass the records where Count() > 1
Thanks,
Upul
haryedru
Participant
Posts: 4
Joined: Sat Nov 19, 2011 2:25 am
Location: India

Post by haryedru »

Hi Upul and Ray sorry for giving the requirment wrong.

this is my input
ID,Term
1,CONTRACT_NUMBER
2,CONTRACT_NUMBER
3,CONTRACT_NUMBER
4,Account
5,Account_Currency_Code
6,Account_Currency_Code
7,Account_Analyst

This should be my output

ID,TERM
1,CONTRACT_NUMBER
2,CONTRACT_NUMBER
3,CONTRACT_NUMBER
5,Account_Currency_Code
6,Account_Currency_Code

@ray to your question in transformer i used
currentkey=inputlink.col1
previouskey=currentkey
stagevar3: (currentkey=previouskey)
Hari
manoj_23sakthi
Participant
Posts: 47
Joined: Tue Feb 23, 2010 12:16 am
Location: CHENNAI

Post by manoj_23sakthi »

Hi,

If your input is data base u can achive this in query:
with (Term1,Term2 ) as (select Term Term1 ,count (Term ) Term2 from Table group by Term2 > 1)

select ID,Term from table where Term = Term1

Else..
follow uegodawa idea...
Regards
Manoj kumar
chandra.shekhar@tcs.com
Premium Member
Premium Member
Posts: 353
Joined: Mon Jan 17, 2011 5:03 am
Location: Mumbai, India

Post by chandra.shekhar@tcs.com »

Use these variables in the Tfr
input.Col = svOldKey
If svOldKey = svNewKey then svCount1 = svCount1+1 else 1 = svCount1
svOldKey = svNewKey

And in the output use constraint as sv2 > 1
Thats it.
Thanx and Regards,
ETL User
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

haryedru wrote:Hi Upul and Ray sorry for giving the requirment wrong.
So this means a simple transformer stage variable approach cannot be used. You'll need to fall back on the dreaded "fork join" design with the aggregator brach getting counts per key and then when joining back together only passing records with keys that have a count > 1.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ShaneMuir
Premium Member
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Could you do it using a sort stage and key change column?

Sort the data coming in the required way, by the looks of it, by ID and TERM.

Set a key change column to happen on the TERM only.

On the output of the sort stage, sort the data again, but sort the keyChange column ascending also.

So at this point the data would look like

Code: Select all

ID,Term, keyChange 
3,CONTRACT_NUMBER,0 
2,CONTRACT_NUMBER,0
1,CONTRACT_NUMBER,1 
4,Account,1 
6,Account_Currency_Code,0 
5,Account_Currency_Code,1 
7,Account_Analyst,1 
In a transformer, have a couple of stage variables, one to hold the current term, and one to hold the previous term, and another to determine whether to write the record.

Code: Select all

CurrentTermSV = Input.Term
WriteRecordSV = If keyChange = 0 then @TRUE else if currentTermSV=PrevTermSV then @ TRUE else @FALSE
PrevTermSV = Input.Term

Set the output constraint to 'WriteRecordSV'
haryedru
Participant
Posts: 4
Joined: Sat Nov 19, 2011 2:25 am
Location: India

Post by haryedru »

chulett wrote:
haryedru wrote:Hi Upul and Ray sorry for giving the requirment wrong.
So this means a simple transformer stage variable approach cannot be used. You'll need to fall back on the dreaded "fork join" design with the aggregator brach getting counts per key and then when joining back together only passing records with keys that have a count > 1.
Hi Chulett

It worked thanks for the help...
Hari
Post Reply