Removing Duplicates

haryedru · Post by **haryedru** » Tue Nov 22, 2011 12:19 pm

Hi this is my input
ID,Term
1,CONTRACT_NUMBER
1,CONTRACT_NUMBER
1,CONTRACT_NUMBER
2,Account
3,Account_Currency_Code
3,Account_Currency_Code
4,Account_Analyst

This should be my output

ID,TERM
1,CONTRACT_NUMBER
1,CONTRACT_NUMBER
1,CONTRACT_NUMBER
3,Account_Currency_Code
3,Account_Currency_Code

If a record is repeating i need that in the output.
Can anyone help me on this. I tried using stage variables in transformer but i am getting 1 record less like this

ID,TERM
1,CONTRACT_NUMBER
1,CONTRACT_NUMBER
3,Account_Currency_Code

thanks in advance

ray.wurlod · Post by **ray.wurlod** » Tue Nov 22, 2011 2:18 pm

Show us what you tried. This is readily do-able; particularly easily if the data have been processed through a Sort stage that generates a Key Change column (assuming you're using a parallel job, which is not certain). You can, of course, do the key change test using stage variables.

Use more stage variables to detect that the key change value is 0 and the other field is the same as that in the previous row.

uegodawa · Post by **uegodawa** » Tue Nov 22, 2011 2:31 pm

You can sort the data and passed it to a Aggregator stage by grouping ID and TERM then defined a calculated field count. Next use a transformer stage [Constraint] to pass the records where Count() > 1

haryedru · Post by **haryedru** » Tue Nov 22, 2011 11:49 pm

Hi Upul and Ray sorry for giving the requirment wrong.

this is my input
ID,Term
1,CONTRACT_NUMBER
2,CONTRACT_NUMBER
3,CONTRACT_NUMBER
4,Account
5,Account_Currency_Code
6,Account_Currency_Code
7,Account_Analyst

This should be my output

ID,TERM
1,CONTRACT_NUMBER
2,CONTRACT_NUMBER
3,CONTRACT_NUMBER
5,Account_Currency_Code
6,Account_Currency_Code

@ray to your question in transformer i used
currentkey=inputlink.col1
previouskey=currentkey
stagevar3: (currentkey=previouskey)

manoj_23sakthi · Post by **manoj_23sakthi** » Wed Nov 23, 2011 2:15 am

Hi,

If your input is data base u can achive this in query:
with (Term1,Term2 ) as (select Term Term1 ,count (Term ) Term2 from Table group by Term2 > 1)

select ID,Term from table where Term = Term1

Else..
follow uegodawa idea...
Regards
Manoj kumar

chandra.shekhar@tcs.com · Wed Nov 23, 2011 4:41 am

Use these variables in the Tfr
input.Col = svOldKey
If svOldKey = svNewKey then svCount1 = svCount1+1 else 1 = svCount1
svOldKey = svNewKey

And in the output use constraint as sv2 > 1
Thats it.

chulett · Post by **chulett** » Wed Nov 23, 2011 8:08 am

haryedru wrote:Hi Upul and Ray sorry for giving the requirment wrong.

So this means a simple transformer stage variable approach cannot be used. You'll need to fall back on the dreaded "fork join" design with the aggregator brach getting counts per key and then when joining back together only passing records with keys that have a count > 1.

ShaneMuir · Post by **ShaneMuir** » Wed Nov 23, 2011 10:08 am

Could you do it using a sort stage and key change column?

Sort the data coming in the required way, by the looks of it, by ID and TERM.

Set a key change column to happen on the TERM only.

On the output of the sort stage, sort the data again, but sort the keyChange column ascending also.

So at this point the data would look like

Code: Select all

ID,Term, keyChange 
3,CONTRACT_NUMBER,0 
2,CONTRACT_NUMBER,0
1,CONTRACT_NUMBER,1 
4,Account,1 
6,Account_Currency_Code,0 
5,Account_Currency_Code,1 
7,Account_Analyst,1

In a transformer, have a couple of stage variables, one to hold the current term, and one to hold the previous term, and another to determine whether to write the record.

Code: Select all

CurrentTermSV = Input.Term
WriteRecordSV = If keyChange = 0 then @TRUE else if currentTermSV=PrevTermSV then @ TRUE else @FALSE
PrevTermSV = Input.Term

Set the output constraint to 'WriteRecordSV'

haryedru · Post by **haryedru** » Thu Nov 24, 2011 4:37 am

chulett wrote:
haryedru wrote:Hi Upul and Ray sorry for giving the requirment wrong.
So this means a simple transformer stage variable approach cannot be used. You'll need to fall back on the dreaded "fork join" design with the aggregator brach getting counts per key and then when joining back together only passing records with keys that have a count > 1.

Hi Chulett

It worked thanks for the help...