To reject all my duplicates

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

suresh.narasimha
Premium Member
Posts: 81
Joined: Mon Nov 21, 2005 4:17 am
Location: Sydney, Australia

To reject all my duplicates

Post by suresh.narasimha »

Hi Everybody,

I have a Sequential File ===> XFM1 ====>XFM2 .....

I have a reject file on XFM1 and I'm picking up the first occurrence of each duplicate; the rest are rejected.

Now my requirement is to pick the first occurrence and reject all of the duplicates.

How can I do that?

Suppose I have data like this:
Col1 Col2
10 200
10 300
10 400

Now my output should have

Col1 Col2
10 200

and my reject file should have

Col1 Col2
10 200
10 300
10 400

Thanks In Advance,
Suresh N
SURESH NARASIMHA
suresh.narasimha
Premium Member
Posts: 81
Joined: Mon Nov 21, 2005 4:17 am
Location: Sydney, Australia

To reject all my duplicates

Post by suresh.narasimha »

Sorry, a small correction.

I have a Sequential File ===>AGG1 ====>XFM2 .....

I have a reject file on AGG1 and I'm picking up the first occurrence of each duplicate; the rest are rejected.

Now my requirement is to pick the first occurrence and reject all of the duplicates.

How can I do that?

Suppose I have data like this:
Col1 Col2
10 200
10 300
10 400

Now my output should have

Col1 Col2
10 200

and my reject file should have

Col1 Col2
10 200
10 300
10 400

Thanks ,
Suresh N
SURESH NARASIMHA
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

The first part is easy: pass it through the Aggregator, grouping on Col1, and provide 'First' as the derivation for Col2.
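To illustrate what that grouping does, here is a minimal sketch in plain Python (not DataStage code; the extra key 20 row is made-up sample data):

Code:

# Emulates "group on Col1, take First(Col2)", the way the Aggregator would.
rows = [(10, 200), (10, 300), (10, 400), (20, 500)]

first_per_key = {}
for col1, col2 in rows:
    # setdefault keeps only the first Col2 seen for each Col1.
    first_per_key.setdefault(col1, col2)

print(sorted(first_per_key.items()))   # [(10, 200), (20, 500)]
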
As for your second requirement, I have a follow-up question:
Do you want 10,200 (in your sample data) to be in your reject file as well?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
suresh.narasimha
Premium Member
Posts: 81
Joined: Mon Nov 21, 2005 4:17 am
Location: Sydney, Australia

To reject all my duplicates

Post by suresh.narasimha »

Yes Guru, you are correct. I need 10,200 from the sample data in the reject file.

Regards,
Suresh N
SURESH NARASIMHA
narasimha
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

Would you call it a reject file if you want 10,200 in it as well?
In that case your source and your reject file will have the same data every time.
Did I miss something here?
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
suresh.narasimha
Premium Member
Posts: 81
Joined: Mon Nov 21, 2005 4:17 am
Location: Sydney, Australia

To reject all my duplicates

Post by suresh.narasimha »

Hi Narasimha,

You are correct, in fact. But this is the requirement we need to meet.

Please give me an idea to get started.

Regards,
Suresh N
SURESH NARASIMHA
elavenil
Premium Member
Posts: 467
Joined: Thu Jan 31, 2002 10:20 pm
Location: Singapore

Post by elavenil »

Suresh,

If that is the requirement then, as Narasimha highlighted, there is no difference between the source and the reject file. The first row can be sent to the output from the Aggregator, and the source file itself can be presented as the reject records.

Regards
Elavenil
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I would use a Transformer stage to identify and remove duplicates from one output, and direct all input rows to another output (the "rejects"). This approach requires sorted input.
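As a rough illustration of that idea in plain Python (not a real Transformer; it assumes the input is already sorted on Col1):

Code:

# Sorted input: keep the first row per key on one output, send every row to the other.
rows = [(10, 200), (10, 300), (10, 400), (20, 500)]   # already sorted on Col1

deduped, rejects = [], []
prev_key = None
for col1, col2 in rows:
    if col1 != prev_key:               # first occurrence of this key
        deduped.append((col1, col2))
    rejects.append((col1, col2))       # every input row goes to the "rejects" output
    prev_key = col1

print(deduped)   # [(10, 200), (20, 500)]
print(rejects)   # all four input rows
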
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Do this.
Sort the incoming data on your key. Define two stage variables in the Transformer, say condFlag and prevVal. These will basically detect duplicates and flag them. Both will be initialized to 0. Their derivations will be as follows:

Code:

condFlag  | if (prevVal <> src.key) then 'X' else 'Y'
prevVal   | src.key
Have two links coming out of the Transformer, say Trg and buildHash. Trg will go to your flat file or database; buildHash will go to a hashed file keyed on your first column (the key).
Constraint for Trg: condFlag = 'X'
Constraint for buildHash: condFlag = 'Y'


In the same job, or maybe a second job, feed the same source file through again and do a lookup against this hashed file, keyed on your first column. Provide the constraint NOT(reflink.NOTFOUND), where reflink is your reference link name. The output of this second pass will give you your reject file, which will have all the records whose key has duplicates.
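A rough Python stand-in for this two-pass design (not DataStage code; the set dup_keys plays the role of the hashed file, and the key 20 row is made-up sample data):

Code:

# Pass 1: mirrors the Transformer with condFlag/prevVal plus the buildHash link.
rows = [(10, 200), (10, 300), (10, 400), (20, 500)]   # sorted on the key column

target, dup_keys = [], set()
prev_key = None
for col1, col2 in rows:
    if col1 != prev_key:            # condFlag = 'X': first occurrence -> Trg
        target.append((col1, col2))
    else:                           # condFlag = 'Y': duplicate -> buildHash
        dup_keys.add(col1)          # the hashed file only needs the key
    prev_key = col1

# Pass 2: mirrors the lookup job; NOT(reflink.NOTFOUND) == "key exists in the hashed file".
rejects = [(c1, c2) for c1, c2 in rows if c1 in dup_keys]

print(target)    # [(10, 200), (20, 500)]
print(rejects)   # [(10, 200), (10, 300), (10, 400)]

Note that the first occurrence 10,200 lands in both the target and the rejects, while the key with no duplicates never reaches the reject file, which is what was asked for.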
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.