Handling Duplicate Data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
dsadm_ws
Premium Member
Premium Member
Posts: 65
Joined: Mon Apr 12, 2010 10:30 am

Handling Duplicate Data

Post by dsadm_ws »

Hi All,

I am trying to handle Duplicates with Sort Stage using Create Key Change Column = True.

However, I have some issues:
1) Say I have 103 Columns out of which 6 are duplicate, and I want to Capture these 6 records.

Instead, with Sort stage, it retains the first column and discards the other.

2) Also, I want to retain only those records with the following condition:

a. I have 7 columns out of which 2 are key columns and 5 are value columns.

Emp_id Dept_id Qtr YTD One_YY Three_YY Five_YY Ten_YY
12 20 123.44 -99.00 -99.00 -99.00 125.55 -99.00
12 20 -99.00 -99.00 145.55 567.88 012.45 -99.00

The above is some sample duplicates data.

Requirement is to retain only the record with less number of -99.00 values and discard the others. How can I attain this.
Any Ideas!!

Regards.
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Re: Handling Duplicate Data

Post by jwiles »

Use a transformer to count the number of -99.00 columns and add that count as an output column, then sort using the count as the last sort key column (don't partition on it tho). If you sort in ascending order your desired record will be the first within a key group.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
dsadm_ws
Premium Member
Premium Member
Posts: 65
Joined: Mon Apr 12, 2010 10:30 am

Post by dsadm_ws »

Thank You. I usd the stage variables to get the count and Used the sort Stage to extract the least count value.

I really appreciate your help.
Post Reply