Stripping Rows from a Sequential File

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
JezT
Participant
Posts: 75
Joined: Wed Sep 15, 2004 1:52 am

Stripping Rows from a Sequential File

Post by JezT »

I have a file of about 10,000 rows in with 3 columns (IP_ID, CDU_CIN and PTYP_SVAL). Most of the rows have unique IP_ID's (which is the Key field) but there are some that have duplicates.

I am wanting to strip out these duplicates and to leave the occurence of the IP_ID which has the highest value in the PTYP_SVAL. For example, there are two rows in the file as follows.

Code: Select all

IP_ID, CDU_CIN, PTYP_SVAL
2, 5, 14
2, 5, 15
In this example, I would want to remove the first row and keep the second instance.

If it makes it easier, the data is being extracted from a DB2 table (SQL below). Is it possible to add something to the SQL to not select the unwanted rows ?

Code: Select all

SELECT DISTINCT
 DWH_IP_XREF.IP_ID, 
 DWH_IP_XREF.CDU_CIN, 
 DWH_PTYP_STAT_INFO.PTYP_SVAL  
FROM 
 #pUKDWSCHEMA#.DWH_IP_XREF AS DWH_IP_XREF 
INNER JOIN
  #pUKDWSCHEMA#.DWH_IP_ARRG_RELN AS DWH_IP_ARRG_RELN 
ON  DWH_IP_XREF.IP_ID = DWH_IP_ARRG_RELN.IP_ID 
INNER JOIN
  #pUKDWSCHEMA#.DWH_PRAR AS DWH_PRAR 
ON DWH_IP_ARRG_RELN.ARRG_ID = DWH_PRAR.ARRG_ID 
INNER JOIN
  #pUKDWSCHEMA#.DWH_PTYP_STAT_INFO AS DWH_PTYP_STAT_INFO 
ON DWH_PRAR.PTYP_ID = DWH_PTYP_STAT_INFO.PTYP_ID 
INNER JOIN
  #pUKDWSCHEMA#.DWH_ARRG AS DWH_ARRG 
ON DWH_IP_ARRG_RELN.ARRG_ID = DWH_ARRG.ARRG_ID
WHERE
 DWH_IP_XREF.CDU_CIN IS NOT NULL
AND
 DWH_PRAR.PTYP_ID IS NOT NULL
AND
 DWH_ARRG.ARRG_ENDT IS NULL 
AND
 DWH_PTYP_STAT_INFO.PTYP_STAT_TYCD = 'RANK'
AND
 DWH_PTYP_STAT_INFO.PTYP_SVAL > 0
ORDER BY
 DWH_IP_XREF.IP_ID, 
 DWH_PTYP_STAT_INFO.PTYP_SVAL;
chucksmith
Premium Member
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

As you almost do, order your data by IP_ID in ascending order, and by PTYP_SVAL in descending order. Create stage variables to indicate if the current IP_ID is the same as the previous IP_ID, and set the output constraint to only pass rows whose IP_ID is NOT the same as the previous IP_ID.
tonystark622
Premium Member
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

how do you get the last row to output (if needed). this is an issue that I've run into with using Stage Variables for this kind of solution.

Tony
chucksmith
Premium Member
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

By ordering PTYP_SVAL in descending order, the row you want is the first row, not the last row.

In general, when you need to capture the last row, you can use the aggregator stage with the LAST action. Your transform could have two links, one for all but the last row, and another for to the aggregator for the last row.

Another option is to write your data to a hash file with a key, like IP_ID. Since a subsequent write replaces the previous contents of the row with the same key value, you are left with the last row for each key value.

My initial suggestion put the overhead of the sort on the database server, but reduces your overall output in the ETL process.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You could use QualityStage to perform your de-dup operation. Call it from the QualityStage plug-in. Simply stream the file in to this stage, and receive back the de-dup'ed file for ongoing processing.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

Instead of Select Distinct

Use Select ip_id,cdu_cin,max(ptyp_sval)
from table
group by ip_id,cdu_cin

This will extract the rows from the SQL query. That's the most efficient way.

If you want to do this in datastage as Chuck Smith said get the source sorted from the database and use the stage variable to store the previous and compare it with the current value and send the output if it is greater or lesser based on your sort from the source.

Thanks
Regards
Siva

Listening to the Learned

"The most precious wealth is the wealth acquired by the ear Indeed, of all wealth that wealth is the crown." - Thirukural By Thiruvalluvar
JezT
Participant
Posts: 75
Joined: Wed Sep 15, 2004 1:52 am

Post by JezT »

Cheers Rasi

That works a treat !!

Jez
Post Reply