Stripping Rows from a Sequential File

JezT · Post by **JezT** » Wed Nov 17, 2004 8:39 am

I have a file of about 10,000 rows in with 3 columns (IP_ID, CDU_CIN and PTYP_SVAL). Most of the rows have unique IP_ID's (which is the Key field) but there are some that have duplicates.

I am wanting to strip out these duplicates and to leave the occurence of the IP_ID which has the highest value in the PTYP_SVAL. For example, there are two rows in the file as follows.

Code: Select all

IP_ID, CDU_CIN, PTYP_SVAL
2, 5, 14
2, 5, 15

In this example, I would want to remove the first row and keep the second instance.

If it makes it easier, the data is being extracted from a DB2 table (SQL below). Is it possible to add something to the SQL to not select the unwanted rows ?

Code: Select all

SELECT DISTINCT
 DWH_IP_XREF.IP_ID, 
 DWH_IP_XREF.CDU_CIN, 
 DWH_PTYP_STAT_INFO.PTYP_SVAL  
FROM 
 #pUKDWSCHEMA#.DWH_IP_XREF AS DWH_IP_XREF 
INNER JOIN
  #pUKDWSCHEMA#.DWH_IP_ARRG_RELN AS DWH_IP_ARRG_RELN 
ON  DWH_IP_XREF.IP_ID = DWH_IP_ARRG_RELN.IP_ID 
INNER JOIN
  #pUKDWSCHEMA#.DWH_PRAR AS DWH_PRAR 
ON DWH_IP_ARRG_RELN.ARRG_ID = DWH_PRAR.ARRG_ID 
INNER JOIN
  #pUKDWSCHEMA#.DWH_PTYP_STAT_INFO AS DWH_PTYP_STAT_INFO 
ON DWH_PRAR.PTYP_ID = DWH_PTYP_STAT_INFO.PTYP_ID 
INNER JOIN
  #pUKDWSCHEMA#.DWH_ARRG AS DWH_ARRG 
ON DWH_IP_ARRG_RELN.ARRG_ID = DWH_ARRG.ARRG_ID
WHERE
 DWH_IP_XREF.CDU_CIN IS NOT NULL
AND
 DWH_PRAR.PTYP_ID IS NOT NULL
AND
 DWH_ARRG.ARRG_ENDT IS NULL 
AND
 DWH_PTYP_STAT_INFO.PTYP_STAT_TYCD = 'RANK'
AND
 DWH_PTYP_STAT_INFO.PTYP_SVAL > 0
ORDER BY
 DWH_IP_XREF.IP_ID, 
 DWH_PTYP_STAT_INFO.PTYP_SVAL;

chucksmith · Post by **chucksmith** » Wed Nov 17, 2004 9:48 am

As you almost do, order your data by IP_ID in ascending order, and by PTYP_SVAL in descending order. Create stage variables to indicate if the current IP_ID is the same as the previous IP_ID, and set the output constraint to only pass rows whose IP_ID is NOT the same as the previous IP_ID.

tonystark622 · Post by **tonystark622** » Wed Nov 17, 2004 10:52 am

how do you get the last row to output (if needed). this is an issue that I've run into with using Stage Variables for this kind of solution.

Tony

chucksmith · Post by **chucksmith** » Wed Nov 17, 2004 11:01 am

By ordering PTYP_SVAL in descending order, the row you want is the first row, not the last row.

In general, when you need to capture the last row, you can use the aggregator stage with the LAST action. Your transform could have two links, one for all but the last row, and another for to the aggregator for the last row.

Another option is to write your data to a hash file with a key, like IP_ID. Since a subsequent write replaces the previous contents of the row with the same key value, you are left with the last row for each key value.

My initial suggestion put the overhead of the sort on the database server, but reduces your overall output in the ETL process.

ray.wurlod · Post by **ray.wurlod** » Wed Nov 17, 2004 3:18 pm

You could use QualityStage to perform your de-dup operation. Call it from the QualityStage plug-in. Simply stream the file in to this stage, and receive back the de-dup'ed file for ongoing processing.

rasi · Post by **rasi** » Wed Nov 17, 2004 6:42 pm

Instead of Select Distinct

Use Select ip_id,cdu_cin,max(ptyp_sval)
from table
group by ip_id,cdu_cin

This will extract the rows from the SQL query. That's the most efficient way.

If you want to do this in datastage as Chuck Smith said get the source sorted from the database and use the stage variable to store the previous and compare it with the current value and send the output if it is greater or lesser based on your sort from the source.

Thanks

JezT · Post by **JezT** » Thu Nov 18, 2004 2:44 am

Cheers Rasi

That works a treat !!

Jez