RCP In remove duplicates and to pass the keys as param

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

nansekar
Participant
Posts: 29
Joined: Mon Nov 03, 2008 6:23 am

RCP In remove duplicates and to pass the keys as param

Post by nansekar »

Hello Team

I want to build a generic job with RCP enabled, using 3 stages:
Source: Dataset
Target: Netezza DB stage
and a Remove Duplicates stage with RCP enabled.

I want to pass the keys to the Remove Duplicates stage as a parameter.

For example: if I run it for table A, which has 2 columns as its key, those have to be passed as a parameter.
For the second run I have table B, which has 3 key columns, and I need to remove duplicates based on those 3 columns.

Is it possible to have such a design?

This is a job design for migrating the database from DB2 to Netezza. In DB2 we had unique constraints that handled the rejection of duplicates, whereas in Netezza such an option is not available.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You'll probably need a Modify stage to materialize the key column names.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
SettValleyConsulting
Premium Member
Posts: 72
Joined: Thu Sep 04, 2003 5:01 am
Location: UK & Europe

Post by SettValleyConsulting »

This is something I've been musing on for some time - there are numerous occasions where I'd like to be able to use RCP and identify key column names as parameters, for sorting, de-duping, change capture etc.

The ideal solution would be a reusable component - probably a shared container - with the following design aims. Let's say we want to cater for a maximum of 5 possible key columns:

A component which takes a variable number of column names and moves the corresponding column values from an RCP-enabled data stream into named columns KEY01, KEY02 ... KEY05. These will probably need to be Varchar to accept various input data types. Any unused key cols should be set to empty string.

It's the 'variable number' that gives the greatest difficulty. I've come up with 2 possible approaches, using Modify and Column Export plus a schema file, neither of which is 'finished', and I've been fooling around with both in my 'spare time'...


1. Using Modify - pass parameters as pKEY01, pKEY02 etc.

Step 1: A Column Generator stage adds a column to the data stream set to null (percent null = 100). This is used to populate any of the 5 keys that we don't need.

Step 2: A Modify stage with 5 output columns declared: KEY01, KEY02 etc., type Varchar. Taking the case where there are 2 key columns, I have three Specifications, passed in as parameters...

Specification=KEY01:string=#pKEY01#;KEY02:string=#pKEY02#;
Specification=#pKEY01#=#pKEY01#;#pKEY02#=#pKEY02#;
Specification=KEY03:string=handle_null(NullColumn, "");KEY04:string=handle_null(NullColumn, "");KEY05:string=handle_null(NullColumn, "");

The first Spec maps the 2 key column values to KEY01 and KEY02.
The second tells the Modify stage to retain the original key columns (you may not need this). If you don't do this, Modify assumes you are renaming the columns.
The third sets the unused key column slots to "".

This works, but clearly this particular set of Specs only works for the case of 2 key columns. I've not cracked making it truly generic. I thought of a shared container which sends the data stream to one of 5 Modify stages with the Specifications hard-coded for 1 key column, 2 key columns, etc., with a Filter or Transformer sending the data to the appropriate Modify - or 5 different shared containers. Neither is particularly elegant.

I tried a parameter set with the three specifications as parameters and value files for 1 key, 2 keys, etc., with the intention that the user would specify the appropriate value file. But it seems that if you do this, DataStage does not resolve the parameter names; that is, the Modify sees the literal '#pKEY01#' in the specification rather than the resolved column name, which is a shame.
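One way around the 'variable number' problem might be to build the three Specification strings outside the job and pass them in as ordinary job parameters, e.g. from the wrapper script that calls dsjob. The sketch below is untested against a real job, and the function and parameter names are illustrative; it just shows the string-building for an arbitrary key list (up to 5):

```shell
#!/bin/sh
# Sketch only: generates the three Modify Specifications for a
# comma-separated list of key columns, padding unused KEY0n slots
# from the Column Generator's dummy column.

build_specs() {
  keys="$1"                      # e.g. "CUST_ID,ORDER_ID"
  spec1=""; spec2=""; spec3=""
  i=0
  old_ifs="$IFS"; IFS=','
  for col in $keys; do
    i=$((i + 1))
    spec1="${spec1}KEY0${i}:string=${col};"   # map key value into KEY0n slot
    spec2="${spec2}${col}=${col};"            # retain the original column
  done
  IFS="$old_ifs"
  # pad the unused slots with the null column, as in the third Spec above
  j=$((i + 1))
  while [ "$j" -le 5 ]; do
    spec3="${spec3}KEY0${j}:string=handle_null(NullColumn, \"\");"
    j=$((j + 1))
  done
  printf 'Specification=%s\n' "$spec1" "$spec2" "$spec3"
}

# e.g. the two-key case from the example above
build_specs "CUST_ID,ORDER_ID"
```

This assumes the Column Generator's dummy column is called NullColumn, as in Step 1; each emitted line would be passed as the value of one Specification parameter.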


2. Using a schema file and Column Export.

[This is not my idea, but it's in use at my current site.] Basically there's some code that dynamically generates a schema file listing the key columns from the key parameters, then uses this in a Column Export stage to move the key columns into a single key column. The Column Export properties are:

Column Method = Schema File
Schema File = <Name of dynamic schema file>
Export Column Type = VarChar
Export Column Name = tempKeyColumn

In the Format tab the delimiter is set to comma, so we end up with the keys in a single column, delimited by commas.
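For what it's worth, the dynamic schema generation can be a few lines of shell. This is a sketch under assumptions: the keys arrive as a comma-separated parameter, every key is declared as a nullable variable-length string (consistent with the VarChar export type above), and the function name and file path are illustrative, not what my site actually uses:

```shell
#!/bin/sh
# Sketch only: writes a minimal schema file listing just the key
# columns, for use as the Schema File property of the Column Export stage.

write_key_schema() {
  keys="$1"          # comma-separated key column list, e.g. "CUST_ID,ORDER_ID"
  outfile="$2"       # path of the schema file to create
  {
    echo 'record'
    echo '('
    old_ifs="$IFS"; IFS=','
    for col in $keys; do
      # declare each key as a nullable variable-length string
      echo "  ${col}:nullable string;"
    done
    IFS="$old_ifs"
    echo ')'
  } > "$outfile"
}

write_key_schema "CUST_ID,ORDER_ID" /tmp/key_schema.osh
cat /tmp/key_schema.osh
```

The job (or sequence) would run this in a before-job routine, then point the Column Export stage's Schema File property at the generated file.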


As often seems to be the case, I've come up against limitations in the DS product: the Modify stage doesn't seem to resolve parameters passed in from a parameter set, and the Column Export stage doesn't accept parameters for the column name property...

Any thoughts gratefully received...
Phil Clarke
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

I faced a somewhat similar quandary while trying to get a key to perform a generic delete. Reading through this might help. In the end, the solution to my problem was rather simple.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.