Stored Proc/Unix Script/DS Routine : which is better ?

abby · Post by **abby** » Wed Jun 21, 2006 12:18 am

Hi All,

I'm facing an issue to perform a validation in my job.

The job reads an input file and validate one of its fields against a look-up table. While validating I need to see if this value exists in the look-up table and if so, I need to fetch other column's value from the lookup table and mark the record as valid else the record in question is invalid. Also, the input file size will be of the order of 70,000 records.

Now, I have three options to perform this validation :

a. Call a Strored Procedure for each input row and get the validation done ( I have a readymade stord proc for this, which needs some parameters like the value to be looked-up,the look-up table name and its columns).

b. Call a Unix Script which would perform the validation against the extracted data from the look-up table and return me the desired value. This return value will help me decide to accept or reject the record.

c. Call a DS Routine from the job which will validate the data as required.

Issues I anticipate with these approaches are:

First approach : May result in job hanging as it calls a DB object from job as many times as the number of records (here 70,000) since we are already facing job hanging issues even in simple select queries. Also,I need to pass the input link data to stored procedure as parameter value. Not sure how to do this.

Second approach : Is it advisable to call a unix script thousand times for validating data.Also, this again requires to pass the input link data as a parameter value to the unix script. Will there be any job performance issue?

Third approach : It will require reading file (look-up file data) from within a routine to perform validation against input link data. Is it advisable to use file handling withing DS Routines?

Please advise which is a better option.

ray.wurlod · Post by **ray.wurlod** » Wed Jun 21, 2006 1:01 am

Fourth, and best, option is to use DataStage.

A standard lookup will not only determine existence but also return any columns from that row that you specify.

Based on the success of failure of the lookup, you can make any further decision required such as setting a value in an IsValid variable or column.

The input file size is irrelevant; they are processed one at a time in a stream.

Best performance will be had if you pre-load the 70,000 records into a hashed file - but load only the columns that you will actually need - the key and the column whose value you want to return. That way you will have a small row size and therefore a small hashed file, which is more likely to be cached in memory successfully.