Database stages used for lookup and pass through args

sumesh.abraham · Post by **sumesh.abraham** » Thu Jul 28, 2011 12:16 pm

We have a server job that uses 5 look ups (joining multiple tables in the SQL query,
many of the tables used in queries are huge with many hundred thousand records)
and we have noticed severe performance degradation. I created a parallel job and used Oracle Connector stage for the look up.
Since in DRS stage in server jobs, we can use pass through paramter in the SQL query

e.g:

SELECT col1,
col2,
col3,
....
FROM table1

WHERE col1 = ?

I am not seeing any option to use this kind of WHERE clause in Oracle Connector stage or Oracle Enterprise stage.
If we query the table without any WHERE clause like this, would this not be a hit to query performance and hence the job execution?
Is there a way to implement this in parallel jobs?

chulett · Post by **chulett** » Thu Jul 28, 2011 2:10 pm

The where clause is automatically generated for any columns marked as a "key".

sumesh.abraham · Post by **sumesh.abraham** » Sun Jul 31, 2011 10:02 pm

Thanks for the reply.
Sorry, I am missing something here.
My design is like this.

sequential file
.
.
Transformer stage
.
.
Oracle connector stage .... Lookup stage
.
.
sequential file

What I am doing in the job is like this. I have defined Oracle connector stage and loaded the column definition by importing the table definition using Import -> Table fenitions -> Orchestrate schema definitions. I have a column marked as the key column. For select statement, I am providing the select satement with the columns that I want. So I am not able to see any where clause generated for the key column. I am basically trying to look up the value for one column based on the where condition on key column for each record. How will I be able to get the where clause et added automatically? Please suggest.

ray.wurlod · Post by **ray.wurlod** » Sun Jul 31, 2011 10:09 pm

Is this a normal Lookup or a sparse Lookup?

What stage type is feeding the reference link into the Lookup stage?

sumesh.abraham · Post by **sumesh.abraham** » Sun Jul 31, 2011 10:17 pm

This is a normal lookup and I am using Oracle connector stage to feed data to the lookup stage

ray.wurlod · Post by **ray.wurlod** » Sun Jul 31, 2011 10:50 pm

In a normal lookup you will never see SQL, because SQL is not used to perform the lookup. Instead, the entire reference data set (partitioned as specified) is loaded into memory and an in-memory index built "on the fly" on the designated key, whether one or more than one column. It is this index that is used to determine whether or not the key value (as mapped in the Lookup stage) exists in the reference data set. If the key exists, then the corresponding row is returned (or more than one row, if that option is enabled). If the key does not exist in the reference data set, NULL is returned for every column defined on the reference input link.

sumesh.abraham · Post by **sumesh.abraham** » Sun Jul 31, 2011 11:00 pm

Thanks for your replies.

The existing job is written in server and uses DRS stage for lookup. in DRS stage, where clause is getting included in the SQL and lookup is working as expected since there will be only one value to be matched with key column value. Since I am unable to have this same logic implemented in parallel job via a Connector stage, warning as 'Ignoring duplicate entry, no more warnings will be issued for this table' error is thrown. When I compare the output file generated by server job and my new parallel job, I do see discrepancies for column values mapped as a result of lookup.

In this case, what would you suggest for lookup in the parallel job?.

ray.wurlod · Post by **ray.wurlod** » Mon Aug 01, 2011 4:06 am

The stages you're using are OK. What partitioning are you using, and what are the lookup keys?

sumesh.abraham · Post by **sumesh.abraham** » Mon Aug 01, 2011 5:07 am

I am using Auto partitioning in the job.

ray.wurlod · Post by **ray.wurlod** » Mon Aug 01, 2011 3:38 pm

Then you should be OK. Failed lookup rule should be set to Continue, in which case any column mapped from the reference input link will be NULL if the lookup failed. There is an option for returning more than one row from a lookup, should you require that.

sumesh.abraham · Post by **sumesh.abraham** » Mon Aug 01, 2011 4:13 pm

I am using 'Continue' for lookup failure in lookup stage.

I am expecting multiple records matching lookup. Currently in the server job,using DRS stage the select query is like

select col-list
from tab1, tab2,...tabn
where
tab1.col1=? and tab1.col2 is not null and rowid=1

In this case, always the first record among a list of matching records for the key is used to map to output link. In my case with the parallel job, even if I set 'Multiple rows returned from link' for my lookup link, how can I make sure that record corresponding to rownum=1 is returned by the lookup as I am selecting all records without any where clause? Since the query in server job can have the where cluase for the key, rownum=1 would make sure that the expected record is fetched, but I am not seeing any such way in parallel job implementation.

Is there a way to have multiple records returned by the lookup and still able to map the value corresponding to first record from the set of matching records to be mapped to output link?

sumesh.abraham · Post by **sumesh.abraham** » Thu Aug 04, 2011 10:56 am

Looks like the WHERE caluse would be added/modified only if Partitioned read is enabled. The Gurus here can confirm this.
I could see the following from Infosphere Information Server documentation at http://publib.boulder.ibm.com/infocente ... reads.html

For all partitioned read methods except the Oracle partitions method, the connector modifies the WHERE clause in the specified SELECT statement. If the WHERE clause is not included in the specified SELECT statement, the connector adds a WHERE clause.

By the way I managed to find a work around by modifying the source query for lookup by ordering by the key value and then use a remove duplicates stage to retain only the first duplicate. The parallel job runs consistently within 2 minutes for past few day's run compared to 3 hours by the server job.