Lookup - No of Rows returned

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Rahul
Participant
Posts: 19
Joined: Wed Oct 29, 2003 10:21 pm

Lookup - No of Rows returned

Post by Rahul »

Hi,

Can anyone let me know how I could figure out whether the number of rows returned by a lookup is more than one?


Regards

Rahul
Peytot
Participant
Posts: 145
Joined: Wed Jun 04, 2003 7:56 am
Location: France

Post by Peytot »

DataStage returns the first row that it finds. I don't think you can process more than one row per lookup...

Pey
inter5566
Premium Member
Posts: 57
Joined: Tue Jun 10, 2003 1:51 pm
Location: US - Midwest

Post by inter5566 »

Perhaps you could perform a multi-row lookup (described elsewhere on this forum) and use a stage variable, along with a comparison to the last row written, to add a lookup/sequence number to all output. Then use another Transformer to count or manipulate the rows based on this sequence number.
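
A rough SQL analogue of that idea (just a sketch, not the stage-variable technique itself, and assuming a placeholder source table called src_table on a database that supports analytic functions) would be to tag each row with a per-key sequence number:

Code:

SELECT
     col1,
     col2,
     ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2) AS lkp_seq
FROM src_table

Any col1 whose lkp_seq goes above 1 has returned more than one lookup row, which answers the original question.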

Just thinking out loud 8)
Paul Preston
Participant
Posts: 24
Joined: Wed Apr 02, 2003 7:09 am
Location: United Kingdom

Post by Paul Preston »

Yes, inter5566's proposal seems very good. However, for the multi-row lookup to work you have to use either an ODBC stage or a UniVerse table.
djbarham
Participant
Posts: 34
Joined: Wed May 07, 2003 4:39 pm
Location: Brisbane, Australia

Post by djbarham »

Paul Preston wrote:you have to use either an ODBC stage or a universe table.
That is correct. As far as I know, all other stage types can only return zero or one row. In fact, in some versions of the ORAOCI8 stage, returning more than one row will generate an Oracle error, as the OCI stage has explicitly asked Oracle for only one row.

To answer your question for other than UV and ODBC stages, I would need to know the context of why you need to know and what you plan to do with the information in your job.
Rahul
Participant
Posts: 19
Joined: Wed Oct 29, 2003 10:21 pm

Post by Rahul »

Thank you all for the replies to the post.

Here's what I want to do.

I have col1 and col2. The data could look like this:

col1 col2

1234 NULL
1234 9999
2345 NULL
3456 8888
3456 NULL
6789 NULL

Now, when I do the lookup, I would like these values to be returned:

1234 9999
2345 NULL
3456 8888
6789 NULL

If you observe, col1 = 1234 has col2 values of both NULL and 9999, so I have to pick up the row with col2 = 9999. In case there is no non-NULL value in col2 for a key, I have to pick up the row with NULL (as for 2345 and 6789). I do not want to pick up the NULL row when a non-NULL value also exists for the same col1.

Can someone suggest a way to handle this?


Thanks

Rahul
aaronej
Participant
Posts: 31
Joined: Mon Aug 18, 2003 9:25 am

Post by aaronej »

Rahul,

I would use an ODBC stage along with custom SQL or a stored procedure to return only the records you want to look up against. For example:

Code:

SELECT 
     t.col1,
     t2.col2
FROM
     (SELECT DISTINCT col1 FROM table) AS t
     LEFT OUTER JOIN table AS t2
     ON t.col1 = t2.col1
     AND t2.col2 IS NOT NULL

This will give you a list of all your col1 values with the correct information in col2.

Hope this helps.

Aaron
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

If your situation is 0, 1, or 2 rows that can be referenced, then simply perform two lookups. The first lookup is WHERE col2 IS NOT NULL and the second lookup is WHERE col2 IS NULL. Simply pick the result you want. If neither lookup is a hit, then there's no lookup row. If the first lookup is a hit, then use it; otherwise use the second. Two queries against the table shouldn't be that bad.
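
As a rough sketch in SQL (src_table is just a placeholder name, and :col1 stands for the key value bound from the input link -- the exact bind syntax depends on the stage you use):

Code:

-- first lookup: only rows where col2 is populated
SELECT col1, col2
FROM src_table
WHERE col1 = :col1
  AND col2 IS NOT NULL

-- second lookup: only the row where col2 is NULL
SELECT col1, col2
FROM src_table
WHERE col1 = :col1
  AND col2 IS NULL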

I'd prefer a hash file approach myself. You could build a hash file where the primary key is col1 plus another column called col2_whatever. When loading the hash file, derive col2_whatever as:

Code:

If IsNull(col2) Then "NULL" Else "NOTNULL"
Then, in your job that uses the hash file as a reference, do two reference lookups as described above. For the col2_whatever key expression on the first lookup, put in the literal "NOTNULL"; for the second lookup, put in the literal "NULL". Your two hits against the hash file will be fast, and where the same lookup rows are repeatedly referenced you get better performance than hitting the database.
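
As a sketch, assuming the hash file is loaded from a database table (placeholder name src_table), the SELECT feeding it could derive the extra key column on the database side, equivalent to the derivation above:

Code:

SELECT
     col1,
     -- same rule as the BASIC derivation: tag the row by whether col2 is populated
     CASE WHEN col2 IS NULL THEN 'NULL' ELSE 'NOTNULL' END AS col2_whatever,
     col2
FROM src_table
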
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Rahul
Participant
Posts: 19
Joined: Wed Oct 29, 2003 10:21 pm

Post by Rahul »

Thanks Ken. In fact, I was trying a sample of the approach you suggested. It works!!!


Thank you all for the responses. :)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You don't need the rows with NULL in col2. The documented behaviour of DataStage reference lookups in server jobs is to return NULL for every column defined on the reference lookup link (and to set that link's NOTFOUND link variable) in the situation where the lookup fails to return any rows.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Neil C
Participant
Posts: 46
Joined: Sun Apr 06, 2003 8:59 pm
Location: Auckland, New Zealand

Post by Neil C »

ray.wurlod wrote: You don't need the rows with NULL in col2...
Yes, you do need the NULL values. What Rahul wants is to return a NULL value with a FOUND condition if this is the only row available for the particular value of col1. If there are rows as specified above, with NULL and also a value of 8888 in col2, then it is the 8888 that is required. In this case col2 = NULL has a distinct meaning, rather than being a NOTFOUND condition.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Maybe I wasn't clear enough. The original poster has two columns that identify uniqueness, col1 and col2. The problem is that col1 is the only known value in the primary input stream. Therefore, in a hash file design, the primary key would have to be just col1, but that would cause the two potential rows for a key to clobber each other. So you must have a two-part key.

I suggested making the hash file primary key col1 and col2_whatever, where col2_whatever was derived as either a literal "NULL" or "NOTNULL", depending on the value of col2. col2 would become just an attribute. This prevents the two rows from clobbering. You incur a double hit against the hash file for the "NULL" and "NOTNULL" check, but this is preferable to a repeated database hit.

You would do the same thing via an OCI lookup, but for the col2 key expression use IS NULL in one link and IS NOT NULL in the other link.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Neil C
Participant
Posts: 46
Joined: Sun Apr 06, 2003 8:59 pm
Location: Auckland, New Zealand

Post by Neil C »

Unfortunately, this approach will break down if you have more than one non-null value. What would be the best way to handle data such as:

1234 NULL
1234 5678
1234 6789

The answer will probably be "It depends". Will a generated rowid be useful?
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

It's a dangerous situation, and it becomes a business decision rather than a technical decision. If we have an input row with multiple lookup matches, which one do we take? Do we take the oldest match, the first match, the match with the highest dollar value, etc.? In Rahul's situation the business answer is to always take the one where a non-key column is not null. If this still produces more than one row, he can refine it to always take the highest code, i.e. choose 9999 ahead of NULL or 8888. I don't think a DataStage programmer should be making this decision without consulting someone who knows the data. Once you know how you are matching your lookup, you can usually find a technical solution such as those mentioned here, or by aggregating into a new hash file where col1 becomes the primary key.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Absolutely, this is a subject matter expert call. I stated a solution for the given data example. It is correctly noted that the business rule has to be absolutely defined.

Also, in my own defense 8), I did state:

Code:

If your situation is 0, 1, or 2 rows that can be referenced, then simply perform two lookups

The solution shown works for the situation I stated. If you can have anywhere from zero to many lookup rows, then it has to be decided which col2 value you will pick. If the rule is min/max, then you simply aggregate the data going into the hash file and pick only the max, using an NVL to set the rows with a NULL col2 to some extreme min/max value. This can be done via SQL when the data is fetched from the database, or with the Aggregator stage in the job, depending on the data volume. Now you have only a one-row lookup against the hash file.
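
For instance, a sketch of that NVL trick in Oracle-flavoured SQL (placeholder table name src_table, assuming a numeric col2 and a sentinel lower than any real code):

Code:

SELECT
     col1,
     -- NULL rows get an extreme low value so any real code wins the MAX;
     -- NULLIF maps the sentinel back to NULL when it is the only row for the key
     NULLIF(MAX(NVL(col2, -999999999)), -999999999) AS col2
FROM src_table
GROUP BY col1

Either way the hash file ends up with exactly one row per col1.
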
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle