Lookup - No of Rows returned

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Rahul
Participant
Posts: 19
Joined: Wed Oct 29, 2003 10:21 pm

Lookup - No of Rows returned

Post by Rahul »

Hi,

Can anyone let me know how I could figure out whether the number of rows returned by a lookup is more than one?


Regards

Rahul
Peytot
Participant
Posts: 145
Joined: Wed Jun 04, 2003 7:56 am
Location: France

Post by Peytot »

DataStage returns the first row that it finds. I don't think you can process more than one row per lookup...

Pey
inter5566
Premium Member
Posts: 57
Joined: Tue Jun 10, 2003 1:51 pm
Location: US - Midwest

Post by inter5566 »

Perhaps you could perform a multi-row lookup (described elsewhere on this forum) and use a stage variable, along with a comparison to the last row written, to add a lookup/sequence number to all output. Then use another Transformer to count or manipulate the rows based on this sequence number.
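
A rough SQL analogue of that idea (just a sketch, not the stage-variable technique itself, and assuming a placeholder source table called src_table on a database that supports analytic functions) would be to tag each row with a per-key sequence number:

Code:

SELECT
     col1,
     col2,
     ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2) AS lkp_seq
FROM src_table

Any col1 whose lkp_seq goes above 1 has returned more than one lookup row, which answers the original question.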

Just thinking out loud 8)
Paul Preston
Participant
Posts: 24
Joined: Wed Apr 02, 2003 7:09 am
Location: United Kingdom

Post by Paul Preston »

Yes, inter5566's proposal seems very good. However, for the multi-row lookup to work you have to use either an ODBC stage or a UniVerse table.
djbarham
Participant
Posts: 34
Joined: Wed May 07, 2003 4:39 pm
Location: Brisbane, Australia

Post by djbarham »

Paul Preston wrote:you have to use either an ODBC stage or a universe table.
That is correct. As far as I know, all other stage types can only return zero or one row. In fact, in some versions of the ORAOCI8 stage, returning more than one row will generate an Oracle error, as the OCI stage has explicitly asked Oracle for only one row.

To answer your question for other than UV and ODBC stages, I would need to know the context of why you need to know and what you plan to do with the information in your job.
Rahul
Participant
Posts: 19
Joined: Wed Oct 29, 2003 10:21 pm

Post by Rahul »

Thank you all for the replies to the post.

Here's what I want to do.

I have col1 and col2. The data could look like this:

col1 col2

1234 NULL
1234 9999
2345 NULL
3456 8888
3456 NULL
6789 NULL

Now, when I do the lookup, I would like these values to be returned:

1234 9999
2345 NULL
3456 8888
6789 NULL

If you observe, col1 = 1234 has col2 values of both NULL and 9999, so I have to pick up the row with col2 = 9999. In case there is no non-NULL value in col2 for a key, I have to pick up the row with NULL (as for 2345 and 6789). I do not want to pick up the NULL row when a non-NULL value also exists for the same col1.

Can someone suggest a way to handle this?


Thanks

Rahul
aaronej
Participant
Posts: 31
Joined: Mon Aug 18, 2003 9:25 am

Post by aaronej »

Rahul,

I would use an ODBC stage along with custom SQL or a stored procedure to return only the records you want to look up against. For example:

Code:

SELECT 
     t.col1,
     t2.col2
FROM
     (SELECT DISTINCT col1 FROM table) AS t
     LEFT OUTER JOIN table AS t2
     ON t.col1 = t2.col1
     AND t2.col2 IS NOT NULL

This will give you a list of all your col1 values with the correct information in col2.

Hope this helps.

Aaron
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

If your situation is 0, 1, or 2 rows that can be referenced, then simply perform two lookups. The first lookup is WHERE col2 IS NOT NULL and the second lookup is WHERE col2 IS NULL. Simply pick the result you want. If neither lookup is a hit, then there's no lookup row. If the first lookup is a hit, then use it; otherwise use the second. Two queries against the table shouldn't be that bad.
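
As a rough sketch in SQL (src_table is just a placeholder name, and :col1 stands for the key value bound from the input link -- the exact bind syntax depends on the stage you use):

Code:

-- first lookup: only rows where col2 is populated
SELECT col1, col2
FROM src_table
WHERE col1 = :col1
  AND col2 IS NOT NULL

-- second lookup: only the row where col2 is NULL
SELECT col1, col2
FROM src_table
WHERE col1 = :col1
  AND col2 IS NULL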

I'd prefer a hash file approach myself. You could build a hash file where the primary key is col1 plus another column called col2_whatever. When loading the hash file, derive col2_whatever as:

Code:

If IsNull(col2) Then "NULL" Else "NOTNULL"
Then, in your job that uses the hash file as a reference, do two reference lookups as described above. For the col2_whatever key expression on the first lookup, put in the literal "NOTNULL"; for the second lookup, put in the literal "NULL". Your two hits against the hash file will be fast, and where the same lookup rows are repeatedly referenced you get better performance than hitting the database.
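
As a sketch, assuming the hash file is loaded from a database table (placeholder name src_table), the SELECT feeding it could derive the extra key column on the database side, equivalent to the derivation above:

Code:

SELECT
     col1,
     -- same rule as the BASIC derivation: tag the row by whether col2 is populated
     CASE WHEN col2 IS NULL THEN 'NULL' ELSE 'NOTNULL' END AS col2_whatever,
     col2
FROM src_table
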
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Rahul
Participant
Posts: 19
Joined: Wed Oct 29, 2003 10:21 pm

Post by Rahul »

Thanks Ken. In fact, I was trying a sample of the approach you suggested. It works!!!


Thank you all for the responses. :)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You don't need the rows with NULL in col2. The documented behaviour of DataStage reference lookups in server jobs is to return NULL for every column defined on the reference lookup link (and to set that link's NOTFOUND link variable) in the situation where the lookup fails to return any rows.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Neil C
Participant
Posts: 46
Joined: Sun Apr 06, 2003 8:59 pm
Location: Auckland, New Zealand

Post by Neil C »

ray.wurlod wrote: You don't need the rows with NULL in col2...
Yes, you do need the NULL values. What Rahul wants is to return a NULL value with a FOUND condition if this is the only row available for the particular value of col1. If there are rows as specified above, with NULL and also a value of 8888 in col2, then it is the 8888 that is required. In this case col2 = NULL has a distinct meaning, rather than being a NOTFOUND condition.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Maybe I wasn't clear enough. The original poster has two columns that identify uniqueness, col1 and col2. The problem is that col1 is the only known value in the primary input stream. Therefore, in a hash file design, the primary key would have to be just col1, but that would cause the two potential rows for a key to clobber each other. So you must have a two-part key.

I suggested making the hash file primary key col1 and col2_whatever, where col2_whatever was derived as either a literal "NULL" or "NOTNULL", depending on the value of col2. col2 would become just an attribute. This prevents the two rows from clobbering. You incur a double hit against the hash file for the "NULL" and "NOTNULL" check, but this is preferable to a repeated database hit.

You would do the same thing via an OCI lookup, but for the col2 key expression use IS NULL in one link and IS NOT NULL in the other link.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Neil C
Participant
Posts: 46
Joined: Sun Apr 06, 2003 8:59 pm
Location: Auckland, New Zealand

Post by Neil C »

Unfortunately, this approach will break down if you have more than one non-null value. What would be the best way to handle data such as:

1234 NULL
1234 5678
1234 6789

The answer will probably be "It depends". Will a generated rowid be useful?
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

It's a dangerous situation, and it becomes a business decision rather than a technical decision. If we have an input row with multiple lookup matches, which one do we take? Do we take the oldest match, the first match, the match with the highest dollar value, etc.? In Rahul's situation the business answer is to always take the one where a non-key column is not null. If this still produces more than one row, he can refine it to always take the highest code, i.e. choose 9999 ahead of NULL or 8888. I don't think a DataStage programmer should be making this decision without consulting someone who knows the data. Once you know how you are matching your lookup, you can usually find a technical solution such as those mentioned here, or by aggregating into a new hash file where col1 becomes the primary key.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Absolutely, this is a subject matter expert call. I stated a solution for the given data example. It is correctly noted that the business rule has to be absolutely defined.

Also, in my own defense 8), I did state:

Code:

If your situation is 0, 1, or 2 rows that can be referenced, then simply perform two lookups

The solution shown works for the situation I stated. If you can have anywhere from zero to many lookup rows, then it has to be decided which col2 value you will pick. If the rule is min/max, then you simply aggregate the data going into the hash file and pick only the max, using an NVL to set the rows with a NULL col2 to some extreme min/max value. This can be done via SQL when the data is fetched from the database, or with the Aggregator stage in the job, depending on the data volume. Now you have only a one-row lookup against the hash file.
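
For instance, a sketch of that NVL trick in Oracle-flavoured SQL (placeholder table name src_table, assuming a numeric col2 and a sentinel lower than any real code):

Code:

SELECT
     col1,
     -- NULL rows get an extreme low value so any real code wins the MAX;
     -- NULLIF maps the sentinel back to NULL when it is the only row for the key
     NULLIF(MAX(NVL(col2, -999999999)), -999999999) AS col2
FROM src_table
GROUP BY col1

Either way the hash file ends up with exactly one row per col1.
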
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle