
Lookup or Join or Merge

Posted: Wed Nov 01, 2006 3:29 pm
by kool78
Hi Gurus,

I have to perform a lookup against a table (a lookup on 4 columns).

The reference stream has around 6 million rows.

What I was thinking was to use the Merge stage (as I wanted to handle the rejects as well) instead of the Lookup stage, which I thought would reduce performance.

I think I could also use the Join stage, but rejects can't be handled in the Join stage, so I thought of using the Merge stage.

I thought of posting this before implementing the logic, as I wanted to get advice or other possible solutions from the DSX gurus.

Any kind of help is appreciated.

Thanks and regards

Re: Lookup or Join or Merge

Posted: Wed Nov 01, 2006 3:43 pm
by devnull
kool78 wrote: I have to perform a lookup against a table (a lookup on 4 columns)... the reference stream has around 6 million rows... I was thinking of using the Merge stage instead of the Lookup or Join stage so I could handle rejects...
How did you want to handle the rejects? You could use an outer join in a Join stage and then filter the rows with null lookup fields to a reject link.

Posted: Wed Nov 01, 2006 11:28 pm
by ray.wurlod
Lookup stage is to be preferred if the reference data set will fit in memory. It does not "reduce the performance". And the inputs do not need to be sorted. Partitioning must ensure that every valid lookup will succeed.

The Join stage does not support a reject link, but you could employ a Join stage with a downstream Filter stage to direct the rows for which the lookup failed to a separate link. Inputs must be identically key partitioned and sorted.
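A rough sketch of that design (REF_DESC here is a made-up name for a column supplied only by the reference input, and this assumes that column is never legitimately NULL): do a left outer join on the key in the Join stage, then set the Where clauses on the two Filter stage output links along the lines of

    REF_DESC IS NOT NULL

for the matched rows and

    REF_DESC IS NULL

for the link you treat as the reject link.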

A Merge stage with a single reference input will fulfil your needs, but beware that the inputs need to be sorted, identically key partitioned and, for preference, de-duplicated.

Posted: Thu Nov 02, 2006 6:06 am
by keshav0307
Which database are you using?
Is your main input stream also very large?

Posted: Thu Nov 02, 2006 7:15 am
by devnull
Oh yeah. I don't know if this holds for other databases, but for DB2, if the input and lookup tables are in the same database, it's most efficient to do the join in the DB2 read (in this case an outer join), since DB2 does a better job of optimizing retrieval of data from its own database than DataStage does. Then you can use a downstream Filter stage with a reject link.
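For illustration only (table and column names below are invented, not from the original post), the user-defined SQL in the DB2 read could look roughly like this:

    SELECT s.cust_id,
           s.order_amt,
           r.cust_name
    FROM   src_orders s
           LEFT OUTER JOIN ref_customers r
           ON r.cust_id = s.cust_id

Rows where r.cust_name comes back NULL are the failed lookups, so a downstream Filter stage can send them to the reject link.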

Posted: Thu Nov 02, 2006 8:21 am
by ray.wurlod
Michael's contribution does hold true in most, if not all, databases. This is usually because common practice is to index foreign key columns.

Posted: Thu Nov 02, 2006 9:44 am
by kool78
Hi all,

Thanks for the responses.

Ray, I put it wrongly about the Lookup stage (saying that it would reduce performance); maybe I used the wrong words.

What I wanted to ask is: if the reference stream doesn't fit in memory, what would be the best option for performance?

Also, about the rejects: I was looking to handle the records that don't get matched against the source records.

Thanks and regards

Posted: Thu Nov 02, 2006 10:31 am
by talk2shaanc
devnull wrote: ...for DB2, if the input and lookup tables are in the same database, it's most efficient to do the join in the DB2 read (in this case an outer join)... Then you can use a downstream filter with a reject link.
A few points to consider before adopting any method.

DataStage Join, Merge or Lookup:
1. For a Lookup, the reference table should be small enough to fit in memory; if the lookup table is big, don't use the Lookup stage.
2. If using Join or Merge (I am not considering the type of input source you may have), there is the burden of sorting and partitioning it. The next, very important, question is: does your scratch area (the place where DataStage stages all the records before sorting, and sorts once it has all the records from the input) have enough space to sort both streams?
3. Assuming you have enough space, then depending on the volumes of the two inputs and the number of nodes available to DataStage, analyse how much time it will take to process all the records (sort + partition + join).

DB2 join:
1. A DB2 join is highly efficient if you have high volume, limited space on the server for a DataStage join and, most importantly, the tables are indexed on the join keys. If the tables are not indexed on the join keys, it will be an extra burden on DB2, because it will perform a full scan of both tables.
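If you are not sure whether the join keys are indexed, you (or a DBA) can check the DB2 catalog and add an index if needed. A sketch only, with made-up table and column names:

    SELECT indname, colnames
    FROM   syscat.indexes
    WHERE  tabname = 'REF_CUSTOMERS'

    CREATE INDEX ref_customers_ix1 ON ref_customers (cust_id)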

Posted: Thu Nov 02, 2006 10:45 am
by talk2shaanc
Another two cents I missed: while sorting in the scratch area, consider the record count as well as the record length (record length = sort key columns + non-sort-key columns). You may have just 1 million records, but if the record length is 2000 bytes, that is 2 GB of space.

Remember that in production your job will not be the only job; there will be many other jobs running in parallel, so there will be contention for space and other resources.

Posted: Thu Nov 02, 2006 1:01 pm
by clickart
devnull wrote: ...for DB2, if the input and lookup tables are in the same database, it's most efficient to do the join in the DB2 read... Then you can use a downstream filter with a reject link.
In the case of a partitioned/federated database, it's better to verify whether the tables you intend to join in DB2 are co-located. If the tables are not co-located, it's better to join them using the Join stage.

Posted: Thu Nov 02, 2006 1:14 pm
by talk2shaanc
clickart wrote: In the case of a partitioned/federated database, it's better to verify whether the tables you intend to join in DB2 are co-located...
True. If your DB tables are not co-located, then you cannot use a DB2 join. But there is another consideration: you may have to use an ODBC stage to fetch the data, or write the data for one of the tables to a Dataset in job A and do the join in job B, because if you use the DB2 stage for both tables, DataStage PX won't be able to connect to two instances in the same job.

Again, the deciding factor is space, now not only in the scratch area but also in the work area (the work area is the physical location of the records for a Dataset; it only comes into play if a Dataset is used).