Lookup or Join or Merge

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
kool78
Participant
Posts: 104
Joined: Sat Aug 20, 2005 2:02 pm

Lookup or Join or Merge

Post by kool78 »

Hi Gurus,

i have to perform a lookup on a table(lookup with 4 columns)....

the reference stream has data of around 6million rows..

what i was thinking was to use merge stage(as i wanted to handle the rejects as well) instead of lookup stage(which would reduce the performance)...

i think i can also prefer join stage but rejects can't be handled in join stage..so thought of using merge stage...

i thought of posting it before implementing the logic , as i wanted to get some advices or different possible solutions from dsx gurus..

any kind of help is appreciated..

thanks n regards
"Attitude always and almost determines the altitude of your Life"
devnull
Premium Member
Premium Member
Posts: 37
Joined: Wed Mar 29, 2006 11:17 am
Location: Minneapolis, MN USA

Re: Lookup or Join or Merge

Post by devnull »

kool78 wrote:Hi Gurus,

i have to perform a lookup on a table(lookup with 4 columns)....

the reference stream has data of around 6million rows..

what i was thinking was to use merge stage(as i wanted to handle the rejects as well) instead of lookup stage(which would reduce the performance)...

i think i can also prefer join stage but rejects can't be handled in join stage..so thought of using merge stage...

i thought of posting it before implementing the logic , as i wanted to get some advices or different possible solutions from dsx gurus..

any kind of help is appreciated..

thanks n regards
How did you want to handle the rejects? You could use an outerjoin in a join stage and filter the null lookup fields to a reject link.
Michael Mabin
Minneapolis, MN USA
d3vvnull@com.gmail
(Reverse com and gmail to send email)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Lookup stage is to be preferred if the reference data set will fit in memory. It does not "reduce the performance". And the inputs do not need to be sorted. Partitioning must ensure that every valid lookup will succeed.

Join stage does not support reject link. But you could employ a Join stage with a downstream Filter stage to direct those rows for which the lookup had failed to a separate link. Inputs must be identically key partitioned and sorted.

A Merge stage with a single reference input will fulfil your needs, but beware that the inputs need to be sorted, identically key partitioned and, for preference, de-duplicated.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
keshav0307
Premium Member
Premium Member
Posts: 783
Joined: Mon Jan 16, 2006 10:17 pm
Location: Sydney, Australia

Post by keshav0307 »

which database you are using??
is your main stream rows are also very large?
devnull
Premium Member
Premium Member
Posts: 37
Joined: Wed Mar 29, 2006 11:17 am
Location: Minneapolis, MN USA

Post by devnull »

Oh yeah. I don't know if this holds for other databases, but for DB2 if the input and lookup tables are in the same database, it's most efficient to do a join in the DB2 read (in this case an outer join), since DB2 does a better job optimizing retrieval of data from its database than DataStage does. Then you can use a downstream filter with a reject link.
Michael Mabin
Minneapolis, MN USA
d3vvnull@com.gmail
(Reverse com and gmail to send email)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Michael's contribution does hold true in most, if not all, databases. This is usually because common practice is to index foreign key columns.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kool78
Participant
Posts: 104
Joined: Sat Aug 20, 2005 2:02 pm

Post by kool78 »

Hi all,

Thanks for the responses.....

Ray, i meant wrong abt the lookup stage(tht it would reduce the performance)...may be i used a wrong word....

i wanted to say tht if the reference stream doesn't fit in the memory then what would be the best option to enhance the performance..

also for the rejects....i was looking to handle rejects which doesn't get matched with source records....

thanks n regards
"Attitude always and almost determines the altitude of your Life"
talk2shaanc
Charter Member
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

devnull wrote:Oh yeah. I don't know if this holds for other databases, but for DB2 if the input and lookup tables are in the same database, it's most efficient to do a join in the DB2 read (in this case an outer join), since DB2 does a better job optimizing retrieval of data from its database than DataStage does. Then you can use a downstream filter with a reject link.
Few points to consider, before adopting any method.
Datastage join or Merge or Look-up.
1. Look-up table should be small to fit into memory, if big look-up table dont use it.
2. If using join or Merge, (I am not considering the type of input source you may have). There is burden of sorting and partitioning it. Next V.IMP question "Does your scratch area (its the place were DS, stores all the records before sorting and also sorts once it has all records from input) have enough space to sort both the stream ?? "
3. Assuming you have enough space. Depending upon volume in two input and number of nodes available for your DS, Next analyze " how much time will it take to process all the record ? (sort+partition+join)

DB2 Join:
1. Db2 join is highly efficient if you have high volume and less space on server for DS Join and MOST IMPORTANT if the table is indexed on the join keys. If the table are not indexed on Join keys, it will be an extra burden on DB2. Because it will perform full scan of the two tables.
Shantanu Choudhary
talk2shaanc
Charter Member
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

Another 2 cents i missed out.
While sorting in the scratch, consider count of record as well as the record length (record length =sort key column + non sort key column ). You may be having just 1 million records, but if record length is 2000bytes, its 2GB space.

Remember in production your job will not be the only job, there will be many other jobs running parallely. So there will be contention for space and other resources.
Shantanu Choudhary
clickart
Premium Member
Premium Member
Posts: 15
Joined: Tue Oct 18, 2005 10:14 pm
Location: Schaumburg, IL

Post by clickart »

devnull wrote:Oh yeah. I don't know if this holds for other databases, but for DB2 if the input and lookup tables are in the same database, it's most efficient to do a join in the DB2 read (in this case an outer join), since DB2 does a better job optimizing retrieval of data from its database than DataStage does. Then you can use a downstream filter with a reject link.
In case of partitioned/federated database, its better to verify whether the tables you intend to join in DB2 are co-located. If the tables are not co-located, its better to join them using Join Stage.
talk2shaanc
Charter Member
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

clickart wrote: In case of partitioned/federated database, its better to verify whether the tables you intend to join in DB2 are co-located. If the tables are not co-located, its better to join them using Join Stage.
True, If your DB is not co-located, then you cannot use DB2 join...But there is another situation here...you "may" have to use ODBC stage to fetch the data or write data for one of the table to Dataset in jobA and join in JobB...if you use DB2 stage for both table, DS-PX won't be able to connect to TWO INSTANCES in same job...

Again deciding factor is space, now not only in scratch area but also in the work area(work area is physical location of records for the dataset; only if dataset used)
Shantanu Choudhary
Post Reply