Join Stage Dynamics

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Join Stage Dynamics

Post by jerome_rajan »

Hi,

Scenario -
The left link (stream) has 100 records with just 5 keys, and the right link (reference) has around 10,000 records with 50 keys.
I'm looking to do an Inner Join. The data is key-partitioned and sorted.

Question 1 -
Will DataStage bring all 10,000 records into the server? Or will it stop pulling data as soon as it has found matches for all 5 keys from the stream link?

Question 2 -
The idea is to bring in data from the reference table only for the keys that match the stream. The reference data is huge, and I'm wondering whether bringing all of it into DataStage just to do a join is optimal. What other approaches/strategies would achieve good performance for this kind of key-based extraction?
I was thinking along the lines of creating a temporary table to hold all the keys from the stream link. I'd then use this temp table to do an in-database join and pull only the required data from my reference table.
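Something along these lines, maybe (table/column names below are just placeholders, and the temp table syntax assumes an Oracle-style database):

Code:

    -- stage the handful of stream keys into a temp table
    CREATE GLOBAL TEMPORARY TABLE tmp_stream_keys
    (
        join_key INTEGER
    ) ON COMMIT PRESERVE ROWS;

    INSERT INTO tmp_stream_keys (join_key)
    SELECT DISTINCT join_key
    FROM stream_source;

    -- then extract only the reference rows that match those keys
    SELECT r.*
    FROM reference_table r
    INNER JOIN tmp_stream_keys t
        ON r.join_key = t.join_key;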

Appreciate any inputs on this.
Thank you
Last edited by jerome_rajan on Wed May 13, 2015 12:14 am, edited 1 time in total.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

If I understand you correctly, then you have an input link of, say, 5 records, but you are referencing 50,000 records?

If that is the case, and depending on the reference input (is it a DB?), I would use a sparse lookup. A sparse lookup performs an individual lookup against the reference for each input row, so in the example above it would make 5 calls to the reference DB table.

Generally speaking it's best to avoid sparse lookups for the very reason that they make a lot of calls to the reference DB, but the scenario you describe is a very good example of where it is potentially the better option.
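For what it's worth, the query behind a sparse lookup is executed once per incoming row and looks roughly like this (table/column names are placeholders; ORCHESTRATE.join_key is how the stream column is bound into the reference query):

Code:

    -- run once for every row arriving on the stream link
    SELECT ref.join_key,
           ref.ref_col1,
           ref.ref_col2
    FROM reference_table ref
    WHERE ref.join_key = ORCHESTRATE.join_key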
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Are we really talking about a Server job here? The join stage there isn't anything like the Parallel version.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

chulett wrote:Are we really talking about a Server job here? The join stage there isn't anything like the Parallel version.
Whoops :oops: I failed to notice the Server job bit!

Then yes, if it's a Server job, please disregard anything I have said.
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

We're working with parallel jobs. I'm sorry I didn't mention it before. My bad.

The issue with a sparse lookup is that the DB is hit once for each row from the stream, meaning 60,000 calls for 60,000 rows on the stream link. Though the volume on the reference link is definitely 100 times more, I think it makes more sense to do a normal join here.

My question is a little different though. I want to know whether the join would pull in all the data from the reference table even though the matches will come from only a small subset of keys.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

The join operator is not going to be your problem here. It will be the sort operator that gets inserted because the join requires sorted inputs. Sort is a blocking operator, so it won't emit any output rows until it has consumed all of its input rows.

I think your temp table idea, with the join done inside the database, may be your best option...

The join operator will stop consuming rows once the driving link has run out of input, but again, it's not the join operator that will be your bottleneck.

You could try an ORDER BY in your extract SQL followed by an explicit Sort stage set to Don't Sort (Previously Sorted), but I've never tested that to see if it will also result in blocking.
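For example (names are placeholders), with the downstream Sort stage's key set to Don't Sort (Previously Sorted) so PX doesn't insert its own tsort:

Code:

    -- let the database do the sorting on the join key
    SELECT join_key,
           ref_col1,
           ref_col2
    FROM reference_table
    ORDER BY join_key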

Mike
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

Thanks Mike! Exactly what I was looking for :D
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.