Join Stage Dynamics

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Join Stage Dynamics

Post by jerome_rajan »

Hi,

Scenario -
The left link (stream) has 100 records with just 5 keys, and the right link (reference) has around 10,000 records with 50 keys.
I'm looking to do an Inner Join. The data is key-partitioned and sorted.

Question 1 -
Will DataStage bring all 10,000 records into the server? Or will it stop pulling data as soon as it has found matches for all 5 keys from the stream link?

Question 2 -
The idea is to bring in data from the reference table only for the keys that match the stream. The reference data is huge, and I'm wondering whether bringing all of it into DataStage just to do a join is optimal. What other approaches/strategies would achieve good performance for this kind of key-based extraction?
I was thinking along the lines of creating a temporary table to hold all the keys from the stream link. I'd then use this temp table to do an in-database join and pull only the required data from my reference table.
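Something along these lines, maybe (table/column names below are just placeholders, and the temp table syntax assumes an Oracle-style database):

Code:

    -- stage the handful of stream keys into a temp table
    CREATE GLOBAL TEMPORARY TABLE tmp_stream_keys
    (
        join_key INTEGER
    ) ON COMMIT PRESERVE ROWS;

    INSERT INTO tmp_stream_keys (join_key)
    SELECT DISTINCT join_key
    FROM stream_source;

    -- then extract only the reference rows that match those keys
    SELECT r.*
    FROM reference_table r
    INNER JOIN tmp_stream_keys t
        ON r.join_key = t.join_key;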

Appreciate any inputs on this.
Thank you
Last edited by jerome_rajan on Wed May 13, 2015 12:14 am, edited 1 time in total.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

If I understand you correctly, then you have an input link of, say, 5 records, but you are referencing 50,000 records?

If that is the case, and depending on the reference input (is it a DB?), I would use a sparse lookup. A sparse lookup performs an individual lookup against the reference for each input row, so in the example above it would make 5 calls to the reference DB table.

Generally speaking it's best to avoid sparse lookups for the very reason that they make a lot of calls to the reference DB, but the scenario you describe is a very good example of where it is potentially the better option.
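For what it's worth, the query behind a sparse lookup is executed once per incoming row and looks roughly like this (table/column names are placeholders; ORCHESTRATE.join_key is how the stream column is bound into the reference query):

Code:

    -- run once for every row arriving on the stream link
    SELECT ref.join_key,
           ref.ref_col1,
           ref.ref_col2
    FROM reference_table ref
    WHERE ref.join_key = ORCHESTRATE.join_key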
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Are we really talking about a Server job here? The join stage there isn't anything like the Parallel version.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

chulett wrote:Are we really talking about a Server job here? The join stage there isn't anything like the Parallel version.
Whoops :oops: I failed to notice the Server job bit!

Then yes, if it's a Server job, please disregard anything I have said.
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

We're working with parallel jobs. I'm sorry I didn't mention it before. My bad.

The issue with a sparse lookup is that the DB is hit once for each row from the stream, meaning 60,000 calls for 60,000 rows on the stream link. Though the volume on the reference link is definitely 100 times more, I think it makes more sense to do a normal join here.

My question is a little different though. I want to know whether the join would pull in all the data from the reference table even though the matches will come from only a small subset of keys.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

The join operator is not going to be your problem here. It will be the sort operator that gets inserted because the join requires sorted inputs. Sort is a blocking operator, so it won't emit any output rows until it has consumed all of its input rows.

I think your temp table idea, with the join done inside the database, may be your best option...

The join operator will stop consuming rows once the driving link has run out of input, but again, it's not the join operator that will be your bottleneck.

You could try an ORDER BY in your extract SQL followed by an explicit Sort stage set to Don't Sort (Previously Sorted), but I've never tested that to see if it will also result in blocking.
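For example (names are placeholders), with the downstream Sort stage's key set to Don't Sort (Previously Sorted) so PX doesn't insert its own tsort:

Code:

    -- let the database do the sorting on the join key
    SELECT join_key,
           ref_col1,
           ref_col2
    FROM reference_table
    ORDER BY join_key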

Mike
jerome_rajan
Premium Member
Posts: 376
Joined: Sat Jan 07, 2012 12:25 pm
Location: Piscataway

Post by jerome_rajan »

Thanks Mike! Exactly what I was looking for :D
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn

Life is really simple, but we insist on making it complicated.