Hi,
Scenario -
The left link (stream) has 100 records with just 5 keys, and the right link (reference) has around 10,000 records with 50 keys.
I'm looking to do an Inner Join. The data is key partitioned and sorted.
Question 1 -
Will DataStage bring all 10,000 reference records into the server? Or will it stop pulling data as soon as it has found matches for all 5 keys from the stream link?
Question 2 -
The idea is to bring in data from the reference table only for matching keys in the stream. The reference data is huge and I'm wondering if bringing all the data into DataStage to do a join would be optimal or not. What are some of the other approaches/strategies to achieve optimum performance to do such key based extraction?
I was thinking along the lines of creating a temporary table that would hold all the keys from the stream link. I'd then do an in-database join against this temp table to pull only the required rows from my reference table.
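The temp-table idea can be sketched outside of DataStage. This is not DataStage code, just a Python illustration with sqlite3 standing in for the reference database; the table names, column names, and data are all invented. The point is that only the matching reference rows ever leave the database:

```python
import sqlite3

# Stand-in for the reference database: 50 keys' worth of reference rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reference (key INTEGER, val TEXT)")
con.executemany("INSERT INTO reference VALUES (?, ?)",
                [(k, f"ref{k}") for k in range(50)])

# The 5 distinct keys present on the stream link.
stream_keys = [1, 2, 3, 4, 5]

# Stage the stream keys in a temp table, then join inside the database,
# so the extract returns only reference rows for keys that exist upstream.
con.execute("CREATE TEMP TABLE stream_keys (key INTEGER)")
con.executemany("INSERT INTO stream_keys VALUES (?)",
                [(k,) for k in stream_keys])

rows = con.execute("""
    SELECT r.key, r.val
    FROM reference r
    JOIN stream_keys s ON r.key = s.key
""").fetchall()
```

Here `rows` holds 5 rows rather than 50; the filtering happens in the database, which is the whole appeal of the approach when the reference data is huge.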
Appreciate any inputs on this.
Thank you
Join Stage Dynamics
-
- Premium Member
- Posts: 376
- Joined: Sat Jan 07, 2012 12:25 pm
- Location: Piscataway
Last edited by jerome_rajan on Wed May 13, 2015 12:14 am, edited 1 time in total.
Jerome
Data Integration Consultant at AWS
Connect With Me On LinkedIn
Life is really simple, but we insist on making it complicated.
If I understand you correctly, you have an input link of, say, 5 records but you are referencing 50,000 records?
If that is the case then, depending on the reference input (is it a DB?), I would use a sparse lookup. A sparse lookup performs an individual query against the reference for each input row, so in the example above it would make 5 calls to the reference DB table.
Generally speaking it's best to avoid sparse lookups for that very reason - they make a lot of calls to the reference DB - but the case you describe is one where a sparse lookup is potentially the better option.
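The per-row behavior of a sparse lookup can be sketched in Python, again with sqlite3 as a stand-in for the reference database and made-up table and column names. Each incoming stream row triggers one indexed query against the reference, and non-matches are dropped, which mirrors inner-join semantics:

```python
import sqlite3

# Stand-in reference table with an index on the lookup key.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reference (key INTEGER PRIMARY KEY, val TEXT)")
con.executemany("INSERT INTO reference VALUES (?, ?)",
                [(k, f"ref{k}") for k in range(50)])

# Incoming stream rows as (key, payload) pairs.
stream = [(1, "a"), (2, "b"), (3, "c")]

# Sparse-lookup style: one query per input row.
matched = []
for key, payload in stream:
    hit = con.execute("SELECT val FROM reference WHERE key = ?",
                      (key,)).fetchone()
    if hit:  # inner-join semantics: rows with no reference match are dropped
        matched.append((key, payload, hit[0]))
```

With 3 stream rows this issues 3 queries; with 60,000 stream rows it would issue 60,000, which is exactly the trade-off discussed below.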
We're working with parallel jobs. I'm sorry I didn't mention it before. My bad.
The issue with a sparse lookup is that the DB is hit once for each row on the stream link - meaning 60,000 queries for 60,000 stream rows. Even though the volume on the reference link is easily 100 times greater, I think a normal join makes more sense here.
My question is a little different though: I want to know whether the join would pull in all the data from the reference table even though only a small subset of keys will actually match.
Jerome
The join operator is not going to be your problem here. It will be the sort operator that gets inserted because the join requires sorted inputs. Sort is a blocking operator, so it won't emit any output rows until it has consumed all of its input rows.
I think your temp table idea, with the join done inside the database, may be your best option...
The join operator will stop consuming rows once the driving link has run out of input, but again, it's not the join operator that will be your bottleneck.
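That stopping behavior of a join over sorted inputs can be sketched as a merge join in Python. This is a simplified illustration, not DataStage internals; it assumes key-sorted inputs and at most one reference row per key. Once the driving (stream) side is exhausted, the reference iterator is simply never pulled from again:

```python
def merge_inner_join(stream, reference):
    """Inner-join two key-sorted iterables of (key, value) pairs.

    Consumption of `reference` stops as soon as `stream` is exhausted,
    mirroring how the join operator stops pulling reference rows once
    the driving link runs out of input.
    """
    ref = iter(reference)
    r = next(ref, None)
    for s_key, s_val in stream:
        # Advance the reference side only as far as the current stream key.
        while r is not None and r[0] < s_key:
            r = next(ref, None)
        if r is not None and r[0] == s_key:
            yield (s_key, s_val, r[1])

# Small sorted stream against a much larger sorted reference.
stream = [(2, "a"), (4, "b")]
reference = [(k, f"ref{k}") for k in range(1, 1001)]
joined = list(merge_inner_join(stream, reference))
```

Note that the reference list has 1,000 entries, but the generator never reads past key 4 - which is the join-side behavior described above. The blocking sort in front of it is a separate matter.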
You could try an ORDER BY in your extract SQL followed by an explicit Sort stage with the key mode set to "Don't Sort (Previously Sorted)", but I've never tested whether that also results in blocking.
Mike
Thanks Mike! Exactly what I was looking for :D
Jerome