Joining Billions of records with millions
Moderators: chulett, rschirm, roy
Hi,
I want to join table1 (roughly 100 million records) with table2 (roughly 2 billion records).
The join keys are not primary or partition keys. The tables are in DB2 UDB.
I want to know the effect of using the Join stage versus a Lookup File Set.
I read that lookup file sets are faster in http://datawarehouse.ittoolbox.com/blog ... asp?i=7183. But what about the time taken to load the 2 billion records into a Lookup File Set?
Can you please share your thoughts on this?
Thanks/Regards
Senthil
Hi Senthil,
You would need a LOT of RAM to use a Lookup stage there! I'd recommend a Join stage, since the amount of memory used will be much lower; anybody else working on the same server will thank you for it. Also, sort the DB2 output by your join keys - this will help.
HTH,
Mat
PS: Think about doing the join directly in DB2. Depending on the join and how DB2 is configured, this might work better.
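For intuition on why the Join stage is so much lighter on memory, here is a minimal sketch in plain Python (not DataStage code, and the function name is my own): a join over inputs already sorted on the key can stream, buffering only one key group of the reference side at a time, whereas a lookup must hold the entire 2-billion-row reference table in memory.

```python
def merge_join(left, right, key=lambda r: r[0]):
    """Inner-join two row iterables that are already sorted on the join key.

    Only the current group of matching right-side rows is buffered, so
    memory use is bounded by the largest key group, not by table size.
    """
    right = iter(right)
    r = next(right, None)
    group_key, group = object(), []  # sentinel: matches no real key
    for l in left:
        if key(l) != group_key:
            # advance the right side until its key catches up to the left key
            while r is not None and key(r) < key(l):
                r = next(right, None)
            # buffer the group of right-side rows sharing this key
            group_key, group = key(l), []
            while r is not None and key(r) == group_key:
                group.append(r)
                r = next(right, None)
        for g in group:
            yield l + g[1:]  # emit the joined row, dropping the duplicate key
```

A dict-based lookup would instead need every right-side row resident in RAM before the first left-side row could be processed, which is exactly the problem with 2 billion records.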
Hi,
As Vincent suggested in his blog, use a Join when you need to carry more columns from the reference file. Either way, given the size of the file you have, I would suggest a Join rather than a Lookup in this case.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Yeah, a join with the reference data already sorted should work well, as 2 billion rows will never fit into lookup memory. Make sure your job doesn't try to re-sort the data. The lookup file set would take a long time to load; it may be an option if you use this particular 2-billion-row dataset in multiple jobs.
With these volumes you could also consider adding Data Integrator (formerly DB2 integrator) to the mix and doing the join across databases as a source to a DataStage job.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
A Join is more efficient than a Lookup File Set in your case. Also see whether you can do the join in the DB2 database itself; you may need to talk to your DB2 DBA about this and check the indexes on the join keys.
Regards
Siva
Listening to the Learned
"The most precious wealth is the wealth acquired by the ear Indeed, of all wealth that wealth is the crown." - Thirukural By Thiruvalluvar
Senthil, it is important to know how much scratch disk is available. Since sorting takes place in scratch space, I doubt the volume you are referring to here can be handled in scratch without affecting other applications. I agree with the others on considering a database join.
M.T.Anwer
The day the child realizes that all adults are imperfect he becomes an adolescent;
the day he forgives them, he becomes an adult; the day he forgives himself, he becomes wise.
-Aiden Nowlan
MTA wrote: Senthil, it is important to know how much scratch disk is available. Since sorting takes place in scratch space, I doubt the volume you are referring to here can be handled in scratch without affecting other applications. I agree with the others on considering a database join.

A lookup file set will also build a lookup table beyond a certain limit. Virtual datasets will also take up tmp space, and the transaction log in the database should be monitored carefully during the load.
Handling large data like this should be interesting.
Considering the DB2 tables are partitioned differently and the keys are different, I personally would NOT recommend doing the join in DB2. Just like DataStage, DB2 gets its best join performance when the data is partitioned and sorted the same between your input datasets. When you join between tables that are partitioned differently, then DB2 has to repartition (in memory, I might add) one of the tables on the fly. To be honest, I don't think this is one of DB2's strong points.
We have large tables that we join (although not as large as 2 billion) and we get better performance joining in DataStage where we can easily repartition and sort.
As has been mentioned before, use a Join stage and make sure its inputs are first hash partitioned on your join keys and then sorted (again, on the same keys). When you pull your data from DB2, anything you can do to limit the result set of your query will help. If you don't need 100% of the data from either table to do the join, then DON'T retrieve it. Only pull the columns you must have in your final dataset.
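The "hash partition on the join keys" advice can be sketched like this (a toy illustration in Python of the partitioning idea, not DataStage internals; the function name is mine): rows that share a join key always land in the same partition, so each partition can then be sorted and joined independently of the others.

```python
def hash_partition(rows, key_index, n_partitions):
    """Distribute rows into n_partitions buckets by hashing the join key.

    Because the partition is a pure function of the key, matching rows
    from both tables are guaranteed to meet in the same partition.
    """
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % n_partitions].append(row)
    return parts
```

If the two inputs were partitioned differently (say, round-robin on one side), matching keys could end up on different nodes and the join would silently drop matches, which is why both inputs must use the same key-based partitioning.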
HTH.
Brad.
Re: Joining Billions of records with millions
Hi Senthil
As others said, using a Lookup is a big NO. There are two options:
(a) If the two tables can be indexed on the keys you are trying to join, then join them in the database instead of in DataStage.
(b) If you want to join the two tables in DataStage, check with your DataStage admin whether you have the required scratch disk to handle the join. Also try to restrict the amount of data you bring into DataStage from both tables - for example, only the last month of data, or only records with a flag set to a particular value.
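As a rough illustration of option (b), the extraction query for each table can be restricted to only the rows and columns the join actually needs. The table and column names below are invented for the example, not taken from this thread:

```python
# Hypothetical restricted extraction query for one of the source tables.
# ORDERS, ORDER_ID, CUST_ID, AMOUNT, ORDER_DT and STATUS_FLAG are made-up
# names; substitute your real tables, join keys and filter columns.
QUERY = """
SELECT ORDER_ID, CUST_ID, AMOUNT
FROM ORDERS
WHERE ORDER_DT >= CURRENT DATE - 1 MONTH
  AND STATUS_FLAG = 'A'
"""
```

Pushing the row filter and the column list into the database this way shrinks both the data pulled across the network and the volume DataStage must spill to scratch while sorting.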