Joining Billions of records with millions

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

ThilSe
Participant
Posts: 80
Joined: Thu Jun 09, 2005 7:45 am

Joining Billions of records with millions

Post by ThilSe »

Hi,

I want to join table1 (about 100 million records) with table2 (about 2 billion records).

The join keys are not primary or partition keys. The tables are in DB2 UDB.

I want to know the effect of using the Join stage versus a Lookup fileset.

I read that lookup filesets are faster in http://datawarehouse.ittoolbox.com/blog ... asp?i=7183. But what about the time taken to load the 2 billion records into the lookup fileset?

Can you please share your thoughts on this?

Thanks/Regards
Senthil
Mat01
Premium Member
Posts: 50
Joined: Wed Jun 02, 2004 11:12 am
Location: Montreal, Canada

Post by Mat01 »

Hi Senthil,

You would need a LOT of RAM to use a Lookup stage there! I'd recommend a Join stage, since the amount of memory used will be much lower; anybody else working on the same server will appreciate it. Also, sort the DB2 output by your join keys; this will help.
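For example (just a sketch, with hypothetical table and column names; the point is only the ORDER BY on the join keys):

    -- Hypothetical source query: sort the extract on the join keys in DB2
    -- so the downstream Join stage receives pre-sorted input.
    SELECT cust_id, acct_id, col_a, col_b
    FROM   table1
    ORDER  BY cust_id, acct_id;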

HTH,

Mat

PS: Think about doing the join directly in DB2. Depending on the join and how DB2 is configured, this might work better.
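Purely as a sketch, something along these lines (hypothetical table, column, and key names):

    -- Let DB2 perform the join itself and hand DataStage the joined result.
    SELECT t1.cust_id, t1.col_a, t2.col_b
    FROM   table1 t1
    INNER  JOIN table2 t2
           ON  t1.cust_id = t2.cust_id
           AND t1.acct_id = t2.acct_id;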
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,

As Vincent suggested in his blog, a join is preferable when you need to carry a larger number of columns from the reference file. Still, given the size of the file you have, I would suggest a Join rather than a Lookup in this case.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Yes, a join with the reference data already sorted should work well, as 2 billion records will never fit into lookup memory. Make sure your job doesn't try to re-sort the data. A lookup fileset would take a long time to load and may only be an option if you use this particular 2-billion-record dataset in multiple jobs.

With these volumes you could also consider adding Data Integrator (formerly DB2 integrator) to the mix and doing the join across databases as a source to a DataStage job.
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

A join is more efficient than a lookup fileset in your case. Also, see whether you can do the join in the DB2 database itself. You might need to talk to your DB2 DBA about this and check the indexes on the join keys, etc.
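For instance (a sketch only; the actual columns, schema, and index type are for your DBA to decide):

    -- Index the join keys so DB2 need not scan the full 2-billion-row table,
    -- then refresh statistics so the optimizer knows about the index.
    CREATE INDEX tbl2_join_ix ON table2 (cust_id, acct_id);
    RUNSTATS ON TABLE myschema.table2 AND INDEXES ALL;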
Regards
Siva

Listening to the Learned

"The most precious wealth is the wealth acquired by the ear Indeed, of all wealth that wealth is the crown." - Thirukural By Thiruvalluvar
ThilSe
Participant
Posts: 80
Joined: Thu Jun 09, 2005 7:45 am

Post by ThilSe »

All,

Thanks for your inputs!

-Senthil
MTA
Participant
Posts: 37
Joined: Thu Feb 02, 2006 2:25 pm

Post by MTA »

ThilSe, it is important to know how much scratch disk is available. Since sorting takes place in scratch space, I doubt the volume you are referring to here can be handled in scratch without affecting other applications. I agree with the others on considering a database join.
M.T.Anwer
The day the child realizes that all adults are imperfect, he becomes an adolescent;
the day he forgives them, he becomes an adult; the day he forgives himself, he becomes wise.
-Alden Nowlan
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

MTA wrote: ThilSe, it is important to know how much scratch disk is available. Since sorting takes place in scratch space, I doubt the volume you are referring to here can be handled in scratch without affecting other applications. I agree with the others on considering a database join.
A lookup fileset will also build a lookup table beyond a certain limit. Virtual datasets will also take up temp space. The transaction log in the database should also be monitored carefully during the load.
Handling large data should be interesting :wink:
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

Considering the DB2 tables are partitioned differently and the keys are different, I personally would NOT recommend doing the join in DB2. Just like DataStage, DB2 gets its best join performance when the data is partitioned and sorted the same between your input tables. When you join tables that are partitioned differently, DB2 has to repartition one of them on the fly (in memory, I might add). To be honest, I don't think this is one of DB2's strong points.

We have large tables that we join (although not as large as 2 billion rows) and we get better performance joining in DataStage, where we can easily repartition and sort.

As has been mentioned before, use a Join stage and make sure the data is first hash partitioned on your join keys and then sorted (again, on the same keys). When you pull your data from DB2, anything you can do to limit the result set of your query will help. If you don't need 100% of the data from either table to do the join, then DON'T retrieve it. Only pull the columns you must have in your final dataset.
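As a sketch (hypothetical names), the source query for each table would carry only the join keys and the columns you actually need, plus any row filter you can apply:

    -- Prune columns and rows at the source; the hash partitioning and
    -- sorting on the join keys then happens inside the DataStage job.
    SELECT cust_id, acct_id, balance
    FROM   table2
    WHERE  status = 'A';   -- hypothetical filter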

HTH.

Brad.
Mallayagari
Participant
Posts: 15
Joined: Tue Apr 20, 2004 3:10 am

Re: Joining Billions of records with millions

Post by Mallayagari »

Hi Senthil

As others have said, using a lookup is a big NO. There are two options:
(a) If the two tables can be indexed on the keys you are trying to join, then join them in the database instead of in DataStage.
(b) If you want to join the two tables in DataStage, check with your DataStage admin whether you have the scratch disk required to handle the join. Also try to restrict the amount of data that you bring into DataStage from both tables, e.g. only the last one month of data, or only records with a flag set to something (see the sketch below).
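A sketch of that kind of restriction (hypothetical names and predicates):

    -- Pull only the recent, flagged slice of the big table into DataStage.
    SELECT cust_id, acct_id, col_a
    FROM   table2
    WHERE  load_date >= CURRENT DATE - 1 MONTH
      AND  proc_flag = 'Y';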