Joining Billions of records with millions

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

ThilSe
Participant
Posts: 80
Joined: Thu Jun 09, 2005 7:45 am

Joining Billions of records with millions

Post by ThilSe »

Hi,

I want to join table1 (about 100 million records) with table2 (about 2 billion records).

The join keys are not primary or partition keys. The tables are in DB2 UDB.

I want to know the effect of using the Join stage versus a Lookup fileset.

I read that lookup filesets are faster in http://datawarehouse.ittoolbox.com/blog ... asp?i=7183. But what about the time taken to load the 2 billion records into the lookup fileset?

Can you please share your thoughts on this?

Thanks/Regards
Senthil
Mat01
Premium Member
Posts: 50
Joined: Wed Jun 02, 2004 11:12 am
Location: Montreal, Canada

Post by Mat01 »

Hi Senthil,

You would need a LOT of RAM to use a Lookup stage there! I'd recommend a Join stage, since the amount of memory used will be much lower; anybody else working on the same server will appreciate it. Also, sort the DB2 output by your join keys; this will help.
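For example (just a sketch, with hypothetical table and column names; the point is only the ORDER BY on the join keys):

    -- Hypothetical source query: sort the extract on the join keys in DB2
    -- so the downstream Join stage receives pre-sorted input.
    SELECT cust_id, acct_id, col_a, col_b
    FROM   table1
    ORDER  BY cust_id, acct_id;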

HTH,

Mat

PS: Think about doing the join directly in DB2. Depending on the join and how DB2 is configured, this might work better.
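Purely as a sketch, something along these lines (hypothetical table, column, and key names):

    -- Let DB2 perform the join itself and hand DataStage the joined result.
    SELECT t1.cust_id, t1.col_a, t2.col_b
    FROM   table1 t1
    INNER  JOIN table2 t2
           ON  t1.cust_id = t2.cust_id
           AND t1.acct_id = t2.acct_id;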
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,

As Vincent suggested in his blog, a join is preferable when you need to carry a larger number of columns from the reference file. Still, given the size of the file you have, I would suggest a Join rather than a Lookup in this case.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Yes, a join with the reference data already sorted should work well, as 2 billion records will never fit into lookup memory. Make sure your job doesn't try to re-sort the data. A lookup fileset would take a long time to load and may only be an option if you use this particular 2-billion-record dataset in multiple jobs.

With these volumes you could also consider adding Data Integrator (formerly DB2 integrator) to the mix and doing the join across databases as a source to a DataStage job.
rasi
Participant
Posts: 464
Joined: Fri Oct 25, 2002 1:33 am
Location: Australia, Sydney

Post by rasi »

A join is more efficient than a lookup fileset in your case. Also, see whether you can do the join in the DB2 database itself. You might need to talk to your DB2 DBA about this and check the indexes on the join keys, etc.
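For instance (a sketch only; the actual columns, schema, and index type are for your DBA to decide):

    -- Index the join keys so DB2 need not scan the full 2-billion-row table,
    -- then refresh statistics so the optimizer knows about the index.
    CREATE INDEX tbl2_join_ix ON table2 (cust_id, acct_id);
    RUNSTATS ON TABLE myschema.table2 AND INDEXES ALL;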
Regards
Siva

Listening to the Learned

"The most precious wealth is the wealth acquired by the ear Indeed, of all wealth that wealth is the crown." - Thirukural By Thiruvalluvar
ThilSe
Participant
Posts: 80
Joined: Thu Jun 09, 2005 7:45 am

Post by ThilSe »

All,

Thanks for your inputs!

-Senthil
MTA
Participant
Posts: 37
Joined: Thu Feb 02, 2006 2:25 pm

Post by MTA »

ThilSe, it is important to know how much scratch disk is available. Since sorting takes place in scratch space, I doubt the volume you are referring to here can be handled in scratch without affecting other applications. I agree with the others on considering a database join.
M.T.Anwer
The day the child realizes that all adults are imperfect, he becomes an adolescent;
the day he forgives them, he becomes an adult; the day he forgives himself, he becomes wise.
-Alden Nowlan
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

MTA wrote: ThilSe, it is important to know how much scratch disk is available. Since sorting takes place in scratch space, I doubt the volume you are referring to here can be handled in scratch without affecting other applications. I agree with the others on considering a database join.
A lookup fileset will also build a lookup table beyond a certain limit. Virtual datasets will also take up temp space. The transaction log in the database should also be monitored carefully during the load.
Handling large data should be interesting :wink:
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

Considering the DB2 tables are partitioned differently and the keys are different, I personally would NOT recommend doing the join in DB2. Just like DataStage, DB2 gets its best join performance when the data is partitioned and sorted the same between your input tables. When you join tables that are partitioned differently, DB2 has to repartition one of them on the fly (in memory, I might add). To be honest, I don't think this is one of DB2's strong points.

We have large tables that we join (although not as large as 2 billion rows) and we get better performance joining in DataStage, where we can easily repartition and sort.

As has been mentioned before, use a Join stage and make sure the data is first hash partitioned on your join keys and then sorted (again, on the same keys). When you pull your data from DB2, anything you can do to limit the result set of your query will help. If you don't need 100% of the data from either table to do the join, then DON'T retrieve it. Only pull the columns you must have in your final dataset.
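As a sketch (hypothetical names), the source query for each table would carry only the join keys and the columns you actually need, plus any row filter you can apply:

    -- Prune columns and rows at the source; the hash partitioning and
    -- sorting on the join keys then happens inside the DataStage job.
    SELECT cust_id, acct_id, balance
    FROM   table2
    WHERE  status = 'A';   -- hypothetical filter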

HTH.

Brad.
Mallayagari
Participant
Posts: 15
Joined: Tue Apr 20, 2004 3:10 am

Re: Joining Billions of records with millions

Post by Mallayagari »

Hi Senthil

As others have said, using a lookup is a big NO. There are two options:
(a) If the two tables can be indexed on the keys you are trying to join, then join them in the database instead of in DataStage.
(b) If you want to join the two tables in DataStage, check with your DataStage admin whether you have the scratch disk required to handle the join. Also try to restrict the amount of data that you bring into DataStage from both tables, e.g. only the last one month of data, or only records with a flag set to something (see the sketch below).
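A sketch of that kind of restriction (hypothetical names and predicates):

    -- Pull only the recent, flagged slice of the big table into DataStage.
    SELECT cust_id, acct_id, col_a
    FROM   table2
    WHERE  load_date >= CURRENT DATE - 1 MONTH
      AND  proc_flag = 'Y';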