Joining of tables

aparnanarale · Post by **aparnanarale** » Wed Feb 09, 2005 12:05 am

I want to join ten DB2 tables and put the data into a sequential file.There will be millions of records.Is it better to write a sql query in a DB2 stage
or can i extract the data into hash files and the have a join on these files?
Performace wise which one will be better?

Aparna

wnogalski · Post by **wnogalski** » Wed Feb 09, 2005 1:56 am

It depends - sometimes the join on the DB side will perform better, sometimes it better to do joins (or some of them) on DS side. You have to experiment

ray.wurlod · Post by **ray.wurlod** » Wed Feb 09, 2005 1:58 am

Welcome aboard! :D

Your answer is "it depends". If the resources are there to do a ten table join in the database, do that. It will mean that DataStage only has to process the result set, and processes no redundant data. In general this is the preferred approach.

The other approaches are:
(a) to have DataStage to process all relevant rows from the driver table and perform lookups against the nine probe tables
(b) to pre-load hashed files from the nine probe tables and use these, cached in memory for preference, for the lookups

In the first case DB2 probably ends up doing even more work than the ten-way join, and DataStage gets to process a whole heap of data it ends up not using. All in all, a forgettable approach.

In the second case, performance can be better or worse than doing the ten-way join, depending on the load on DB2, the load on the machine, the load on the network, and so on. So, "it depends".

wnogalski · Post by **wnogalski** » Wed Feb 09, 2005 2:29 am

If You'll try to join everything on the DB side and the performance is poor see which join makes the query run so long, then try to perform this join using DS, and so on.

aparnanarale · Post by **aparnanarale** » Wed Feb 09, 2005 4:16 am

Thanks ray.Lookup on table will be costly when the data is huge.Also every time it will hit the table and connection will be open.There will be posiibility of deadlock.So can't have lookup on table.

Can try the second option.
Also will it take more time to process the SQL query having so many joins?I'll be writing that query in the Datastage DB2 stage.

Aparna

ray.wurlod wrote:Welcome aboard! :D

Your answer is "it depends". If the resources are there to do a ten table join in the database, do that. It will mean that DataStage only has to process the result set, and processes no redundant data. In general this is the preferred approach.

The other approaches are:
(a) to have DataStage to process all relevant rows from the driver table and perform lookups against the nine probe tables
(b) to pre-load hashed files from the nine probe tables and use these, cached in memory for preference, for the lookups

In the first case DB2 probably ends up doing even more work than the ten-way join, and DataStage gets to process a whole heap of data it ends up not using. All in all, a forgettable approach.

In the second case, performance can be better or worse than doing the ten-way join, depending on the load on DB2, the load on the machine, the load on the network, and so on. So, "it depends".

aparnanarale · Post by **aparnanarale** » Wed Feb 09, 2005 4:19 am

Thanks wnogalski.
I will try it out.

Aparna

wnogalski wrote:If You'll try to join everything on the DB side and the performance is poor see which join makes the query run so long, then try to perform this join using DS, and so on.

ray.wurlod · Post by **ray.wurlod** » Wed Feb 09, 2005 4:23 pm

Ultimately, whether you do the joins in the database or (to DB2 tables) in DataStage, the same amount of join work has to take place.

The same amount of work still has to take place if you use hashed files, but these can be cached in memory (if not too large), which is where the throughput gain comes from.

amsh76 · Post by **amsh76** » Wed Feb 09, 2005 4:39 pm

I would suggest try to join the tables if are indexed on Join Conditions. The DataStage job may not return any result for first few sec..but once the resultset is ready its going to run very fast.

But again it all depends..

You can also join 5 tables in Source stage and create 5 lookups..ut all depends on structure of your tables and joining conditions.