Joining of tables

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
aparnanarale
Participant
Posts: 6
Joined: Mon Feb 07, 2005 11:54 pm

Joining of tables

Post by aparnanarale »

I want to join ten DB2 tables and put the data into a sequential file.There will be millions of records.Is it better to write a sql query in a DB2 stage
or can i extract the data into hash files and the have a join on these files?
Performace wise which one will be better?

Aparna
wnogalski
Charter Member
Charter Member
Posts: 54
Joined: Thu Jan 06, 2005 10:49 am
Location: Warsaw

Post by wnogalski »

It depends - sometimes the join on the DB side will perform better, sometimes it better to do joins (or some of them) on DS side. You have to experiment :wink:
Regards,
Wojciech Nogalski
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard! :D

Your answer is "it depends". If the resources are there to do a ten table join in the database, do that. It will mean that DataStage only has to process the result set, and processes no redundant data. In general this is the preferred approach.

The other approaches are:
(a) to have DataStage to process all relevant rows from the driver table and perform lookups against the nine probe tables
(b) to pre-load hashed files from the nine probe tables and use these, cached in memory for preference, for the lookups

In the first case DB2 probably ends up doing even more work than the ten-way join, and DataStage gets to process a whole heap of data it ends up not using. All in all, a forgettable approach.

In the second case, performance can be better or worse than doing the ten-way join, depending on the load on DB2, the load on the machine, the load on the network, and so on. So, "it depends".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
wnogalski
Charter Member
Charter Member
Posts: 54
Joined: Thu Jan 06, 2005 10:49 am
Location: Warsaw

Post by wnogalski »

If You'll try to join everything on the DB side and the performance is poor see which join makes the query run so long, then try to perform this join using DS, and so on.
Regards,
Wojciech Nogalski
aparnanarale
Participant
Posts: 6
Joined: Mon Feb 07, 2005 11:54 pm

Post by aparnanarale »

Thanks ray.Lookup on table will be costly when the data is huge.Also every time it will hit the table and connection will be open.There will be posiibility of deadlock.So can't have lookup on table.

Can try the second option.
Also will it take more time to process the SQL query having so many joins?I'll be writing that query in the Datastage DB2 stage.

Aparna
ray.wurlod wrote:Welcome aboard! :D

Your answer is "it depends". If the resources are there to do a ten table join in the database, do that. It will mean that DataStage only has to process the result set, and processes no redundant data. In general this is the preferred approach.

The other approaches are:
(a) to have DataStage to process all relevant rows from the driver table and perform lookups against the nine probe tables
(b) to pre-load hashed files from the nine probe tables and use these, cached in memory for preference, for the lookups

In the first case DB2 probably ends up doing even more work than the ten-way join, and DataStage gets to process a whole heap of data it ends up not using. All in all, a forgettable approach.

In the second case, performance can be better or worse than doing the ten-way join, depending on the load on DB2, the load on the machine, the load on the network, and so on. So, "it depends".
aparnanarale
Participant
Posts: 6
Joined: Mon Feb 07, 2005 11:54 pm

Post by aparnanarale »

Thanks wnogalski.
I will try it out.

Aparna
wnogalski wrote:If You'll try to join everything on the DB side and the performance is poor see which join makes the query run so long, then try to perform this join using DS, and so on.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Ultimately, whether you do the joins in the database or (to DB2 tables) in DataStage, the same amount of join work has to take place.

The same amount of work still has to take place if you use hashed files, but these can be cached in memory (if not too large), which is where the throughput gain comes from.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
amsh76
Charter Member
Charter Member
Posts: 118
Joined: Wed Mar 10, 2004 10:58 pm

Post by amsh76 »

I would suggest try to join the tables if are indexed on Join Conditions. The DataStage job may not return any result for first few sec..but once the resultset is ready its going to run very fast.

But again it all depends..

You can also join 5 tables in Source stage and create 5 lookups..ut all depends on structure of your tables and joining conditions.
Post Reply