I want to join ten DB2 tables and put the data into a sequential file.There will be millions of records.Is it better to write a sql query in a DB2 stage
or can i extract the data into hash files and the have a join on these files?
Performace wise which one will be better?
Aparna
Joining of tables
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Welcome aboard! :D
Your answer is "it depends". If the resources are there to do a ten table join in the database, do that. It will mean that DataStage only has to process the result set, and processes no redundant data. In general this is the preferred approach.
The other approaches are:
(a) to have DataStage to process all relevant rows from the driver table and perform lookups against the nine probe tables
(b) to pre-load hashed files from the nine probe tables and use these, cached in memory for preference, for the lookups
In the first case DB2 probably ends up doing even more work than the ten-way join, and DataStage gets to process a whole heap of data it ends up not using. All in all, a forgettable approach.
In the second case, performance can be better or worse than doing the ten-way join, depending on the load on DB2, the load on the machine, the load on the network, and so on. So, "it depends".
Your answer is "it depends". If the resources are there to do a ten table join in the database, do that. It will mean that DataStage only has to process the result set, and processes no redundant data. In general this is the preferred approach.
The other approaches are:
(a) to have DataStage to process all relevant rows from the driver table and perform lookups against the nine probe tables
(b) to pre-load hashed files from the nine probe tables and use these, cached in memory for preference, for the lookups
In the first case DB2 probably ends up doing even more work than the ten-way join, and DataStage gets to process a whole heap of data it ends up not using. All in all, a forgettable approach.
In the second case, performance can be better or worse than doing the ten-way join, depending on the load on DB2, the load on the machine, the load on the network, and so on. So, "it depends".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 6
- Joined: Mon Feb 07, 2005 11:54 pm
Thanks ray.Lookup on table will be costly when the data is huge.Also every time it will hit the table and connection will be open.There will be posiibility of deadlock.So can't have lookup on table.
Can try the second option.
Also will it take more time to process the SQL query having so many joins?I'll be writing that query in the Datastage DB2 stage.
Aparna
Can try the second option.
Also will it take more time to process the SQL query having so many joins?I'll be writing that query in the Datastage DB2 stage.
Aparna
ray.wurlod wrote:Welcome aboard! :D
Your answer is "it depends". If the resources are there to do a ten table join in the database, do that. It will mean that DataStage only has to process the result set, and processes no redundant data. In general this is the preferred approach.
The other approaches are:
(a) to have DataStage to process all relevant rows from the driver table and perform lookups against the nine probe tables
(b) to pre-load hashed files from the nine probe tables and use these, cached in memory for preference, for the lookups
In the first case DB2 probably ends up doing even more work than the ten-way join, and DataStage gets to process a whole heap of data it ends up not using. All in all, a forgettable approach.
In the second case, performance can be better or worse than doing the ten-way join, depending on the load on DB2, the load on the machine, the load on the network, and so on. So, "it depends".
-
- Participant
- Posts: 6
- Joined: Mon Feb 07, 2005 11:54 pm
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Ultimately, whether you do the joins in the database or (to DB2 tables) in DataStage, the same amount of join work has to take place.
The same amount of work still has to take place if you use hashed files, but these can be cached in memory (if not too large), which is where the throughput gain comes from.
The same amount of work still has to take place if you use hashed files, but these can be cached in memory (if not too large), which is where the throughput gain comes from.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I would suggest try to join the tables if are indexed on Join Conditions. The DataStage job may not return any result for first few sec..but once the resultset is ready its going to run very fast.
But again it all depends..
You can also join 5 tables in Source stage and create 5 lookups..ut all depends on structure of your tables and joining conditions.
But again it all depends..
You can also join 5 tables in Source stage and create 5 lookups..ut all depends on structure of your tables and joining conditions.