DB2 SQL Join Vs Join Stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
bakul
Participant
Posts: 60
Joined: Wed Nov 10, 2004 2:12 am

DB2 SQL Join Vs Join Stage

Post by bakul »

I have 3 tables A, B and C. I need three kinds of information
1) Information from table A
2) Information from table B for all rows in A
3) Information from table C for all the rows in B.
Which of the following approaches would be better performance-wise:
1) Use 3 DB2 stages each to extract the above 3 information chunks i.e use join in the SQL statement (1 SQL statement would extract data from 'A', 2nd SQL statement would join A and B and the third would join A and C)
Or
2) Extract records from the 3 tables and join them using Join Stage
?

:roll:
Also in the first approach would it be better to extract the data in datasets and then process it or should the data from DB2 stage be used directly without generating intermediate datasets?

Regards,
Bakul
dxp
Charter Member
Charter Member
Posts: 48
Joined: Sat Oct 01, 2005 10:23 am

Re: DB2 SQL Join Vs Join Stage

Post by dxp »

bakul wrote:1) Use 3 DB2 stages each to extract the above 3 information chunks i.e use join in the SQL statement (1 SQL statement would extract data from 'A', 2nd SQL statement would join A and B and the third would join A and C)
Or
2) Extract records from the 3 tables and join them using Join Stage
?

Regards,
Bakul

Hi bakul,

if total records in each table is less, then
1. write sql using one db2 stage using 'userdefined sql' option. load this result to 'dataset'...then in another job write this dataset to db2.
u can write to db2 within the same job. but for modularity its better to use different job.

2. u can use 'userdefine sql' option for joining two tables in one db2 stage, and then use another db2 stage to join third table.....then...to..dataset...to db2.


dxp.
bakul
Participant
Posts: 60
Joined: Wed Nov 10, 2004 2:12 am

Post by bakul »

Thanks for the reply!
However my query is slightly different. I wish to know which option is better -
1) Joining in the SQL query
or
2) Joining using Join Stage

Regards,
Bakul
dxp
Charter Member
Charter Member
Posts: 48
Joined: Sat Oct 01, 2005 10:23 am

Post by dxp »

bakul wrote:Thanks for the reply!
However my query is slightly different. I wish to know which option is better -
1) Joining in the SQL query
or
2) Joining using Join Stage

Regards,
Bakul

IF QUERY IS NOT COMLEX FOR UNDERSTANDABILITY.......i.e...if infuture someone has to look at it...they should understand with a little bit effort.HOPE NOW U GOT THE IDEA ABOUT WHERE TO BREAK THE QUERY.
One good thing about db2 is, u can write any length of query in one db2 stage. previously there was bug which restricts the query upto 1K. but it was fixed.
so, break the query according to understandability.

its better to use less no. of db2 stages.
as no. of db2 stages increases...performance will reduce due to no. of times DataStage has to interact with db2.


dxp.
track_star
Participant
Posts: 60
Joined: Sat Jan 24, 2004 12:52 pm
Location: Mount Carmel, IL

Post by track_star »

Are the tables partitioned? If not, I agree with dxp. If so, I disagree. All user-defined SQL queries are read sequentially into the framework, whereas a partitioned table will be read into the framework in parallel, with the number of nodes equal to the number of partitions.

Another factor would be the number of rows in the table. If there aren't that many rows in the tables, you could go with the user-defined approach. If the tables are large, you would probably be better off letting the framework handle the load. Data pipelining in the framework will immediately begin if all you're doing is reading records from a table, where the pipeline won't begin to fill in a user-defined query until the query itself begins to return records. If that query takes any amount of time to run, you're going to wait for everything else to start. Hope that helps!
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Good point about partitioning.

Volume: The other thing I take into consideration is whether joining in SQL provides access to a filter that you would normally apply in the job that significantly reduces the number of rows you are taking out of DB2. For example 4 million rows in Table A, filter on a value in table B reduces this to 2 million. More relevent for a remote DB2 database where rows come across a network.

Database indexing: Do you have the indexes on the three tables in DB2 to perform the sql efficiently? How long does your SQL take? Is that an acceptable time? Double it to take into account future growth and RDBMS load, is it still an acceptable time?

Database load: Are there database users who are inconvenienced by complex joins?

Shared lookups: Are you joining to tables B and C in other jobs? Can you turn them into lookup filesets or datasets and share them between jobs? Faster jobs, less network traffic, less RDBMS load.
dxp
Charter Member
Charter Member
Posts: 48
Joined: Sat Oct 01, 2005 10:23 am

Post by dxp »

Hi track_star and vmcburney,

thank for great inputs about performance issues.



dxp.
bakul
Participant
Posts: 60
Joined: Wed Nov 10, 2004 2:12 am

Post by bakul »

Thanks for all your inputs. :)
Post Reply