Regarding how to improve performance for joins

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
syed_subhaan
Participant
Posts: 8
Joined: Sat Apr 22, 2006 2:28 am
Location: Bangalore
Contact:

Regarding how to improve performance for joins

Post by syed_subhaan »

Hi All,
This is my first post to the group.

Just wanted to know how to increase the performance for the join.
My input data is around two million and reference link(for DB2 tqable is around 40-70 million.
I see that join itself which is better alternative over lookup, is taking lot of time. To read these many records into datastage is taking lot of time around 15 mins .
i am using 4 node configuration.
Thanks in advance for the reply.. :D
Thanks & Regards,
Subhani Basha Syed.
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi and Welcome Aboard :),
Would doing it the other way around work?

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
Image
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Are you even loading the data in that same job to a DB2 table?
If YES, then you better split the job and keep the lookup piece exclusively for lookup only. You may load from and to a Fileset or a dataset. And have the join stage doing the joins.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
koolnitz
Participant
Posts: 138
Joined: Wed Sep 07, 2005 5:39 am

Post by koolnitz »

i. When you are talking to huge tables, try to INDEX it properly while doing a SELECT. Build and use proper indexes on your 40-70m records table.
ii. Not pretty sure, but Hash partitioning may help you in this scenario..
Nitin Jain | India

If everything seems to be going well, you have obviously overlooked something.
syed_subhaan
Participant
Posts: 8
Joined: Sat Apr 22, 2006 2:28 am
Location: Bangalore
Contact:

Post by syed_subhaan »

koolnitz wrote:i. When you are talking to huge tables, try to INDEX it properly while doing a SELECT. Build and use proper indexes on your 40-70m records table.
ii. Not pretty sure, but Hash partitioning may help you in this scenario..
If the two tables on which i trying to join are indexed properly(i mean on the same keys) the DB2 join would be very quick enough.
Even though the data is huge if it is partitoned properly in datasatge, join won't take lot of time.
But the problem here is with reading data into datastage,which is taking lot of time.
Thanks & Regards,
Subhani Basha Syed.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The result set of the query should stream into DataStage as fast as DB2 can deliver. The problem is somewhere else in your job design. Prove this with a job designed as

Code: Select all

DB2 -----> Peek
with your join query in the DB2 stage. This job will show you that the join is not the issue, if what you say about the indexes is correct.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
bcarlson
Premium Member
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

If you have having performance issues reading the large volume tables, check out the topic:

viewtopic.php?t=100002&start=0&postdays ... highlight=

I post some info there about how to setup a parallel query from DB2.

Regarding your join, it sounds like you may get better performance doing in DB2 instead of DataStage - so just use a user-defined SQL instead of a table read.

If you need to do your join in DataStage, DONT USE the lookup - this is too high volume. Use a regular join stage (if you aren't already) and make sure you hash and sort (in that order) the output from both DB2 read stages on the SAME key(s) that you will be joining on:

Code: Select all

db2read1  ->  hash(KEY)  ->  sort(KEY)  
                                                        >------> join(KEY) --> output....
db2read2  ->  hash(KEY)  ->  sort(KEY) 


Brad.
bcarlson
Premium Member
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

Okay, I have a dumb question for anyone out there. I have seen Ray, Roy, and Arndw (and many others) post links to existing topics. But when they do it, the link shows the name of the topic not the garbled URL. Oh great DS gurus, please enlighten me! How do you do it?

Brad.

ps. Sorry for posting this here, but this way I can 'quote' my own example :)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Position your mouse pointer over the URL button (without clicking), and one line of help appears above the text field. It shows the two "Insert URL" syntaxes. We use the second one to effect what you describe. For example:

[url=http://www.shibumi.org/eoti.htm]End of the Internet[/url]
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
bcarlson
Premium Member
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

testing url posting:

Performance Issue

Did this work?

Cool, learn something new everyday! Thanks!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
syed_subhaan
Participant
Posts: 8
Joined: Sat Apr 22, 2006 2:28 am
Location: Bangalore
Contact:

Post by syed_subhaan »

bcarlson wrote:If you have having performance issues reading the large volume tables, check out the topic:

viewtopic.php?t=100002&start=0&postdays ... highlight=

I post some info there about how to setup a parallel query from DB2.

Regarding your join, it sounds like you may get better performance doing in DB2 instead of DataStage - so just use a user-defined SQL instead of a table read.

If you need to do your join in DataStage, DONT USE the lookup - this is too high volume. Use a regular join stage (if you aren't already) and make sure you hash and sort (in that order) the output from both DB2 read stages on the SAME key(s) that you will be joining on:

Code: Select all

db2read1  ->  hash(KEY)  ->  sort(KEY)  
                                                        >------> join(KEY) --> output....
db2read2  ->  hash(KEY)  ->  sort(KEY) 


Brad.


Hi Brad,
Thanks for your post.That has been very eduacative.
Now i feel that my problem can be solved if i make use of DB2 partitions which runs on 64 parallel servers for join by using user defined sql and read it into datastage making use of the nodes in Datastage :D .
Thanks & Regards,
Subhani Basha Syed.
bcarlson
Premium Member
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

ray.wurlod wrote:Did you go to the end of the Internet?

So did VP Al Gore invent that, too? :wink:
Post Reply