Regarding how to improve performance for joins

syed_subhaan · Post by **syed_subhaan** » Sat Apr 22, 2006 5:53 am

Hi All,
This is my first post to the group.

Just wanted to know how to increase the performance for the join.
My input data is around two million records, and the reference link (from a DB2 table) is around 40-70 million records.
I see that the join itself, which is a better alternative to a lookup, is taking a lot of time. Reading this many records into DataStage takes around 15 minutes.
I am using a 4-node configuration.
Thanks in advance for the reply.. :D

roy · Post by **roy** » Sun Apr 23, 2006 2:03 am

Hi and welcome aboard,
Would doing it the other way around work?

IHTH,

DSguru2B · Post by **DSguru2B** » Sun Apr 23, 2006 9:54 pm

Are you also loading the data into a DB2 table in that same job?
If YES, then you had better split the job and keep the lookup piece exclusively for the lookup. You can stage the data to and from a fileset or a dataset, and have the join stage do the joins.

koolnitz · Post by **koolnitz** » Sun Apr 23, 2006 10:16 pm

i. When you are dealing with huge tables, make sure they are indexed properly for your SELECT. Build and use proper indexes on your 40-70m record table.
ii. Not entirely sure, but hash partitioning may help you in this scenario.

syed_subhaan · Post by **syed_subhaan** » Mon Apr 24, 2006 1:34 am

koolnitz wrote:
i. When you are dealing with huge tables, make sure they are indexed properly for your SELECT. Build and use proper indexes on your 40-70m record table.
ii. Not entirely sure, but hash partitioning may help you in this scenario.

If the two tables I am trying to join are indexed properly (I mean on the same keys), the DB2 join would be quick enough.
Even though the data is huge, if it is partitioned properly in DataStage, the join won't take a lot of time.
But the problem here is reading the data into DataStage, which is taking a lot of time.

ray.wurlod · Post by **ray.wurlod** » Mon Apr 24, 2006 1:39 am

The result set of the query should stream into DataStage as fast as DB2 can deliver. The problem is somewhere else in your job design. Prove this with a job designed as

Code:

DB2 -----> Peek

with your join query in the DB2 stage. This job will show you that the join is not the issue, if what you say about the indexes is correct.
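
The same "isolate the source" idea can be sketched outside DataStage. This is only an illustration, assuming the result set is available as an iterable of rows; `drain_and_time` and `fake_source` are hypothetical names, and `fake_source` merely stands in for the DB2 query. It consumes every row and discards it, much as a Peek stage would, so the elapsed time reflects the read alone.

```python
import time


def drain_and_time(rows):
    """Consume every row from `rows`, discard it, and report throughput."""
    start = time.perf_counter()
    count = 0
    for _ in rows:
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small inputs.
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, elapsed, rate


def fake_source(n):
    """Hypothetical stand-in for the DB2 result set: yields n small rows."""
    for i in range(n):
        yield {"key": i}


if __name__ == "__main__":
    count, elapsed, rate = drain_and_time(fake_source(100_000))
    print(f"{count} rows in {elapsed:.3f}s ({rate:,.0f} rows/s)")
```

If the read alone accounts for most of the 15 minutes, the join is not the bottleneck; if it does not, look at the rest of the job.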

bcarlson · Post by **bcarlson** » Tue Apr 25, 2006 12:24 pm

If you are having performance issues reading the large volume tables, check out the topic:

viewtopic.php?t=100002&start=0&postdays ... highlight=

I posted some info there about how to set up a parallel query from DB2.

Regarding your join, it sounds like you may get better performance doing it in DB2 instead of DataStage - so just use a user-defined SQL instead of a table read.

If you need to do your join in DataStage, DON'T USE the lookup - this is too high volume. Use a regular join stage (if you aren't already) and make sure you hash and sort (in that order) the output from both DB2 read stages on the SAME key(s) that you will be joining on:

Code:

db2read1  ->  hash(KEY)  ->  sort(KEY)
                                        >------> join(KEY) --> output....
db2read2  ->  hash(KEY)  ->  sort(KEY)
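
To illustrate why hash-then-sort on the same key matters, here is a rough Python sketch of the flow above. It is not DataStage code; the function names, the use of `crc32` as the hash, and the default of 4 partitions are all assumptions for illustration. Hashing sends rows with equal keys to the same partition on both inputs, and sorting lets each partition be joined in a single merge pass.

```python
from zlib import crc32


def hash_partition(rows, key, n_parts):
    """Route each row to a partition chosen by a hash of its join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        # crc32 is deterministic, so equal keys from either input
        # always land in the same partition number.
        idx = crc32(str(row[key]).encode("utf-8")) % n_parts
        parts[idx].append(row)
    return parts


def merge_join(left, right, key):
    """Inner-join two lists that are already sorted on `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**r, **left[i]})
                i += 1
            j = j_end
    return out


def partitioned_join(left_rows, right_rows, key, n_parts=4):
    """hash(KEY) -> sort(KEY) on both inputs, then join per partition."""
    left_parts = hash_partition(left_rows, key, n_parts)
    right_parts = hash_partition(right_rows, key, n_parts)
    result = []
    for lp, rp in zip(left_parts, right_parts):
        lp.sort(key=lambda r: r[key])
        rp.sort(key=lambda r: r[key])
        result.extend(merge_join(lp, rp, key))
    return result
```

In a real parallel job each partition would be processed on its own node at the same time; the loop here runs them one after another only to keep the sketch short.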

Brad.

bcarlson · Post by **bcarlson** » Tue Apr 25, 2006 12:27 pm

bcarlson wrote:viewtopic.php?t=100002&start=0&postdays ... highlight=

Okay, I have a dumb question for anyone out there. I have seen Ray, Roy, and Arndw (and many others) post links to existing topics. But when they do it, the link shows the name of the topic not the garbled URL. Oh great DS gurus, please enlighten me! How do you do it?

Brad.

ps. Sorry for posting this here, but this way I can 'quote' my own example

ray.wurlod · Post by **ray.wurlod** » Tue Apr 25, 2006 3:35 pm

Position your mouse pointer over the URL button (without clicking), and one line of help appears above the text field. It shows the two "Insert URL" syntaxes. We use the second one to effect what you describe. For example:

[url=http://www.shibumi.org/eoti.htm]End of the Internet[/url]

bcarlson · Post by **bcarlson** » Tue Apr 25, 2006 3:49 pm

testing url posting:

Performance Issue

Did this work?

Cool, learn something new every day! Thanks!

ray.wurlod · Post by **ray.wurlod** » Tue Apr 25, 2006 4:45 pm

Did you go to the end of the Internet?

syed_subhaan · Post by **syed_subhaan** » Wed Apr 26, 2006 1:29 am

Hi Brad,
Thanks for your post. That has been very educational.
Now I feel that my problem can be solved if I do the join in DB2 with user-defined SQL, making use of the DB2 partitions (which run on 64 parallel servers), and then read the result into DataStage using the DataStage nodes :D .

bcarlson · Post by **bcarlson** » Wed Apr 26, 2006 9:06 am

ray.wurlod wrote:Did you go to the end of the Internet?

So did VP Al Gore invent that, too?