Regarding how to improve performance for joins
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 8
- Joined: Sat Apr 22, 2006 2:28 am
- Location: Bangalore
- Contact:
Regarding how to improve performance for joins
Hi All,
This is my first post to the group.
Just wanted to know how to increase the performance for the join.
My input data is around two million and reference link(for DB2 tqable is around 40-70 million.
I see that join itself which is better alternative over lookup, is taking lot of time. To read these many records into datastage is taking lot of time around 15 mins .
i am using 4 node configuration.
Thanks in advance for the reply.. :D
This is my first post to the group.
Just wanted to know how to increase the performance for the join.
My input data is around two million and reference link(for DB2 tqable is around 40-70 million.
I see that join itself which is better alternative over lookup, is taking lot of time. To read these many records into datastage is taking lot of time around 15 mins .
i am using 4 node configuration.
Thanks in advance for the reply.. :D
Thanks & Regards,
Subhani Basha Syed.
Subhani Basha Syed.
Hi and Welcome Aboard ,
Would doing it the other way around work?
IHTH,
Would doing it the other way around work?
IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.
Search before posting:)
Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
Time is money but when you don't have money time is all you can afford.
Search before posting:)
Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
Are you even loading the data in that same job to a DB2 table?
If YES, then you better split the job and keep the lookup piece exclusively for lookup only. You may load from and to a Fileset or a dataset. And have the join stage doing the joins.
If YES, then you better split the job and keep the lookup piece exclusively for lookup only. You may load from and to a Fileset or a dataset. And have the join stage doing the joins.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
i. When you are talking to huge tables, try to INDEX it properly while doing a SELECT. Build and use proper indexes on your 40-70m records table.
ii. Not pretty sure, but Hash partitioning may help you in this scenario..
ii. Not pretty sure, but Hash partitioning may help you in this scenario..
Nitin Jain | India
If everything seems to be going well, you have obviously overlooked something.
If everything seems to be going well, you have obviously overlooked something.
-
- Participant
- Posts: 8
- Joined: Sat Apr 22, 2006 2:28 am
- Location: Bangalore
- Contact:
If the two tables on which i trying to join are indexed properly(i mean on the same keys) the DB2 join would be very quick enough.koolnitz wrote:i. When you are talking to huge tables, try to INDEX it properly while doing a SELECT. Build and use proper indexes on your 40-70m records table.
ii. Not pretty sure, but Hash partitioning may help you in this scenario..
Even though the data is huge if it is partitoned properly in datasatge, join won't take lot of time.
But the problem here is with reading data into datastage,which is taking lot of time.
Thanks & Regards,
Subhani Basha Syed.
Subhani Basha Syed.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The result set of the query should stream into DataStage as fast as DB2 can deliver. The problem is somewhere else in your job design. Prove this with a job designed as with your join query in the DB2 stage. This job will show you that the join is not the issue, if what you say about the indexes is correct.
Code: Select all
DB2 -----> Peek
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
If you have having performance issues reading the large volume tables, check out the topic:
viewtopic.php?t=100002&start=0&postdays ... highlight=
I post some info there about how to setup a parallel query from DB2.
Regarding your join, it sounds like you may get better performance doing in DB2 instead of DataStage - so just use a user-defined SQL instead of a table read.
If you need to do your join in DataStage, DONT USE the lookup - this is too high volume. Use a regular join stage (if you aren't already) and make sure you hash and sort (in that order) the output from both DB2 read stages on the SAME key(s) that you will be joining on:
Brad.
viewtopic.php?t=100002&start=0&postdays ... highlight=
I post some info there about how to setup a parallel query from DB2.
Regarding your join, it sounds like you may get better performance doing in DB2 instead of DataStage - so just use a user-defined SQL instead of a table read.
If you need to do your join in DataStage, DONT USE the lookup - this is too high volume. Use a regular join stage (if you aren't already) and make sure you hash and sort (in that order) the output from both DB2 read stages on the SAME key(s) that you will be joining on:
Code: Select all
db2read1 -> hash(KEY) -> sort(KEY)
>------> join(KEY) --> output....
db2read2 -> hash(KEY) -> sort(KEY)
Brad.
Okay, I have a dumb question for anyone out there. I have seen Ray, Roy, and Arndw (and many others) post links to existing topics. But when they do it, the link shows the name of the topic not the garbled URL. Oh great DS gurus, please enlighten me! How do you do it?bcarlson wrote:viewtopic.php?t=100002&start=0&postdays ... highlight=
Brad.
ps. Sorry for posting this here, but this way I can 'quote' my own example
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Position your mouse pointer over the URL button (without clicking), and one line of help appears above the text field. It shows the two "Insert URL" syntaxes. We use the second one to effect what you describe. For example:
[url=http://www.shibumi.org/eoti.htm]End of the Internet[/url]
[url=http://www.shibumi.org/eoti.htm]End of the Internet[/url]
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Did you go to the end of the Internet?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 8
- Joined: Sat Apr 22, 2006 2:28 am
- Location: Bangalore
- Contact:
bcarlson wrote:If you have having performance issues reading the large volume tables, check out the topic:
viewtopic.php?t=100002&start=0&postdays ... highlight=
I post some info there about how to setup a parallel query from DB2.
Regarding your join, it sounds like you may get better performance doing in DB2 instead of DataStage - so just use a user-defined SQL instead of a table read.
If you need to do your join in DataStage, DONT USE the lookup - this is too high volume. Use a regular join stage (if you aren't already) and make sure you hash and sort (in that order) the output from both DB2 read stages on the SAME key(s) that you will be joining on:
Code: Select all
db2read1 -> hash(KEY) -> sort(KEY) >------> join(KEY) --> output.... db2read2 -> hash(KEY) -> sort(KEY)
Brad.
Hi Brad,
Thanks for your post.That has been very eduacative.
Now i feel that my problem can be solved if i make use of DB2 partitions which runs on 64 parallel servers for join by using user defined sql and read it into datastage making use of the nodes in Datastage :D .
Thanks & Regards,
Subhani Basha Syed.
Subhani Basha Syed.
ray.wurlod wrote:Did you go to the end of the Internet?
So did VP Al Gore invent that, too?