How do I fix performance bottleneck at Sort stage?
Moderators: chulett, rschirm, roy
Hi,
One of my jobs has a serious bottleneck, and any help in resolving it would be highly appreciated.
I am using the following job design:
Dataset --> joined to a DB2 table using a Join stage --> Filter stage --> Sort stage --> Remove Duplicates stage.
From the Join stage to the Filter stage, processing is fine: the job is able to process more than 100K rows per second. However, at the Sort and Remove Duplicates stages this number drops to 2,500 rows per second.
Please let me know if there is anything I can do to improve processing speed in this job.
Thanks,
ABHILASH
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The rows/sec figure is meaningless.
Think about operation of a sort. It can't output any rows till all its input rows have arrived. However the clock starts running immediately the job starts. Therefore the actual rows/sec out of the Sort stage - indeed out of any blocking stage - will be substantially under-reported.
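To put rough numbers on this, here is a minimal sketch of the arithmetic. It assumes the monitor reports rows divided by elapsed time since job start, as described above; the function name, parameters, and figures are hypothetical and not taken from DataStage.

```python
def reported_rows_per_sec(total_rows, input_rate, output_rate, startup_secs=0.0):
    """Average rows/sec shown for a blocking stage's output link.

    Assumption: the displayed figure is total rows divided by the time
    elapsed since the job started, not since the stage began emitting.
    """
    wait_for_input = total_rows / input_rate   # sort cannot emit until all rows arrive
    emit_time = total_rows / output_rate       # time spent actually writing rows out
    elapsed = startup_secs + wait_for_input + emit_time
    return total_rows / elapsed


# Hypothetical job: 1,000,000 rows arrive at 100,000 rows/sec and the sort
# then emits them at the same 100,000 rows/sec.
print(reported_rows_per_sec(1_000_000, 100_000, 100_000))  # 50000.0
```

Even though the stage emits at 100,000 rows/sec once it starts, the reported figure is half that, because the clock was already running while the input was being collected.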
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Depends what you want to use it for. As a general rule I'd say it's meaningless everywhere (with only a very few exceptions), since the clock is always running, even during startup, while waiting for I/O to return, and so on. Also, row sizes vary, another factor militating against rows/sec being a particularly useful metric.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Rows/sec is extremely useless if it's treated as if it's the instantaneous value out of that stage, which unfortunately many do. About the only place it may approach being somewhat useful is for the final link in a stream's path, and that's a stretch.
Does the sort use the same primary key column as the join? If so, you may be able to take advantage of that (Don't Sort...Previously Sorted), depending on how the data's partitioned for the join.
- james wiles
All generalizations are false, including this one - Mark Twain.
We are performing remove duplicates on different keys. You can remove the duplicates in the Sort stage itself if you are using the same key for removing duplicates.
Increase the sort memory size in the Sort stage and check whether it makes any difference in performance.
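As a language-neutral illustration of why the separate de-duplication step is redundant when the keys match, here is a minimal Python sketch. It is not DataStage code, and the function and field names are hypothetical. Once rows are sorted on the key, duplicates sit next to each other and can be dropped in the same pass.

```python
from itertools import groupby
from operator import itemgetter


def sort_and_dedupe(rows, key_fields):
    """Sort rows on key_fields and keep the first row of each key group."""
    key = itemgetter(*key_fields)
    ordered = sorted(rows, key=key)  # stable sort: input order kept within a key
    # After sorting, equal keys are adjacent, so one pass removes duplicates.
    return [next(group) for _, group in groupby(ordered, key=key)]


rows = [
    {"id": 2, "val": "a"},
    {"id": 1, "val": "b"},
    {"id": 2, "val": "c"},
]
print(sort_and_dedupe(rows, ["id"]))
```

If the de-duplication key differs from the sort key, this shortcut does not apply, because equal de-duplication keys are no longer guaranteed to be adjacent.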
My target is a DB2 table.
The number of rows per second is 100,000.
ABHILASH
-
- Premium Member
- Posts: 301
- Joined: Thu Jul 14, 2005 10:27 am
- Location: Melbourne, Australia
- Contact:
Some questions:
- The stage is parallel?
- Partitioning is resulting in relatively even balance of rows across partitions?
- You're not performing an in-line 'pre-sort' on the input link are you? (I've seen this in many places)
- Have you considered playing with memory usage? ($APT_TSORT_STRESS_BLOCKSIZE)
- Your sort utility is "DataStage"?
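For the partition-balance question, one rough way to quantify skew is to compare the largest partition with the average. This is a minimal sketch with a hypothetical function name and made-up row counts. It assumes you have already read the per-partition row counts from the job monitor or log.

```python
def partition_skew(row_counts):
    """Ratio of the largest partition to the mean partition size.

    1.0 means perfectly balanced. Larger values mean one partition is
    doing disproportionately more of the sort work.
    """
    mean = sum(row_counts) / len(row_counts)
    return max(row_counts) / mean


# Hypothetical per-partition row counts on a 4-node configuration.
print(partition_skew([250, 250, 250, 250]))  # balanced
print(partition_skew([700, 100, 100, 100]))  # heavily skewed
```

A parallel sort finishes only when its slowest partition finishes, so a high ratio here can explain a slow sort even when total row volume is modest.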
John McKeever
Data Migrators
MettleCI - DevOps for DataStage (https://www.mettleci.com)