How do I fix a performance bottleneck at the Sort stage?

Abhi700 · Post by **Abhi700** » Fri Feb 04, 2011 11:54 am

Hi,

One of my jobs has a serious bottleneck, and any help in resolving it would be highly appreciated.

I am using the following job design:

Dataset --> joined to a DB2 table using Join stage --> Filter stage --> Sort stage --> Remove Duplicates stage.

From the Join stage to the Filter stage, processing is fine: the job is able to process more than 100K rows per second. However, at the Sort and Remove Duplicates stages this number drops to 2500 rows per second.

Please let me know if there is anything I can do to improve processing speed in this job.

Thanks,

IBM Analytics Champion 2009 - 2020 · Post by **asorrell** » Fri Feb 04, 2011 12:14 pm

Can you post your APT configuration file here, please? Just wondering what you have set up for scratch, etc.

ray.wurlod · Post by **ray.wurlod** » Fri Feb 04, 2011 1:43 pm

The rows/sec figure is meaningless.

Think about the operation of a sort. It can't output any rows till all its input rows have arrived. However, the clock starts running as soon as the job starts. Therefore the actual rows/sec out of the Sort stage - indeed out of any blocking stage - will be substantially under-reported.
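
To make the arithmetic concrete, here is a minimal Python sketch of the effect described above. The timings are hypothetical and were chosen only so that the result echoes the figures quoted in this thread; the function names are illustrative and are not part of DataStage.

```python
def reported_rows_per_sec(rows: int, seconds_since_job_start: float) -> float:
    """Rate as a monitor shows it when the clock starts at job start."""
    return rows / seconds_since_job_start


def emit_rows_per_sec(rows: int, emit_seconds: float) -> float:
    """Rate measured only over the time the stage was actually writing rows."""
    return rows / emit_seconds


# Hypothetical figures, chosen only to illustrate the effect.
ROWS = 1_000_000
WAIT_FOR_INPUT_SECONDS = 390.0  # the sort waits for all upstream rows
EMIT_SECONDS = 10.0             # the sort then writes its output

reported = reported_rows_per_sec(ROWS, WAIT_FOR_INPUT_SECONDS + EMIT_SECONDS)
actual = emit_rows_per_sec(ROWS, EMIT_SECONDS)
print(f"reported: {reported:,.0f} rows/sec, while emitting: {actual:,.0f} rows/sec")
```

Under these assumed timings the same stage shows 2,500 rows/sec on the monitor while it actually writes at 100,000 rows/sec once it starts emitting.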

mavrick21 · Post by **mavrick21** » Fri Feb 04, 2011 4:33 pm

Ray,

Is the rows/sec figure meaningless only for links out of a blocking stage, or for all links in a job?

Thanks

ray.wurlod · Post by **ray.wurlod** » Fri Feb 04, 2011 4:57 pm

Depends what you want to use it for. As a general rule I'd say it's meaningless everywhere (with only a very few exceptions), since the clock is always running, even during startup, while waiting for I/O to return, and so on. Also, row sizes vary, which is another factor militating against rows/sec being a particularly useful metric.

chulett · Post by **chulett** » Fri Feb 04, 2011 7:02 pm

If you've been here for any length of time, you'd know Ray considers rows/second to be a particularly useless metric.

mavrick21 · Post by **mavrick21** » Fri Feb 04, 2011 7:41 pm

I know. He said the same when I attended his training a few years back.

jwiles · Post by **jwiles** » Fri Feb 04, 2011 11:21 pm

Rows/sec is extremely useless if it's treated as if it's the instantaneous value out of that stage, which unfortunately many do. About the only place it may approach being somewhat useful is for the final link in a stream's path, and that's a stretch.

Does the sort use the same primary key column as the join? If so, you may be able to take advantage of that (Don't Sort (Previously Sorted)), depending on how the data is partitioned for the join.

suman27 · Post by **suman27** » Sat Feb 05, 2011 2:57 pm

Hi Abhilash,

You can remove the duplicates in the Sort stage itself if you are using the same key for removing duplicates.
Increase the sort memory size in the Sort stage and check whether it makes any difference in performance.

Regards,
Suman.
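
As a rough illustration of why deduplicating inside the sort can save a stage, the Python sketch below sorts on a key and keeps one row per key group in the same step. It only shows the general idea and is not how DataStage implements it; the record layout and the `sort_and_dedupe` helper are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sort_and_dedupe(rows, key):
    """Sort rows by `key` and keep the first row of each key group."""
    ordered = sorted(rows, key=itemgetter(key))  # sorted() is stable
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


# Hypothetical records; "id" stands in for the shared sort/dedupe key.
rows = [
    {"id": 2, "src": "a"},
    {"id": 1, "src": "b"},
    {"id": 2, "src": "c"},
]
print(sort_and_dedupe(rows, "id"))
```

This shortcut only applies when the duplicate-removal key is the same as the sort key, since the grouping relies on equal keys being adjacent after the sort.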

ThilSe · Post by **ThilSe** » Sun Feb 06, 2011 3:22 am

What is your target?

Abhi700 · Post by **Abhi700** » Sun Feb 06, 2011 12:13 pm

> You can remove the duplicates in the Sort stage itself if you are using the same key for removing duplicates.
> Increase the sort memory size in the Sort stage and check whether it makes any difference in performance.

We are removing duplicates on different keys.
My target is a DB2 table.
The rate is 100,000 rows/sec.

ThilSe · Post by **ThilSe** » Sun Feb 06, 2011 2:14 pm

What is the volume of records in the input dataset? Does the key used for partitioning distribute the records reasonably evenly, so that it doesn't create a bottleneck or make the flow effectively sequential?

Regards
Senthil

jhmckeever · Post by **jhmckeever** » Sun Feb 06, 2011 11:22 pm

Some questions:
- Is the stage running in parallel?
- Is partitioning resulting in a relatively even balance of rows across partitions?
- You're not performing an in-line 'pre-sort' on the input link, are you? (I've seen this in many places)
- Have you considered playing with memory usage? ($APT_TSORT_STRESS_BLOCKSIZE)
- Is your sort utility set to "DataStage"?
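
For the partition-balance question, a quick offline check on a sample of key values can show whether one partition would receive most of the rows. The Python sketch below is a minimal example that assumes a simple hash-modulo assignment using `zlib.crc32`. This is not DataStage's actual hash partitioner, and the key values are made up, so treat the result as an indication of skew in the key itself rather than a prediction of the real row counts.

```python
import zlib
from collections import Counter


def partition_counts(keys, n_partitions):
    """Count rows per partition under a simple hash-modulo assignment."""
    counts = Counter(
        zlib.crc32(str(k).encode("utf-8")) % n_partitions for k in keys
    )
    return [counts.get(p, 0) for p in range(n_partitions)]


def skew_ratio(counts):
    """Largest partition divided by the mean partition size (1.0 = even)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0


# Hypothetical key columns: one with many distinct values,
# one dominated by a single repeated value.
well_spread = [f"CUST{i:06d}" for i in range(10_000)]
dominated = ["UNKNOWN"] * 9_000 + [f"CUST{i:06d}" for i in range(1_000)]

for name, keys in (("well_spread", well_spread), ("dominated", dominated)):
    counts = partition_counts(keys, n_partitions=4)
    print(name, counts, f"skew={skew_ratio(counts):.2f}")
```

A skew ratio close to the number of partitions means nearly all rows land in one partition, which makes the sort behave almost as if it were running sequentially.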