As indicated by Craig's Knowledge Center link, lowering the pd_npages setting can help real-time applications that delete files. Here is what the command help says about the same setting:
Code:
# ioo -h pd_npages
Help for tunable pd_npages:
Purpose:
Specifies the number of pages that should be deleted in one chunk from RAM when a file is deleted.
Values:
Default: 4096
Range: 1 - 524288
Type: Dynamic
Unit: 4KB pages
Tuning:
The maximum value indicates the largest file size, in pages. Tuning this option is only useful for real-time applications that experience sluggish response time while files are being deleted. If real-time response is critical, adjusting this option may improve response time by spreading the removal of file pages from RAM more evenly over a workload.
All of our systems (v11.3.x and v11.5.0.2+SP2) are running with the default value of 4096. We run a mix of mostly batch DataStage jobs along with a small number of real-time DataStage jobs using ISD, none of which involve deleting files. I would not know how much difference this particular setting makes without changing it and testing a typical workload; since it is a dynamic tunable, no reboot should be needed. We have never had to tweak this setting.
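For anyone curious, the tunable being dynamic means it can be tried and backed out on a live system. A minimal sketch; the lowered value below is just an example, not a recommendation:
Code:
# Display the current value
ioo -o pd_npages
# Lower it for the current boot only (takes effect immediately)
ioo -o pd_npages=1024
# Reset it back to the default
ioo -d pd_npages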
Here are a number of other ideas:
Compare the lparstat -i output across servers. Does it look as expected with regard to memory and CPU allocations? Can you share the output (minus the server names)?
What are your LPAR priority values set to (Desired Variable Capacity Weight) and do they match across LPARs?
Are you sharing a physical server where any of the other LPARs might be overloaded or allowed to use the default shared CPU pool (all of the cores)?
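All three of the lparstat questions above can be eyeballed from a single capture per server. A rough sketch, with placeholder host names:
Code:
# Pull the allocation lines that matter for the comparison;
# serverA/serverB are placeholders for your real host names.
for h in serverA serverB; do
  ssh $h lparstat -i | egrep "Entitled Capacity|Online Virtual CPUs|Online Memory|Mode|Variable Capacity Weight|Shared Pool ID" > /tmp/lpar.$h
done
diff /tmp/lpar.serverA /tmp/lpar.serverB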
Could you run a workload that doesn't involve any disk I/O and do some comparisons across servers? Something like a Row Generator stage, set to run in parallel (the default is sequential), feeding a Transformer that does some mathematical functions... If that runs well, then run another test job that does local disk I/O but does not touch Oracle, and so on.
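If you want an even cruder cross-check before building test jobs, timing a fixed CPU burn and a fixed local write on each server will show whether raw CPU and local disk track each other. Nothing DataStage-specific is assumed here:
Code:
# CPU only: a fixed amount of arithmetic, no disk and no Oracle
time perl -e '$x += sqrt($_) for (1..10_000_000); print "$x\n"'
# Local disk only: write 1 GB to a local filesystem, then clean up
time dd if=/dev/zero of=/tmp/ddtest bs=1048576 count=1024
rm /tmp/ddtest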
For what it may be worth, here is how we have /etc/security/limits set, which is a bit different from yours:
Code:
default:
stack_hard = -1
fsize = -1
core = -1
cpu = -1
data = -1
rss = -1
stack = -1
nofiles = -1
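To confirm what a login actually inherits from that file, check the user attributes and a fresh shell; dsadm below is just a placeholder for whatever your engine owner is:
Code:
# Effective limit attributes for the DataStage user ("dsadm" is a placeholder)
lsuser -a fsize core cpu data rss stack nofiles dsadm
# Cross-check from a fresh login shell
su - dsadm -c "ulimit -a"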
It may not be practical, but have you considered a workaround such as a daily reboot, at least on the engine tier?
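If you did go that route, the safe shape would be a root cron entry that stops the engine cleanly first. Only a sketch; the time and install path are placeholders, and I would test it well away from production:
Code:
# Nightly at 02:00: stop the engine cleanly, then fast reboot.
# The DSEngine path and the schedule are placeholders.
0 2 * * * cd /opt/IBM/InformationServer/Server/DSEngine && bin/uv -admin -stop && /usr/sbin/shutdown -Fr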
Is AIX process accounting active or enabled? We found that everything ran better when it was disabled.
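An easy check is whether the pacct file exists and keeps growing; turnacct is the stock way to switch it off:
Code:
# If /var/adm/pacct exists and keeps growing, accounting is active
ls -l /var/adm/pacct
# Switch it off (run as root), and remove any rc/cron entry that restarts it
/usr/sbin/acct/turnacct off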
Since you have reconfigured your disks at least once or twice, have you gone back, changed GLTABSZ from its default of 75 to 300, and run uvregen? Several tech notes suggest increasing the RLTABSZ/GLTABSZ/MAXRLOCK values to 300/300/299, especially when multi-instance jobs are used; a sketch of the procedure follows.
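In case it is useful, the shape of that change on the engine tier is roughly the following; verify the exact steps and values against the tech notes for your version:
Code:
# From $DSHOME, with all DataStage jobs and clients stopped:
cd $DSHOME
bin/uv -admin -stop
# Edit uvconfig: set RLTABSZ 300, GLTABSZ 300, MAXRLOCK 299
vi uvconfig
bin/uvregen            # regenerate the engine configuration
bin/uv -admin -start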
Just to confirm: is the performance problem limited to parallel job startup time only, not the actual run time after startup, and is it not affecting sequence jobs or server jobs?