DSXchange

Posted: **Fri Nov 24, 2006 4:08 am**

Hi,

Lately we're getting a lot of errors on
wCleanupCBOAggregates..JobControl (@AGR_CBO_ORIGINATING): Controller
problem: Error calling DSRunJob(CleanupAgrCBO), code=-14
[Timed out while waiting for an event] .
Before anyone mentions this: I did a search and found that this is caused by an overloaded system. I've also found the post about ecase 70788 (a patch that raises the DSD.RUN timeout from 60 seconds to 600 seconds), which is of course a workaround, not a solution.

However, if I look at the load on our UNIX server, it is not at its limits when these errors occur (I checked the number of processes, CPU, memory and disk space), so it seems to be more of a DataStage overload than a server overload.

Does anyone have an idea what the deciding factor is here?
For example, is there a difference between:
- 50 jobs with 2 sequential stages being started together
- 2 jobs with 50 sequential stages being started together
- 2 jobs with 5 stages, each using 10 parallel processes

Knowing this would let us work out the best way to resolve it: do we mainly sequentialize (if that's a word?) the workflows so that fewer jobs start in parallel, do we split jobs into multiple smaller jobs, or do we decrease the parallelism inside the jobs?

Posted: **Fri Nov 24, 2006 4:17 am**

You should set the APT_DUMP_SCORE environment variable so you can see how many processes (PIDs) are actually started. The number depends on the node configuration in the file that APT_CONFIG_FILE points to, as well as on whether your database is partitioned and you use that functionality.

Increasing or decreasing the number of nodes in your configuration file makes a significant difference to the number of processes fired off by PX, and in many cases a 1-node configuration is more efficient (even on a system with many CPUs) than a configuration with 4 or more nodes.
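To give a feel for how quickly the node count multiplies processes, here is a rough sketch in Python. It assumes a simplified model of a parallel job: one conductor per job, one section leader per node, and one player per operator per node, with no operator combination and no stages forced to run sequentially. The function name and the model are illustrative assumptions only; the score written by APT_DUMP_SCORE is what tells you the real count.

```python
def estimate_px_processes(jobs: int, operators_per_job: int, nodes: int) -> int:
    """Rough estimate of OS processes for parallel jobs started together.

    Simplified model (an assumption, not DataStage's exact behaviour):
    per job = 1 conductor + 1 section leader per node
              + 1 player per operator per node.
    """
    if jobs < 0 or operators_per_job < 0 or nodes < 1:
        raise ValueError("jobs and operators must be >= 0, nodes must be >= 1")
    per_job = 1 + nodes + operators_per_job * nodes
    return jobs * per_job


if __name__ == "__main__":
    # Compare many small jobs against a few large jobs on 1 and 4 nodes.
    for nodes in (1, 4):
        for jobs, operators in ((50, 2), (2, 50)):
            total = estimate_px_processes(jobs, operators, nodes)
            print(f"{jobs} jobs x {operators} operators on {nodes} node(s): ~{total} processes")
```

Under this model, 50 small jobs cost more processes than 2 large ones with the same total number of operators, because every job carries its own conductor and section leaders, and going from 1 node to 4 roughly triples or quadruples the total in both cases.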

Posted: **Fri Nov 24, 2006 12:58 pm**

Use some of the other reporting environment variables to capture the process IDs of the player processes and their memory consumption, then relate these back to your UNIX system monitoring.
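One possible way to do that correlation, sketched under assumptions: once you have the player PIDs from the job log, you can match them against a snapshot of `ps -eo pid,vsz,comm` taken while the job runs. The helper name and the sample data below are hypothetical, and the sample is made up purely to show the shape of the input.

```python
def vsz_by_pid(ps_output: str, pids: set[int]) -> dict[int, int]:
    """Return {pid: vsz_kb} for the requested PIDs found in ps output.

    Expects text shaped like `ps -eo pid,vsz,comm` (PID, virtual size
    in KB, command). Header and malformed lines are skipped.
    """
    found: dict[int, int] = {}
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # header line or something we cannot parse
        pid = int(parts[0])
        if pid in pids:
            found[pid] = int(parts[1])
    return found


if __name__ == "__main__":
    # Made-up snapshot for illustration; real PIDs come from the job log.
    sample = """  PID    VSZ COMMAND
 4101  20480 osh
 4102  51200 osh
 4200   1024 sh
"""
    usage = vsz_by_pid(sample, {4101, 4102})
    print(usage, "total KB:", sum(usage.values()))
```

Taking such a snapshot at intervals and summing per job shows whether the timeouts coincide with a burst of player processes rather than with overall server load.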