Sequencer Aborts Due to Error Code = -14

dwuser · Post by **dwuser** » Fri Apr 20, 2007 9:18 am

We got the same error message, and many sequencers are failing daily.

We did the following and got some fruitful results:

1. Installed the patch provided by IBM to increase the timeout to 600.
After implementing this patch, the sequencer waited for the whole timeout period and then aborted, but the patch considerably reduced the number of Controller problem occurrences.

2. Changed the sequence of the job flow and minimized the number of jobs starting at any particular time. This worked fine, but we missed most SLAs. We need to look into this option and design the sequencers accordingly.

3. Reindexed all the projects before each daily run. I don't know whether this helps, but I assume it would speed up the internal repository before the run. Any comments on this?

When we monitored the server hosting DS (a 2GB Solaris box), it was not locked up with a large number of processes at that time.

ray.wurlod · Post by **ray.wurlod** » Fri Apr 20, 2007 3:41 pm

3) This will have no effect. The tables that govern run-time execution are not indexed, with the single exception of DS_JOBS, but its access uses the primary key (NAME) via its hashing algorithm, and therefore does not use any of the table's indexes.

As noted earlier, I believe your delays are primarily caused by the time required to materialize huge Oracle views from which you are selecting.

dwuser · Post by **dwuser** » Sat Apr 21, 2007 11:15 am

Ray,

Yes, we are using huge select queries in the source stage of the jobs.
Will that cause the timeout problem (code=-14)?

chulett · Post by **chulett** » Sat Apr 21, 2007 11:49 am

No. The timeout is specific to the job starting - not about how long it takes to start processing records once it has started.

kcbland · Post by **kcbland** » Sat Apr 21, 2007 3:08 pm

Ray is talking about the original poster's question. You have hijacked the original post with your problem.

Your answer is simple: On Solaris use "prstat -a" to monitor your server load and watch the CPU utilization CONSTANTLY. Leave it running all day long and watch your machine. You'll learn loads about how DataStage jobs interact with the environment. Watch what happens when your Sequence jobs start firing up other jobs. You'll probably BURY your machine, the CPUs will hit 100%. It's when your CPUs are so overutilized that timeout situations start happening.

You have three solutions: extend the timeout (no known method outside of a patch), reduce the number of simultaneously running processes consuming CPU resources (but occasionally you will still peak), or add more CPUs (easiest and most effective).

basanth · Post by **basanth** » Mon Jun 18, 2007 8:12 am

Hi dwuser,

We faced this issue intermittently too. We applied the patch IBM provided, but even that did not help. I then checked /var/adm/messages: there was an entry in the messages file whenever the sequence aborted with the -14 error. The server was trying to access an automount, which was failing. This was causing the sequence to abort with the -14 error (timed out waiting). We got that sorted out and have not seen any -14 errors since.
Please note: this might be a coincidence, not the solution. It is worth checking, because I did not get the errors after that.
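
A minimal sketch of that check, assuming the messages file is readable by your user and that the relevant entries contain the word "automount" (the keyword and the function names are illustrative assumptions, not taken from the actual log). It only lists candidate lines; comparing their timestamps with the sequence abort times is still a manual step.

```python
from typing import Iterable, List, Sequence


def find_matching_lines(lines: Iterable[str],
                        keywords: Sequence[str] = ("automount",)) -> List[str]:
    """Return the lines that mention any keyword, ignoring case."""
    wanted = [k.lower() for k in keywords]
    return [line.rstrip("\n") for line in lines
            if any(k in line.lower() for k in wanted)]


def scan_log(path: str = "/var/adm/messages",
             keywords: Sequence[str] = ("automount",)) -> List[str]:
    """Read a log file and return the lines that mention any keyword."""
    # errors="replace" keeps the scan going if the log holds odd bytes
    with open(path, errors="replace") as handle:
        return find_matching_lines(handle, keywords)
```

Running `scan_log()` on the DataStage server around the time of a -14 abort would show whether anything similar is being logged there.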

Regards
Basanth

ray.wurlod · Post by **ray.wurlod** » Mon Jun 18, 2007 2:39 pm

Welcome aboard. :D

That's a useful tip - we usually assume everything we need is ready and waiting. Did not the UNIX Administrator notice that not all the mounts had succeeded?

basanth · Post by **basanth** » Tue Jun 19, 2007 1:46 am

Hello Ray,

After we had been checking these errors for a long time, we consulted the UNIX admin, who in turn routed us to the network admin.
The network admin suggested that this problem was evident mainly during the nightly backups, which cause unicast flooding on all the ports of the server.

So it might be that during this time the server was waiting for the automount to become ready. The wait was normally less than 2 minutes.

Regards
Basanth

fridge · Post by **fridge** » Fri Jun 22, 2007 5:13 am

Just an additional angle on this: we experienced these issues on a large implementation of DataStage.

We managed to mitigate a lot of these timeouts by tuning some of the values in the uvconfig file (in $DSHOME).

Basically, these govern the number of internal UniVerse files that can be open at one time, not forgetting that each job has at least three (RT_LOGxxx, RT_CONFIGxxx and RT_STATUSxxx). When the limit is reached, the engine has to close some handles to open new ones, and this process can make the invocation of a job take longer.
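
As a rough illustration of the arithmetic only: the sketch below counts the three per-job files named above and nothing else, so real usage will be higher (hashed files and other engine files are ignored). The function names and the numbers in the usage note are made up for the example and are not recommended or default settings.

```python
FILES_PER_JOB = 3  # RT_LOGxxx, RT_CONFIGxxx, RT_STATUSxxx (the stated minimum)


def min_open_files(concurrent_jobs: int) -> int:
    """Lower bound on engine files held open by this many running jobs."""
    return concurrent_jobs * FILES_PER_JOB


def jobs_before_recycling(handle_limit: int) -> int:
    """Most jobs that fit under the limit before handles must be closed
    and reopened, counting only the three per-job files."""
    return handle_limit // FILES_PER_JOB
```

For example, 40 concurrent jobs need at least `min_open_files(40)` = 120 open files, while a hypothetical limit of 100 handles would cover at most `jobs_before_recycling(100)` = 33 jobs before the engine starts recycling handles.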

The default settings are quite conservative for an out-of-the-box installation, and upping some of them can allow more concurrent jobs to be invoked.

You will notice I am not telling you which settings to change. This isn't because I am unhelpful (though it has been said), but because it can cause problems if configured wrong. So I would suggest you speak to your support provider for guidance; alternatively, do a search here for MFILES and you may find some tips.

Hope this helps