Sequencer Aborts Due to Error Code = -14

dwuser
Premium Member
Posts: 13
Joined: Thu Apr 12, 2007 11:27 am
Location: Sunnyvale

Post by dwuser »

We got the same error message, and many sequencers are failing daily...

We did the following to get some fruitful results:

1. Installed the patch provided by IBM to increase the timeout to 600. After implementing this patch the sequencer waited for the whole timeout period and then aborted, but it considerably reduced the number of Controller problem occurrences.

2. Changed the sequence of the job flow and minimized the number of jobs starting at any particular time. This worked fine, but we missed most SLAs. We need to look into this option and design the sequencers accordingly.

3. Reindexed all the projects before each day's run. I don't know whether this helps, but I assume it would speed up the internal repository before the run. Any comments on this?

When we monitored the server hosting DS (a 2GB Solaris box), it was not tied up with that many processes at that time.
Bharathi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

3) This will have no effect. The tables that govern run-time execution are not indexed, with the single exception of DS_JOBS, but its access uses the primary key (NAME) via its hashing algorithm, and therefore does not use any of the table's indexes.

As noted earlier, I believe your delays are primarily caused by the time required to materialize huge Oracle views from which you are selecting.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dwuser
Premium Member
Posts: 13
Joined: Thu Apr 12, 2007 11:27 am
Location: Sunnyvale

Post by dwuser »

Ray,

Yes, we are using huge select queries in the source stage of the jobs.
Will that cause the timeout problem (code=-14)?
Bharathi
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

No. The timeout is specific to the job starting - not about how long it takes to start processing records once it has started.
-craig

"You can never have too many knives" -- Logan Nine Fingers
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

dwuser wrote:Ray,

Yes we are using huge select queries in the source stage of the jobs.
Will that cause the timeout problem. (code=-14)
Ray is talking about the original poster's question. You have hijacked the original post with your problem.

Your answer is simple: On Solaris use "prstat -a" to monitor your server load and watch the CPU utilization CONSTANTLY. Leave it running all day long and watch your machine. You'll learn loads about how DataStage jobs interact with the environment. Watch what happens when your Sequence jobs start firing up other jobs. You'll probably BURY your machine, the CPUs will hit 100%. It's when your CPUs are so overutilized that timeout situations start happening.

You have three solutions: extend the timeout (no known method outside of a patch), reduce the number of simultaneously running processes consuming CPU resources (but occasionally you will still peak), or add more CPUs (easiest and most effective).
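For the second option, here is a minimal Python sketch of a throttle that never lets more than a fixed number of jobs run at once. It is not how a DataStage Sequence works internally; `run_throttled`, `launch` and `max_concurrent` are made-up names for illustration, and the function that actually starts a job is left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def run_throttled(jobs, launch, max_concurrent=2):
    """Call launch(job) for every job, at most max_concurrent at a time.

    `launch` is supplied by the caller (hypothetical). It could, for
    example, shell out to the dsjob command line, but that is an
    assumption and is not shown here.
    Results come back in the same order as `jobs`.
    """
    # The pool size is the throttle: a job only starts when a worker is free.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(launch, jobs))
```

The cap trades elapsed time for lower CPU peaks, so the value would need to be tuned against the batch window rather than simply set as low as possible.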
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
basanth
Participant
Posts: 5
Joined: Fri Sep 08, 2006 7:38 am
Location: London

Post by basanth »

Hi dwuser,

We faced this issue intermittently too, and applying the patch IBM provided did not help. I checked /var/adm/messages: there was an entry in the messages file whenever the sequence aborted with the -14 error. The server was trying to access an automount, which was failing, and this caused the sequence to abort with the -14 error (timeout waiting). We got that sorted out and have not seen any -14 errors since.
Please note: this might be a coincidence, not the solution. It is worth checking, because I have not had the errors since.
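If you want to repeat that check, a rough Python sketch like the one below pulls out the lines of a log file that mention a keyword, so they can be compared by eye against the abort times in the sequence log. The function names are made up for illustration, and nothing here assumes a particular layout for the messages file.

```python
def lines_mentioning(lines, keyword):
    """Return the lines that contain keyword, ignoring case."""
    needle = keyword.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


def scan_file(path, keyword):
    """Apply lines_mentioning to a file, e.g. /var/adm/messages.

    Reading that file usually needs suitable permissions (assumption).
    """
    with open(path, errors="replace") as handle:
        return lines_mentioning(handle, keyword)
```

Calling `scan_file("/var/adm/messages", "automount")` would list the candidate entries; whether they line up with the -14 aborts still has to be judged manually.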

Regards
Basanth

Basanth
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Welcome aboard. :D

That's a useful tip - we usually assume everything we need is ready and waiting. Did not the UNIX Administrator notice that not all the mounts had succeeded?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
basanth
Participant
Posts: 5
Joined: Fri Sep 08, 2006 7:38 am
Location: London

Post by basanth »

Hello Ray,

After we had checked these errors for a long time, we consulted the UNIX admin, who in turn routed us to the network admin.
The network admin suggested that this problem was evident mainly during the nightly backups, which cause unicast flooding on all the ports of the server.

So it might be that during this time the server was waiting for the automount to become ready. It normally took less than 2 minutes.

Regards
Basanth


Basanth
fridge
Premium Member
Posts: 136
Joined: Sat Jan 10, 2004 8:51 am

Post by fridge »

Just an additional angle on this: we experienced these issues on a large implementation of DataStage.

We managed to mitigate a lot of these timeouts by tuning some of the values in the uvconfig file (in $DSHOME).

basically these govern the number of internal universe files that can be open at one time, not forgetting that each job has at least three (RT_LOGxxx, RT_CONFIGxxx and RT_STATUSxxx) - when the limit is reached the engine has to close some handles to open up new ones - and this process can make the invocation of a job take longer.

The actual settings of these are quite conservative for an out-of-the-box installation, and upping some of them can allow more concurrent jobs to be invoked.

You will notice I am not telling you which settings to change. This isn't because I am unhelpful (though it has been said), but because it can cause problems if configured wrongly, so I would suggest you speak to your support provider for guidance. Alternatively, do a search here for MFILES and you may find some tips.
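To put rough numbers on the "at least three files per job" point, here is a small Python sketch of the arithmetic. The limits in the usage note are purely illustrative and are not defaults or recommended uvconfig values; the function names are invented for this example.

```python
# Minimum files per job named in the post: RT_LOGxxx, RT_CONFIGxxx, RT_STATUSxxx.
FILES_PER_JOB = 3


def min_open_files(concurrent_jobs, files_per_job=FILES_PER_JOB):
    """Lower bound on engine files held open by concurrently running jobs."""
    return concurrent_jobs * files_per_job


def exceeds_limit(concurrent_jobs, open_file_limit, files_per_job=FILES_PER_JOB):
    """True if the lower bound is above the configured open-file limit.

    open_file_limit is whatever your uvconfig currently allows (assumption:
    you look this value up yourself; it is not read from the file here).
    """
    return min_open_files(concurrent_jobs, files_per_job) > open_file_limit
```

For example, 40 concurrent jobs need at least 120 open files, so `exceeds_limit(40, 100)` is true while `exceeds_limit(40, 150)` is false. Real jobs may open more than three files each, so treat the result as a floor, not an estimate.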

Hope this helps