Job appears not to have started after 60 secs

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Amos.Rosmarin
Premium Member
Posts: 385
Joined: Tue Oct 07, 2003 4:55 am

Job appears not to have started after 60 secs

Post by Amos.Rosmarin »

Hi,

I get this message when I run multiple processes:
JobName appears not to have started after 60 secs
I guess it has something to do with a lack of resources, and is related to the T30FILE/MFILES parameters... am I right?

My T30FILE is now 2000 and MFILES=500, and uvregen does not let me increase them any more. What is the related kernel parameter?
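For reference, this is roughly how I check and regenerate them (just a sketch assuming the standard $DSHOME layout; please double-check the paths and commands on your own install):

Code: Select all

# show the current settings in the config file
cd $DSHOME
egrep "^(T30FILE|MFILES)" uvconfig

# after editing uvconfig, the engine has to be stopped and regenerated
bin/uv -admin -stop
bin/uvregen                 # or: bin/uv -admin -regen
bin/uv -admin -start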

Can someone give me an idea?

Amos
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I think if it was T30FILE related you would get a different message. This, as far as I know, is just a sign of an overloaded machine bumping up against a hard-coded limit in the engine. You might want to define what 'multiple' means and describe your server's hardware.

See this thread for an example.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Amos.Rosmarin
Premium Member
Posts: 385
Joined: Tue Oct 07, 2003 4:55 am

Post by Amos.Rosmarin »

Thanks,

It is Solaris 2.9 with 16 GB and 8 CPUs.

I read the link you gave me and it's the same problem. I thought of raising the T30, but uvregen does not let me go higher than 500.

There were about 40 instances of the job that failed, plus some other jobs that use big (static) hash files and some PX jobs.

So in terms of DataStage there was a lot of work going on, but in terms of the machine it was working hard but not 100% loaded (about 80-90% idle and 2 GB of memory free).
The ulimit is 2000.

I guess it is a DS tuning issue. I ran shmtest and changed uvconfig according to the results, and brought MFILES and T30FILE to the maximum possible. Is there something to do with the kernel parameters?
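In case it helps, these are the kernel-side settings I have been looking at so far (a sketch only; the names are from our Solaris 9 box and the values are examples, so please confirm with your SA before touching /etc/system):

Code: Select all

# current per-process open file limit for this shell
ulimit -n

# current shared memory limits as the kernel reports them
sysdef | grep -i shm

# /etc/system entries (reboot required) -- example values only
set rlim_fd_cur=2048                   # default per-process open file limit
set rlim_fd_max=4096                   # hard per-process open file limit
set shmsys:shminfo_shmmax=4294967295   # max shared memory segment size
set semsys:seminfo_semmni=1024         # max number of semaphore sets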


Thanks,
Amos
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

How many jobs are you starting at the same time?
Mamu Kim
Amos.Rosmarin
Premium Member
Posts: 385
Joined: Tue Oct 07, 2003 4:55 am

Post by Amos.Rosmarin »

It's about 40 jobs.

Some are sequencers; for the ones that are executed in parallel I put a little sleep of 3 seconds between jobs.




Thanks,
Amos
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Amos.Rosmarin wrote:So in terms of DataStage there was a lot of work going on, but in terms of the machine it was working hard but not 100% loaded (about 80-90% idle and 2 GB of memory free)
I'm not sure I'd call 80-90% idle "working hard". :wink: Did you mean 10-20% idle?

You'll probably need to cut back on the number of jobs you run simultaneously; it doesn't sound like a 3-second sleep between launches is going to cut it. Maybe run portions of them in 'waves'?
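Something along these lines if you're launching from a shell script -- just a sketch, and the dsjob options are from memory, so check the syntax for your release (the project name and the wave1.txt list are made up):

Code: Select all

# run the 40 jobs in waves of 10 instead of all at once
PROJECT=MyProject                 # hypothetical project name
for JOB in `cat wave1.txt`        # wave1.txt lists the first 10 job names
do
   $DSHOME/bin/dsjob -run -jobstatus $PROJECT $JOB &
done
wait                              # let wave 1 finish before kicking off wave 2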
-craig

"You can never have too many knives" -- Logan Nine Fingers
Amos.Rosmarin
Premium Member
Posts: 385
Joined: Tue Oct 07, 2003 4:55 am

Post by Amos.Rosmarin »

Of course, you are right :oops: ... it's the opposite (staying till 23:00 at the office, which is the time in central Europe right now).


The problem is that I must have the data as fast as possible, and the jobs are very short; each is different and they cannot be joined.

Is there an upper limit for the T30?
For example, if I set the kernel's per-process open file limit to 2008, does T30 = 2000 make sense?

(still looking for the name of this kernel parameter)



Cheers,
Amos
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Yeesh, go home.
Amos.Rosmarin wrote:Is there an upper limit for the T30? For example, if I set the kernel's per-process open file limit to 2008, does T30 = 2000 make sense?
There's an upper limit to everything, I would think, but I'm afraid I don't know what that one is. Is it documented in the uvconfig file?
Then Amos.Rosmarin wrote:(still looking for the name of this kernel parameter)
You'll probably need to talk to your SA to find out for sure.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Amos.Rosmarin
Premium Member
Posts: 385
Joined: Tue Oct 07, 2003 4:55 am

Post by Amos.Rosmarin »

Oops again, I see now that I confused MFILES and T30...

I guess I'll go to sleep.


If anyone has some thoughts, I'll be happy to hear them.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You're on Solaris 2.9, so use

prstat -a

from a unix command line to monitor server processes and load. DS jobs show up as "phantom" processes. If you have 8 CPUs, then a process fully utilizing one CPU shows as 1/8, or about 13%. If the sum of user processes (the -a option shows a top-5 user summary at the bottom) approaches 100%, then your machine is HAMMERED. DS has issues with job control, and you need to talk to tech support about any patches that mitigate this issue.
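As a rough example (5-second samples; the per-user totals print at the bottom of each screen -- and the "phantom" grep only works if your engine processes show that in the command line, as ours do):

Code: Select all

# per-user CPU summary refreshed every 5 seconds
prstat -a 5

# rough count of engine job processes
ps -ef | grep phantom | grep -v grep | wc -l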
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Amos.Rosmarin
Premium Member
Posts: 385
Joined: Tue Oct 07, 2003 4:55 am

Post by Amos.Rosmarin »

Thanks Kenneth

The machine is not hammered; it's working hard but not 100% utilized.
It looks like a DS configuration issue.

Amos
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi Amos,
how do you start the jobs?
You did mention a 3-second wait?
And having 40 jobs?
Even starting all 40 in the same sequence job will not be instantaneous;
they will gradually all come up, but some later than others.

Can you explain a bit more about how you run all of them in parallel?

Another thing: you did mention multi-instance jobs?
Imagine 20 multi-instance jobs bashing the poor log simultaneously while a new instance comes up, not to mention if the log file has got big...

Come to think of it (lmao), we had this at one of our customers.
Their problem was big log files on multi-instance jobs.
Our solution was to purge the logs periodically (we log all important info to an ASCII log anyway), at a frequency that depends on how often you run the jobs, to make sure they stay small.
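Just to sketch what I mean (the commands are from memory and the job name/number are made up; setting auto-purge from the Administrator is the gentler option, and never clear a log while an instance of that job is running):

Code: Select all

# from the project directory, source the environment and open the engine shell
cd /path/to/Project        # hypothetical path to your project directory
. $DSHOME/dsenv
$DSHOME/bin/uvsh

# then, at the uvsh prompt (not the unix shell):
#   LIST DS_JOBS 'MyJob' JOBNO      <- find the job's number, e.g. 42
#   CLEAR.FILE RT_LOG42             <- clears that job's log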


IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I think Roy's solution makes sense. Also, changing the 3 seconds to 5 or 10 makes sense. I cannot imagine that these all have to run in as few seconds as possible. If so, then change your design.

If you are processing log files and need to do it as fast as possible, then you should rotate the log file and process the old one; that way you do not lose transactions. I assume that kind of situation is behind the need for speed here.

Describe why 40 processes need to run at the same time when your machine is incapable of doing this, especially when these jobs are small. There has got to be another solution available to you. Explain your options.
Mamu Kim
Luciana
Participant
Posts: 60
Joined: Fri Jun 10, 2005 7:22 am
Location: Brasil

Post by Luciana »

Code: Select all

Job control fatal error (-14)  
(DSRunJob) Job "Name" appears not to have started after 60 secs  
The -14 error happens when the server is overloaded at certain times. The 60-second timeout is not a parameter that can be altered.

There are some parameters in the uvconfig file that can be adjusted:

1. Stop the DataStage service using the command:
$DSHOME/bin/uv -admin -stop

2. Double the values of the parameters GLTABSZ, RLTABSZ and MAXRLOCK in the file $DSHOME/uvconfig.
E.g.: if the values are (75, 75, 74) respectively, set them to (150, 150, 149).

Note: MAXRLOCK cannot be greater than RLTABSZ - 1.

3. As user dsadm, execute:
$DSHOME/bin/uv -admin -regen

Note: If the command above does not execute successfully, change the values of the variables Nmemoff, Cmemoff, Pmemoff and Dmemoff in uvconfig to "0x0", and execute the command from step 3 again.

4. Restart DataStage using the command:
$DSHOME/bin/uv -admin -start
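Putting the steps above together, the whole sequence on the server looks roughly like this (run as dsadm; the doubled values are just the example from step 2):

Code: Select all

. $DSHOME/dsenv
$DSHOME/bin/uv -admin -stop      # 1. stop the engine

vi $DSHOME/uvconfig              # 2. double GLTABSZ, RLTABSZ, MAXRLOCK
                                 #    e.g. 75/75/74 -> 150/150/149
                                 #    (MAXRLOCK must stay <= RLTABSZ - 1)

$DSHOME/bin/uv -admin -regen     # 3. regenerate the shared memory segment

$DSHOME/bin/uv -admin -start     # 4. start the engine again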
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Hello Luciana,

you seem to have picked up a long-dead thread (from June) with your response. Also, I think the solution you proposed doesn't address the original problem. The 60-second value is hardcoded, and the suggested solutions were about making the job actually start quicker on a heavily loaded system.

Your solution will have a positive effect for systems that have group and record lock contention. There was no indication in the thread that this was the case, and changing these values is something to be done only when necessary and with care.