
"Unable to lock RT_CONFIG2660 file"

Posted: Mon Dec 05, 2005 6:33 am
by rsaliah
Hi Gents,

I have a multi-instance batch that's called 9 times, which in turn runs a bunch of multi-instance jobs. All is working as expected/required, but occasionally I get an aborted batch instance complaining of "(fatal error from DSRunJob): Job control fatal error (-14) (DSRunJob) Job JobName.Inst1 appears not to have started after 60 secs".

Since it happens randomly and on different jobs I've assumed that it's down to the server being over-utilised at that particular time. I haven't checked to confirm the assumption, that's my next step, but what I've also noticed is that in some cases the job that fails to start has a log info entry of "Unable to lock RT_CONFIG2660 file". What seems odd to me is that the log has no other entries and doesn't show that any attempt has been made to start it, yet the timing of the entry corresponds to the 60-second timeout.

What I was hoping is that someone can tell me whether this problem could be down to system/kernel parameters before I try monitoring server activity.

Thanks,
Regu.

Re: "Unable to lock RT_CONFIG2660 file"

Posted: Mon Dec 05, 2005 8:22 am
by chulett
rsaliah wrote: Since it happens randomly and on different jobs I've assumed that it's down to the server being over-utilised at that particular time.
Yes, that's exactly what that means. Especially when you say you have a multi-instance batch that kicks off "a bunch" of multi-instance jobs.

You could take the time to verify that your kernel parameters are OK per the Installation Guide; that's always a good thing. More than likely, this will need to be solved by adjusting values in the uvconfig file. Search the forum for things like T30FILES to get an idea of what could be the issue.
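
If it helps, the current values can be checked from the shell along these lines (a rough sketch only, assuming the usual $DSHOME engine directory and a dsenv file that sets up the environment; adjust paths to suit your install):

Code:

. $DSHOME/dsenv                                    # pick up the DataStage engine environment
grep -E "MFILES|T30FILE" $DSHOME/uvconfig          # values on disk
$DSHOME/bin/smat -t | grep -E "MFILES|T30FILE"     # values the engine is actually using

# If a value needs raising, edit uvconfig and then, with nothing running:
#   $DSHOME/bin/uvregen                            # regenerate the tunables
#   $DSHOME/bin/uv -admin -stop
#   $DSHOME/bin/uv -admin -start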

Re: "Unable to lock RT_CONFIG2660 file"

Posted: Mon Dec 05, 2005 8:55 am
by rsaliah
OK - I've checked the server as the jobs were running, and at the time the error occurred the CPU was approximately 45% idle.

On the project in question there were 10 jobs already running at the time. These 10 and the one that failed are very simple in design and have no hashed file stages or routine calls. They source from UniData and target sequential file and OCI stages.

The uvconfig file parameters appear to be more than sufficient for the processing.

Code:

MFILES 450
T30FILE 450
ulimit -a for the user running the jobs shows:

Code:

time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        8192
coredump(blocks)     unlimited
nofiles(descriptors) 1024
vmemory(kbytes)      unlimited
To me it looks like the error shouldn't be happening, so I'm running out of ideas.

I think the key to the answer might be the info message from my earlier post
Unable to lock RT_CONFIG2660 file
but apart from dodgy settings in the uvconfig I'm not sure how else this could occur.

Any suggestion would be appreciated.

Regu.

Posted: Mon Dec 05, 2005 1:28 pm
by ray.wurlod
The problem's with locks, not with sizes of anything.

Check whether RT_CONFIG2660 is already locked using the list_readu command.

Restart DataStage when there is nothing happening. This will guarantee that all (memory-based) locks are cleared. Then try the job again, and let us know the outcome.
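
In other words, something along these lines (a sketch only, assuming the default $DSHOME location, and that nothing else is running when you bounce the engine):

Code:

. $DSHOME/dsenv               # DataStage engine environment
$DSHOME/bin/list_readu        # scan the output for locks held when nothing should be running

# With no jobs active, restart the engine to clear the memory-based locks:
$DSHOME/bin/uv -admin -stop
$DSHOME/bin/uv -admin -start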

Posted: Tue Dec 06, 2005 4:21 am
by rsaliah
Thanks Ray,

I couldn't get the command to work but I did check for locks using DS.TOOLS and couldn't see anything before I ran the process.

The process calls a multi-instance job 9 times and it's one of these instances that fails to start after 60 seconds and shows the lock message. It affects a different job each time and is occasionally successful. If I rerun the instance immediately after the failure then it works.

There must be something locking so I'll keep digging.

Regu.

Posted: Tue Dec 06, 2005 1:28 pm
by ray.wurlod
All instances will need to take short-lived locks on the same RT_CONFIG file during startup. See if you can spread the startup requests by a small amount, say five seconds apart. SLEEP 5 will do it for you.
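
If the nine invocations were being started from a shell wrapper rather than from BASIC job control, the same stagger could be sketched like this (purely illustrative; "MyProject" and "JobName" are placeholders, and inside a batch job a SLEEP 5 between the DSRunJob calls achieves the same thing):

Code:

. $DSHOME/dsenv
for i in 1 2 3 4 5 6 7 8 9
do
    # request the next invocation, then pause so the RT_CONFIG lock requests don't collide
    $DSHOME/bin/dsjob -run -mode NORMAL MyProject JobName.Inst$i
    sleep 5
done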

Posted: Thu Dec 08, 2005 6:42 am
by rsaliah
Thanks Ray,

I tried your suggestion and it still failed with the "not started after 60 seconds" problem; only this time I didn't get the "Unable to lock RT_CONFIG2660 file" message in the job being called.

Although I can't yet prove it, the only possible cause has to be server/network load. The process isn't particularly CPU intensive but it does utilise the network quite heavily. So we delayed the start of part of the processing, which last night appeared to solve the problem. Unfortunately we're not the only users of the server or DS installation, so it could be that it was just a quiet time and we were lucky with our timing.

Thanks for your help.

Regu.

Posted: Thu Dec 08, 2005 3:49 pm
by stan_taylor
What are your settings for the following uvconfig parameters?
  • RLTABSIZE
  • GLTABSIZE
  • MAXRLOCK
We had a similar problem a while back and basically the job startup would time out due to constraints on the message queues. We were instructed to try the following:
  • RLTABSIZE 150
  • GLTABSIZE 150
  • MAXRLOCK 149
You may want to give that a try. These values were for Solaris - I understand the same message queue issue can arise on HP, but I don't know what the recommended values would be for that platform.
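
A quick way to confirm what the engine is currently using for those (a sketch assuming the default $DSHOME; depending on the release the names may appear in uvconfig under shortened spellings such as RLTABSZ/GLTABSZ):

Code:

$DSHOME/bin/smat -t | grep -E "RLTAB|GLTAB|MAXRLOCK"   # values currently in effect
grep -E "RLTAB|GLTAB|MAXRLOCK" $DSHOME/uvconfig        # values on disk
# as with any uvconfig change: edit the file, run uvregen, then stop/start the engine
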
Thanks,
Stan

Posted: Fri Dec 09, 2005 6:12 pm
by ray.wurlod
Widening the lock tables, which is what Stan suggests, reduces the probability of a clash of two record IDs hashing to the same lock-controlling semaphore.

It will not work in this case, because the lock ID ("RT_CONFIG2660") is identical (and will always therefore hash to the same lock-controlling semaphore) for all nine instances.

Can you afford to SLEEP for a longer period than 60 seconds between invocations - say 75 seconds? That should guarantee no contention.

Posted: Tue Dec 13, 2005 5:20 am
by rsaliah
For information: the only change we made was to start 4 of the 9 instances 30 minutes later, and it appears to have fixed the problem. We've not had a recurrence of the problem since. I suspect that the network was being hammered, because the overall process time has reduced by 2 minutes even though part of it starts 30 minutes later.