"Unable to lock RT_CONFIG2660 file"

rsaliah
Participant
Posts: 65
Joined: Thu Feb 27, 2003 8:59 am

"Unable to lock RT_CONFIG2660 file"

Post by rsaliah »

Hi Gents,

I have a multi-instance batch that's called 9 times, which in turn runs a bunch of multi-instance jobs. Everything works as expected, but occasionally a batch instance aborts with "(fatal error from DSRunJob): Job control fatal error (-14)
(DSRunJob) Job JobName.Inst1 appears not to have started after 60 secs".

Since it happens randomly and on different jobs, I've assumed that it's down to the server being over-utilised at that particular time. I haven't checked to confirm the assumption yet (that's my next step), but I've also noticed that in some cases the job that fails to start has a log info entry of "Unable to lock RT_CONFIG2660 file". What seems odd to me is that the log has no other entries and doesn't show that any attempt has been made to start the job, yet the timestamp of the info entry coincides with the 60-second timeout.

I was hoping someone could tell me whether this problem could be down to system/kernel parameters before I start monitoring server activity.

Thanks,
Regu.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: "Unable to lock RT_CONFIG2660 file"

Post by chulett »

rsaliah wrote: Since it happens randomly and on different jobs, I've assumed that it's down to the server being over-utilised at that particular time.
Yes, that's exactly what that means, especially when you say you have a multi-instance batch that kicks off "a bunch" of multi-instance jobs.

You could take the time to verify that your kernel parameters are OK per the Installation Guide; that's always a good thing. More than likely, though, this will need to be solved by adjusting values in the uvconfig file. Search the forum for parameters like T30FILE to get an idea of what the issue could be.
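To see which values are actually in effect before changing anything, here is a minimal Python sketch that pulls parameters out of uvconfig text. It assumes the file holds one `NAME value` pair per line with `#` starting a comment; `read_uvconfig` is a hypothetical helper, and the sample values are the MFILES and T30FILE settings quoted in this thread.

```python
def read_uvconfig(text: str) -> dict[str, str]:
    """Parse uvconfig-style text into {PARAMETER: value}.

    Assumption: each setting is a 'NAME value' line and '#' starts a comment.
    """
    params: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        parts = line.split(None, 1)  # name, then the rest of the line as the value
        if len(parts) == 2:
            params[parts[0].upper()] = parts[1].strip()
    return params


if __name__ == "__main__":
    sample = """
    # sample values taken from this thread
    MFILES 450
    T30FILE 450
    """
    cfg = read_uvconfig(sample)
    for name in ("MFILES", "T30FILE", "RLTABSIZE", "GLTABSIZE", "MAXRLOCK"):
        print(name, cfg.get(name, "<not set>"))
```

In practice you would pass it the contents of the uvconfig file in the DataStage engine directory and compare the output with the Installation Guide.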
-craig

"You can never have too many knives" -- Logan Nine Fingers
rsaliah
Participant
Posts: 65
Joined: Thu Feb 27, 2003 8:59 am

Re: "Unable to lock RT_CONFIG2660 file"

Post by rsaliah »

OK - I've checked the server while the jobs were running, and at the time the error occurred it was approximately 45% idle.

On the project in question, 10 jobs were already running at the time. These 10 and the one that failed are very simple in design and have no hashed file stages or routine calls. They read from UniData and write to sequential files and OCI.

The uvconfig file parameters appear to be more than sufficient for the processing:

MFILES 450
T30FILE 450
ulimit -a for the user running the jobs shows:

time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        8192
coredump(blocks)     unlimited
nofiles(descriptors) 1024
vmemory(kbytes)      unlimited
To me it looks like the error shouldn't be happening, so I'm running out of ideas.
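The same limits that ulimit -a prints can also be read from inside a process, which is handy when jobs are launched by a scheduler rather than an interactive shell. A minimal sketch using Python's standard `resource` module (Unix only); the labels are made up to resemble the ulimit output, and Python reports the size limits in bytes, whereas ulimit -a shows kbytes or blocks. Run it as the user that starts the DataStage jobs, since limits are per process.

```python
import resource

# Labels chosen to resemble `ulimit -a`; values come from getrlimit() in bytes.
LIMITS = {
    "nofiles(descriptors)": resource.RLIMIT_NOFILE,
    "stack(bytes)": resource.RLIMIT_STACK,
    "data(bytes)": resource.RLIMIT_DATA,
    "coredump(bytes)": resource.RLIMIT_CORE,
}


def describe(limit: int) -> str:
    """Render a limit value the way ulimit does for 'no limit'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def soft_limits() -> dict[str, str]:
    """Return the soft limit (the one actually enforced) for each resource."""
    return {label: describe(resource.getrlimit(res)[0]) for label, res in LIMITS.items()}


if __name__ == "__main__":
    for label, value in soft_limits().items():
        print(f"{label:22} {value}")
```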

I think the key to the answer might be the info message from my earlier post:
Unable to lock RT_CONFIG2660 file
but apart from dodgy settings in uvconfig, I'm not sure how else this could occur.

Any suggestions would be appreciated.

Regu.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The problem's with locks, not with sizes of anything.

Check whether RT_CONFIG2660 is already locked using the list_readu command.

Restart DataStage when nothing is running. This guarantees that all (memory-based) locks are cleared. Then try the job again and let us know the outcome.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rsaliah
Participant
Posts: 65
Joined: Thu Feb 27, 2003 8:59 am

Post by rsaliah »

Thanks Ray,

I couldn't get the command to work, but I did check for locks using DS.TOOLS and couldn't see anything before I ran the process.

The process calls a multi-instance job 9 times, and it's one of these instances that fails to start after 60 seconds and shows the lock message. A different job is affected each time, and occasionally the run succeeds. If I rerun the failed instance immediately afterwards, it works.

There must be something locking so I'll keep digging.

Regu.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

All instances need to take short-lived locks on the same RT_CONFIG file during startup. See if you can stagger the startup requests by a small amount, say five seconds apart. A SLEEP 5 between invocations will do it for you.
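The staggering idea can be sketched in Python as a controller that issues one start request per instance with a fixed pause in between. `start_staggered` and its `start_instance` hook are hypothetical stand-ins: in a DS Basic job control routine the launch would be the DSRunJob call and the pause would be SLEEP 5.

```python
import time
from typing import Callable, List


def start_staggered(instances: List[str],
                    start_instance: Callable[[str], None],
                    gap_seconds: float = 5.0) -> List[float]:
    """Start each instance in turn, pausing gap_seconds between requests.

    start_instance is a hypothetical hook standing in for the real launch call.
    Returns the monotonic time at which each start request was issued.
    """
    issued: List[float] = []
    for i, name in enumerate(instances):
        if i > 0:
            time.sleep(gap_seconds)  # spread out the short-lived startup locks
        issued.append(time.monotonic())
        start_instance(name)
    return issued


if __name__ == "__main__":
    names = [f"JobName.Inst{i}" for i in range(1, 10)]
    stamps = start_staggered(names, lambda n: print("starting", n), gap_seconds=0.1)
    print(f"spread: {stamps[-1] - stamps[0]:.1f}s")
```

The pause only reduces the chance that two instances request the startup lock at the same moment; it does not remove the shared lock itself.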
rsaliah
Participant
Posts: 65
Joined: Thu Feb 27, 2003 8:59 am

Post by rsaliah »

Thanks Ray,

Tried your suggestion and it still failed with the 60-second wait problem, only this time the called job didn't log the "Unable to lock RT_CONFIG2660 file" message.

Although I can't prove it yet, the only remaining cause I can see is server/network load. The process isn't particularly CPU-intensive, but it does use the network quite heavily. So we delayed the start of part of the processing, which last night appeared to solve the problem. Unfortunately we're not the only users on the server or the DataStage installation, so it could be that it was just a quiet time and we were lucky with our timing.

Thanks for your help.

Regu.
stan_taylor
Charter Member
Posts: 14
Joined: Tue Mar 04, 2003 3:27 pm

Post by stan_taylor »

What are your settings for the following uvconfig parameters:
  • RLTABSIZE
  • GLTABSIZE
  • MAXRLOCK
We had a similar problem a while back, and the job startup would time out because of constraints on the message queues. We were advised to try the following:
  • RLTABSIZE 150
  • GLTABSIZE 150
  • MAXRLOCK 149
You may want to give that a try. These values were for Solaris; I understand the same message queue issue can arise on HP, but I don't know what the recommended values would be there.
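Stan's suggested values keep MAXRLOCK one below RLTABSIZE. A minimal Python check of that relationship, assuming the rule is simply that MAXRLOCK must be smaller than RLTABSIZE (confirm against the uvconfig documentation for your release); `check_lock_tables` is a hypothetical helper that takes values already read from uvconfig.

```python
def check_lock_tables(cfg: dict[str, int]) -> list[str]:
    """Return warnings about the uvconfig lock-table settings.

    Assumption: MAXRLOCK should stay below RLTABSIZE (the values suggested
    in this thread use RLTABSIZE - 1).
    """
    warnings: list[str] = []
    missing = [k for k in ("RLTABSIZE", "GLTABSIZE", "MAXRLOCK") if k not in cfg]
    if missing:
        warnings.append("not set: " + ", ".join(missing))
        return warnings
    if cfg["MAXRLOCK"] >= cfg["RLTABSIZE"]:
        warnings.append("MAXRLOCK should be less than RLTABSIZE")
    return warnings


if __name__ == "__main__":
    suggested = {"RLTABSIZE": 150, "GLTABSIZE": 150, "MAXRLOCK": 149}
    print(check_lock_tables(suggested))  # → []
```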
Thanks,
Stan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Widening the lock tables, which is what Stan suggests, reduces the probability of a clash of two record IDs hashing to the same lock-controlling semaphore.

It will not work in this case, because the lock ID ("RT_CONFIG2660") is identical (and will always therefore hash to the same lock-controlling semaphore) for all nine instances.
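Ray's point can be illustrated with a toy model in which a lock ID is hashed modulo the table size to pick a semaphore slot. The CRC32 hash here is an arbitrary stand-in, not the hash UniVerse actually uses; the sketch only shows that identical IDs land in one slot whatever the table size, while distinct IDs can spread out as the table grows.

```python
import zlib


def lock_slot(record_id: str, table_size: int) -> int:
    """Map a lock ID to a semaphore slot (illustrative hash, NOT UniVerse's)."""
    return zlib.crc32(record_id.encode("utf-8")) % table_size


if __name__ == "__main__":
    same_ids = ["RT_CONFIG2660"] * 9                # nine instances, one lock ID
    different_ids = [f"RECORD{i}" for i in range(9)]  # nine unrelated IDs
    for size in (75, 150):
        same = {lock_slot(r, size) for r in same_ids}
        diff = {lock_slot(r, size) for r in different_ids}
        print(f"table {size}: identical IDs -> {len(same)} slot(s), "
              f"distinct IDs -> {len(diff)} slot(s)")
```

However large the table, the nine identical IDs always report a single slot, which is why widening the lock tables cannot help here.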

Can you afford to SLEEP for longer than 60 seconds between invocations, say 75 seconds? That should guarantee no contention.
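The arithmetic behind the 75-second suggestion, as a small Python sketch. It assumes evenly spaced start requests and treats the 60-second DSRunJob start timeout as the window during which one instance's startup could collide with the next; both function names are hypothetical.

```python
def windows_overlap(instances: int, gap: float, timeout: float = 60.0) -> bool:
    """True if any start request is issued before the previous request's
    window [start, start + timeout) has closed.

    Assumes evenly spaced requests, so checking neighbours is enough.
    """
    starts = [i * gap for i in range(instances)]
    return any(later < earlier + timeout for earlier, later in zip(starts, starts[1:]))


def last_start_offset(instances: int, gap: float) -> float:
    """Seconds after the first request at which the final one is issued."""
    return (instances - 1) * gap


if __name__ == "__main__":
    for gap in (5, 75):
        print(f"gap {gap}s: overlap={windows_overlap(9, gap)}, "
              f"last start after {last_start_offset(9, gap):.0f}s")
```

The trade-off is elapsed time: for nine instances a 75-second gap clears every 60-second window, but the last instance starts 600 seconds (10 minutes) after the first, against 40 seconds with a 5-second gap.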
rsaliah
Participant
Posts: 65
Joined: Thu Feb 27, 2003 8:59 am

Post by rsaliah »

For information: the only change we made was to start 4 of the 9 instances 30 minutes later, and it appears to have fixed the problem. We've not had a recurrence since. I suspect the network was being hammered, because the overall process time has dropped by 2 minutes even though part of it starts 30 minutes later.