"Unable to lock RT_CONFIG2660 file"
Moderators: chulett, rschirm, roy
"Unable to lock RT_CONFIG2660 file"
Hi Gents,
I have a multi-instance batch that's called 9 times which in turn runs a bunch of multi-instance jobs. All is working as expected/required, but occasionally I get an aborted batch instance complaining of a "(fatal error from DSRunJob): Job control fatal error (-14)
(DSRunJob) Job JobName.Inst1 appears not to have started after 60 secs".
Since it happens randomly and on different jobs I've assumed that it down to the server being over-utilised at that particular time. I haven't checked to confirm the assumption, that's my next step, but what I've also noticed is that in some cases the job that fails to start has a log info entry of "Unable to lock RT_CONFIG2660 file". What seems odd to me is that the log has no other entry and doesn't show that any attempts been made to start it yet the timing of the info corresponds to the 60 second timeout thing.
What I was hoping is that someone could tell me whether this problem could be down to system/kernel parameters before I try monitoring server activity.
Thanks,
Regu.
Re: "Unable to lock RT_CONFIG2660 file"
rsaliah wrote: Since it happens randomly and on different jobs I've assumed that it's down to the server being over-utilised at that particular time.

Yes, that's exactly what that means. Especially when you say you have a multi-instance batch that kicks off "a bunch" of multi-instance jobs.
You could take the time to verify your kernel parameters are ok per the Installation Guide, that's always a good thing. More than likely, this will need to be solved by adjusting values in the uvconfig file. Search the forum for things like T30FILES to get an idea of what could be the issue.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
Re: "Unable to lock RT_CONFIG2660 file"
OK - I've checked the server as the jobs were running, and at the time the error occurred it was approximately 45% idle.

On the project in question there were 10 jobs already running at the time. These 10 and the one that failed are very simple in design and have no hash file stages or routine calls. They source from UniData and target sequential file and OCI stages.

The uvconfig file parameters appear to be more than sufficient for the processing:

Code: Select all
MFILES 450
T30FILE 450

ulimit -a for the user running the jobs shows:

Code: Select all
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        8192
coredump(blocks)     unlimited
nofiles(descriptors) 1024
vmemory(kbytes)      unlimited

To me it looks like the error shouldn't be happening, so I'm running out of ideas. I think the key to the answer might be the info message from my earlier post:

Unable to lock RT_CONFIG2660 file

but apart from dodgy settings in uvconfig I'm not sure how else this could occur.

Any suggestion would be appreciated.
Regu.
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
The problem's with locks, not with sizes of anything.
Check whether RT_CONFIG2660 is already locked using the list_readu command.
Restart DataStage when there is nothing happening. This will guarantee that all (memory-based) locks are cleared. Then try the job again, and let us know the outcome.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
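If list_readu (or LIST.READU from the uv shell) produces a lot of output, a rough filter can narrow it to the file in question. A hedged sketch - the sample lock line below is invented for illustration, since the real column layout varies by engine release:

```python
# Sketch: scan LIST.READU-style output for lock lines mentioning a given
# RT_CONFIG file. This is a rough substring filter, not a parser of the
# real (release-dependent) column layout.
def locks_on(readu_output, filename):
    """Return the lines of lock-listing output that mention filename."""
    return [line for line in readu_output.splitlines() if filename in line]

# Invented sample output, for illustration only.
sample = """\
Active Record Locks:
Device.... Inode..... Netnode Userno Lmode Pid.... Item-ID.............
      1234     567890       0     42    RU    9876 RT_CONFIG2660
"""

print(locks_on(sample, "RT_CONFIG2660"))
```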
Thanks Ray,
I couldn't get the command to work but I did check for locks using DS.TOOLS and couldn't see anything before I ran the process.
The process calls a multi-instance job 9 times and it's one of these instances that fails to start after 60 seconds and shows the lock message. It affects a different job each time and is occasionally successful. If I rerun the instance immediately after the failure then it works.
There must be something locking so I'll keep digging.
Regu.
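Since an immediate rerun succeeds, one pragmatic workaround is to retry a failed start from the job-control code. The real logic would live in DataStage BASIC around DSRunJob; the sketch below only illustrates the retry shape in Python, and start_job is a hypothetical stand-in:

```python
# Sketch of the retry idea: if an instance fails to start (the -14 /
# "appears not to have started after 60 secs" case), wait briefly and try
# again. start_job is a hypothetical callable standing in for the real
# DSRunJob/DSWaitForJob sequence.
import time

def start_with_retry(start_job, instance, retries=2, delay=5):
    """Attempt to start an instance, retrying after a short pause."""
    for attempt in range(retries + 1):
        if start_job(instance):          # True once the job has started
            return True
        if attempt < retries:
            time.sleep(delay)            # give the lock holder time to release
    return False
```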
All instances will need to take short-lived locks on the same RT_CONFIG file during startup. See if you can spread the startup requests by a small amount, say five seconds apart. SLEEP 5 will do it for you.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
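Ray's SLEEP 5 idea - spreading the startup requests apart so the instances don't all grab the RT_CONFIG lock at once - looks like this in outline (Python rather than job-control BASIC; the start callback is a placeholder):

```python
# Sketch: stagger the start requests for the nine instances a few seconds
# apart so they don't compete for the RT_CONFIG lock simultaneously.
# The start callback is a hypothetical stand-in for the real job launch.
import time

def staggered_starts(instances, gap=5, sleep=time.sleep, start=lambda inst: None):
    """Launch each instance, pausing `gap` seconds between launches."""
    for n, inst in enumerate(instances):
        if n:
            sleep(gap)   # spread the startup lock requests apart
        start(inst)
```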
Thanks Ray,
Tried your suggestion and it still failed with the 60-second startup timeout, only this time I didn't get the "Unable to lock RT_CONFIG2660 file" message in the job being called.
Although I can't yet prove it the only possible cause has to be server/network load. The process isn't particularly CPU intensive but it does utilise the network quite heavily. So we delayed the start of part of the processing which last night appeared to solve the problem. Unfortunately we're not the only users on the server or DS installation so it could be that it was just a quiet time and we were lucky with our timing.
Thanks for your help.
Regu.
Charter Member
Posts: 14
Joined: Tue Mar 04, 2003 3:27 pm
What are your settings for the following uvconfig parameters:
RLTABSIZE
GLTABSIZE
MAXRLOCK
Thanks,
Stan

RLTABSIZE 150
GLTABSIZE 150
MAXRLOCK 149
Widening the lock tables, which is what Stan suggests, reduces the probability of a clash of two record IDs hashing to the same lock-controlling semaphore.
It will not work in this case, because the lock ID ("RT_CONFIG2660") is identical (and will always therefore hash to the same lock-controlling semaphore) for all nine instances.
Can you afford to SLEEP for a longer period than 60 seconds between invocations - say 75 seconds? That should guarantee no contention.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
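Ray's hashing argument can be illustrated with a toy model: if the lock table maps a lock ID to a semaphore by hashing, widening the table can separate two different IDs, but nine requests for the identical ID "RT_CONFIG2660" always land on the same semaphore. (The crc32 hash below is purely illustrative - the engine's real hash function is internal.)

```python
# Toy model of a lock table: lock ID -> semaphore slot by hash modulo
# table size. Widening the table (RLTABSIZE/GLTABSIZE) only helps when
# *different* IDs collide; identical IDs always map to the same slot.
import zlib

def semaphore_for(lock_id, table_size):
    """Illustrative slot assignment - not the engine's actual hash."""
    return zlib.crc32(lock_id.encode()) % table_size

for size in (150, 1500, 15000):
    slots = {semaphore_for("RT_CONFIG2660", size) for _ in range(9)}
    # All nine instances queue on one semaphore, however wide the table.
    assert len(slots) == 1
```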
For information: the only change we made was to start 4 of the 9 instances 30 minutes later, and it appears to have fixed the problem. We've not had a recurrence since. I suspect the network was being hammered, because the overall process time has reduced by 2 minutes even though part of the processing starts 30 minutes later.