Mutex errors

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Mutex errors

Post by ShaneMuir »

Hello everybody,

I have searched around the forums for information about 'timeout waiting for mutex' and found some useful threads, but nothing which clears up my problem, which is as follows:

I am running a very simple job from a flat file which uses a couple of hash file lookups - when I run one version it works perfectly. However, when I release the job and run it, I keep getting 'timeout waiting for mutex' errors.

It appears that it has something to do with the Hash file lookups but why does it only happen in the released version of the job? How can this be cleared up? It has been suggested that it could be caused by the speed of the CPU amongst other things. Should I set the timeout period higher or lower?

Any help you can provide me on this would be most appreciated.

Regards
Shane Muir
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

What kinds of things are you doing in your job, what stages are you using? IPC? Do you have Row Buffering enabled? What operating system? When you release this job, do you migrate it to another (production) server or is it running on the same server as the unreleased job?
ShaneMuir wrote:Should I set the timeout period higher or lower?
Higher. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Thanks for the quick response there Craig,

In response to your questions

1. What kind of things am I doing in the job - it's all very simple (or at least I thought it was):
  a. Flat file comes in
  b. Transform adds a couple of extra fields based on input data
  c. Rather large transform - 3 hash files which are simple mappings for country codes, currency codes and account groupings. Also in this transform stage are approx 17 output streams.
  d. Each stream after that produces a separate line for a flat file
  e. Collector stage to collect all 17 output streams
  f. Flat file output
2. Row buffering is enabled, as is the Inter process option.
3. Unix OS
4. Actually, the released job is on the same server as the unreleased job

I set the timeout higher but still get multiple mutex errors (generally about one for each output stream).

Thanks again
Shane
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

ShaneMuir wrote:3. Unix OS
Actually, I meant which specific O/S... HP/UX? It seems to get more than its fair share of mutex errors. :?
I set the timeout higher but still get the multiple mutex errors
What value do you currently have it set to?
-craig

"You can never have too many knives" -- Logan Nine Fingers
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

There are spin tries and spin wait settings in the DataStage config file, but I've yet to find a description of how these should be set to avoid the mutex problem. Your link collector is almost certainly the trigger for the mutex problem; if you can redesign your job to not use one, you should be okay. 17 output streams is a bit unusual - is it possible to redesign your job to have a transformer followed by a pivot stage? The transformer would do all the lookups and append the extra columns; the pivot would break each row into up to 17 rows based on the values in the extra columns.
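For reference, on Server Edition those tunables live in the engine's uvconfig file. A sketch of what the entries might look like - the SPINTRIES/SPINWAIT names match what is mentioned later in this thread, but the values below are placeholders, not recommendations; check your release's defaults:

```
# $DSHOME/uvconfig (illustrative excerpt only)
SPINTRIES 100    # retries against a spinning semaphore before timing out
SPINWAIT  10     # wait between retries
```

On the UniVerse-based engine a uvconfig change typically has to be regenerated (bin/uvregen) and the engine restarted before it takes effect.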
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Craig:
In answer to your question, the operating system is HP-UX 11i, and the timeout was originally set at 10 secs; I moved it up to 300 secs. The thing is that the timeout errors occur after about 6 seconds anyway.

Vincent:
Unfortunately I don't think the Pivot stage will work, as each of the output streams has a different data structure (e.g. one has 36 fields, the next has 91, etc.). It's almost like XML but just different enough to be annoying.

Thanks again guys for all your input on this.
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

I can't remember if the Link Collector requires Row Buffering and/or Inter Process to be enabled. Have you tried disabling Row Buffering?

Tony
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Link Collector requires row buffering enabled. :cry:

(You can know this because of the call to ipcopen() in the error message - ipc is "inter process communication".) Also, it's in the manual. :wink:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

I thought it might, but couldn't find it when I looked in the manual. Ah well, at least I was trying to help.

Tony
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Hi again everyone

Just to let you know, we still haven't found the cause of the problem. As it turns out, it was not an isolated problem: another part of the project on a separate site had the same issue. So at present we have migrated only the job executable to production, and it works fine. Will get back to you when we find a permanent solution.

Thanks again for all your help
Shane Muir
jeredleo
Participant
Posts: 74
Joined: Thu Jun 19, 2003 8:49 am

Post by jeredleo »

Just curious which version of DS you have - 7.x? I saw that you use Link Collector and then continue on to indicate that your 17 output files have different layouts. I didn't think you could use Link Collector to collect multiple layouts?

I also know that we upgraded to 7.5 a couple of months ago and had problems where the keys set up on the input files all have to match, as well as on your output file. This wasn't a problem in earlier releases of DS, as far as having some inputs with a key identified and some without. However, with the 7.5 release we ran into problems where, depending on how it was set up, our jobs would abend and garbage up the output stream, or in other cases would actually drop records. Just a heads up.

Now in regards to the mutex errors: we had problems adjusting the performance tab on the job. If you have it set to get project defaults, then when moving to production and 're-compiling' the job, I would have to assume it would pull in your production project's defaults. Have your DS Admin verify that the project defaults are the same between the two projects. Just something to look at.

JB
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

MUTEX locks - also called "smart semaphores" - live in the operating system.

Instead of waiting, asleep, on a regular semaphore and waking when notified, the idea of a smart semaphore is to wait, retrying, so that you can wake faster.

Which is good in theory.

As machines became faster, the limit on retries before abandoning the wait could be achieved more quickly, which is what causes the errors. By increasing the allowed number of retries (SPINTRIES) or the time for which one is prepared to wait (SPINWAIT), you should be able to reduce the frequency of these errors.
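The retry-then-give-up behaviour described here can be sketched in a few lines of Python. This is illustrative only - DataStage's actual mutex handling lives inside the engine, and the spin_tries/spin_wait names below simply mirror the SPINTRIES/SPINWAIT tunables:

```python
import threading
import time

def spin_acquire(lock, spin_tries=100, spin_wait=0.01):
    """Illustrative spin-then-give-up acquire: retry the lock
    spin_tries times, sleeping spin_wait seconds between tries,
    then report failure -- roughly the behaviour behind a
    'timeout waiting for mutex' error."""
    for _ in range(spin_tries):
        if lock.acquire(blocking=False):
            return True
        time.sleep(spin_wait)
    return False  # retries exhausted: the "timeout" case

lock = threading.Lock()
lock.acquire()  # some other "process" holds the resource
print(spin_acquire(lock, spin_tries=5, spin_wait=0.001))  # False: timed out
lock.release()
print(spin_acquire(lock))  # True: acquired on the first try
```

On a faster machine each failed try costs less wall-clock time, so a fixed retry count is exhausted sooner - which is why raising the retry count (or wait) reduces the error frequency without fixing the underlying contention.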

But they will still occur if you have to wait too long for a resource.

In DataStage this might, for example, indicate a badly tuned set of lock tables, extensive use of directory-type files (which only have "group 1"), or just waiting too long for some resource that is governed by use of a semaphore for single threading, such as accessing the T30FILE table or the disk cache file/free chains.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Hi again

In response to the question about the link collector with 17 different inputs, I suppose I should have mentioned that each stream, although different, is transformed into a single concatenated text field so that the rows are the same length and can be fed into a link collector before being sent to a single text output.
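As an aside, that flattening trick is easy to illustrate outside DataStage. A minimal Python sketch with hypothetical field values - in the real job this would be transformer derivations rather than Python:

```python
def flatten(fields, width=120, sep="|"):
    """Join a variable-layout record into one delimited text field,
    padded (or truncated) to a fixed width so every stream's rows
    look identical to the downstream collector."""
    row = sep.join(str(f) for f in fields)
    return row.ljust(width)[:width]

stream_a = flatten(["GB", "GBP", "ACCT001"])          # 3-field layout
stream_b = flatten(["US", "USD", "ACCT002", "X", 9])  # 5-field layout
print(len(stream_a) == len(stream_b))  # True: both rows are exactly 120 chars
```

Once every stream emits the same single fixed-width column, the collector only ever sees one layout, regardless of how many fields each source row started with.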

I'm also checking now with the Admin to see if the server settings are the same.

SM