Time Out waiting for Mutex Error

trojan · Post by **trojan** » Wed Mar 28, 2007 10:22 pm

A job failed with an error "ds_ipcgetnext - timeout waiting for mutex".

Our operating system is Sun Solaris and we use DataStage Server Edition 7.5.

This job is scheduled to run within a particular time window, and it fails only during that window. If we run it manually outside that window, it runs perfectly well. Suspecting a database issue (many processes running simultaneously within that window), we contacted our DBAs, but they said that database activity is normal. We also tried various performance improvement options.

We set the performance parameters such as buffer size and timeout to the maximum possible (buffer size 1024 m and timeout 600 sec), but the job still failed.

We also tried changing the configuration file parameters spintries and spinsleep, without any success.

Job details: The job includes a Link Collector which collects the data from 4 stored procedures and passes it to a Transformer. The Transformer then calls a stored procedure. The job fails on this link to the stored procedure.

If anyone has faced the same error or knows the resolution, please share.

ray.wurlod · Post by **ray.wurlod** » Thu Mar 29, 2007 4:54 am

Welcome aboard. :D

You are not the first to have encountered this problem. It seems from what you have tried that you have already searched the forum for possible answers. It might be useful to place before-stage and after-stage subroutines that execute timing points either side of the Transformer stage, so that you can prove that the delay is in the downstream SP.

trojan · Post by **trojan** » Thu Mar 29, 2007 6:44 am

Is there any other way of getting ideas?

chulett · Post by **chulett** » Thu Mar 29, 2007 6:55 am

Any way other than what? Asking here?

You could try searching the forums here, as 'mutex errors' are not all that uncommon and have been discussed a number of times. You could call your Support provider and make them earn the money you pay them.

Or wait for someone else to answer. Anyone can. Some of us check the site many times a day, most don't - so you may just need to wait for the right person to show up.

I personally don't have any experience with the error so can't offer any advice other than what's already here.

kcbland · Post by **kcbland** » Thu Mar 29, 2007 7:25 am

Junk the Link Collector, write each output link to separate sequential files, concatenate them together, then load into your target table. No more mutex errors.
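The concatenation step in this suggestion can be sketched outside DataStage. The following is a minimal Python illustration, not the actual job design: the function name `concatenate_files` and the file names in the usage note are hypothetical, and it assumes each sequential file ends with a newline so that rows from adjacent files do not run together.

```python
import shutil
from pathlib import Path


def concatenate_files(parts, target):
    """Append each part file to target, in the order given.

    Files are streamed in chunks, so memory use stays flat even
    for large extracts. Assumes every part ends with a newline.
    Returns the size of the combined file in bytes.
    """
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(target).stat().st_size
```

For example, `concatenate_files(["sp1.txt", "sp2.txt", "sp3.txt", "sp4.txt"], "combined.txt")` would build one file from four hypothetical per-link outputs, which a separate job could then load into the target table. Because each link writes to its own file, no stage has to wait on another through a shared buffer.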

chulett · Post by **chulett** » Thu Mar 29, 2007 7:48 am

I should have been more specific - I don't have issues of this nature because I don't use the "problematic" stages like the Link Collector. I do as Ken suggests - separate output files and a post-processing concatenation - this is quick, easy to implement and problem free.

trojan · Post by **trojan** » Thu Mar 29, 2007 10:51 pm

Yeah, I agree with your solution. But the problem is that the job is in production and we cannot modify it.

We started facing this problem during the last 3 weeks. Prior to that, the job had been working fine for more than a year.

So, other than design changes, are there any performance parameters that we can modify?

Thanks a lot for your responses though.

ray.wurlod · Post by **ray.wurlod** » Fri Mar 30, 2007 5:49 am

Ask the DBA why the SP is taking longer to respond than it formerly did. There may be a solution at that end.

Ask the UNIX sys admin whether the overall load has increased since the time when the job was working.

kcbland · Post by **kcbland** » Fri Mar 30, 2007 7:31 am

The job uses a stage that has a built-in timeout. That means on occasion the job will time out. That means it is unstable. Just because it ran for a year doesn't mean it won't time out. Even if it mysteriously starts working again, that doesn't mean it's stable. It can fail again in the future. The only 100% guaranteed solution is to STOP using this stage. Otherwise, it's just a gamble each time you use the Link Collector that it will work.
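The general point about built-in timeouts can be shown with a small, generic sketch. This is not how DataStage implements its inter-process buffers; it is a plain Python illustration in which the names `try_with_timeout` and `simulate` are made up. A waiter with a bounded wait succeeds when the lock holder is fast and fails when the holder is slow, even though the code is identical in both cases.

```python
import threading
import time


def try_with_timeout(lock, timeout_s):
    """Return True if the lock could be acquired within timeout_s seconds."""
    acquired = lock.acquire(timeout=timeout_s)
    if acquired:
        lock.release()
    return acquired


def simulate(hold_s, timeout_s):
    """A 'slow consumer' holds the lock for hold_s seconds while a
    waiter tries to acquire it for at most timeout_s seconds."""
    lock = threading.Lock()
    holding = threading.Event()

    def slow_consumer():
        with lock:
            holding.set()       # signal that the lock is now held
            time.sleep(hold_s)  # stands in for a slow downstream call

    worker = threading.Thread(target=slow_consumer)
    worker.start()
    holding.wait()              # make sure the consumer owns the lock first
    ok = try_with_timeout(lock, timeout_s)
    worker.join()
    return ok
```

Here `simulate(0.05, 1.0)` returns `True` and `simulate(0.5, 0.05)` returns `False`. Raising the timeout only moves the threshold; a sufficiently slow holder will still exceed it, which is why the failure can appear after a long period of clean runs.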

jpr196 · Post by **jpr196** » Mon Apr 02, 2007 12:11 pm

On my current engagement, we had this same error and tried many of the suggested solutions. However, we finally discovered that the cause was that we were running out of room in our tablespace. So, you may want to check with the DBAs to make sure enough space is being allocated, if you haven't already.