ds_ipcgetnext() timeout waiting for mutex

Posted: Tue Mar 27, 2012 5:11 am
by rohitagarwal15
ArndW wrote: sd_ds - I went to search and typed in "ds_ipcgetnext" and got 38 different threads as a result. Most of those looked quite informative.
I have also faced a similar error: "ds_ipcgetnext() timeout waiting for mutex".
We have already set the environment variable "DS_IPCPUT_OLD_TIMEOUT_BEHAVIOR" to 1, but we still get this error.
I searched the forum and found posts about tuning the uvconfig parameters SPINTRIES and SPINSLEEP.
Can anyone explain how I can tune these parameters so that we do not face this error again?
For now, we have recompiled all the jobs and they are running fine.

Posted: Tue Mar 27, 2012 7:23 am
by chulett
Split from this older topic so you can control your own destiny.

First off, confirm for us you are still talking about a Server job on a Windows server as those "SPIN" variables depend entirely on your operating system, from what I recall. Also let us know what DataStage version you are running.

Posted: Mon Apr 02, 2012 2:59 am
by rohitagarwal15
The DataStage version is 8.1 with Fix Pack 2.
The operating system is AIX 6.1.

Posted: Mon Apr 02, 2012 6:40 am
by chulett
Server or Parallel job? What does your job design look like?

Posted: Wed Apr 11, 2012 3:40 am
by rohitagarwal15
chulett wrote: Server or Parallel job? What does your job design look like?
It's a Parallel job. We are using a shared container in the job; apart from that, we are using Transformer, Copy, Join and Funnel stages.

Posted: Wed Apr 11, 2012 4:35 pm
by ray.wurlod
Ignore everything in the error message after "timeout". Everything else is about the mechanism.

The problem is in the inter-process communication (ipc) function that gets the next buffer-ful of data. It has exceeded its wait interval.

Why? There are a number of possible reasons, but they usually hover around the fact that the server or network between servers is overloaded and/or not fast enough.
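
To make the mechanism concrete, here is a minimal Python sketch of a consumer that waits a bounded time for each buffer from a producer. It is only an illustration of the general pattern, not DataStage's implementation; the names produce, consume and run, and the delay and timeout values, are all hypothetical.

```python
import queue
import threading
import time


def produce(buffers, count, delay):
    """Hypothetical upstream stage: emits `count` buffers, `delay` seconds apart."""
    for i in range(count):
        time.sleep(delay)  # stands in for a slow or overloaded producer
        buffers.put(f"buffer-{i}")
    buffers.put(None)  # sentinel: no more data


def consume(buffers, timeout):
    """Hypothetical downstream stage: waits at most `timeout` seconds per buffer."""
    received = []
    while True:
        try:
            item = buffers.get(timeout=timeout)
        except queue.Empty:
            # The wait interval was exceeded before the next buffer arrived.
            raise TimeoutError("timeout waiting for next buffer") from None
        if item is None:
            return received
        received.append(item)


def run(delay, timeout, count=3):
    """Wire one producer to one consumer and return what the consumer received."""
    buffers = queue.Queue()
    threading.Thread(
        target=produce, args=(buffers, count, delay), daemon=True
    ).start()
    return consume(buffers, timeout)
```

With a fast producer, run(delay=0.01, timeout=1.0) returns all three buffers. With a producer slower than the wait interval, run(delay=0.5, timeout=0.05) raises TimeoutError, which is the same kind of failure: nothing is wrong with the consumer, the data simply did not arrive in time.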

Posted: Fri Apr 13, 2012 2:07 am
by rohitagarwal15
Thanks Ray for the guidance and information.
Meanwhile I have spoken to the network team, and they responded that there is no network issue. When the job aborted, we also asked our Unix admin to check server utilization, but they said everything is fine.

Posted: Fri Apr 13, 2012 2:33 am
by kandyshandy
Do you get this error when you rerun the job?

Posted: Fri Apr 13, 2012 9:38 pm
by qt_ky
In DataStage Administrator:

Have you tried increasing the inter-process timeout setting in the project properties from the default of 10 seconds to the maximum of 600 seconds?

Have you tried increasing the project's Parallel, Operator-specific DSIPC_OPEN_TIMEOUT environment variable from the default of 30?
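
If it helps to confirm that a changed value actually reaches the job, a small check like the one below could be run from a process that inherits the job's environment. This is only a sketch: the two variable names come from this thread, while report_settings, the "<not set>" fallback text, and the assumption that you can run Python in that environment are mine.

```python
import os

# Variable names as mentioned in this thread.
NAMES = ("DSIPC_OPEN_TIMEOUT", "DS_IPCPUT_OLD_TIMEOUT_BEHAVIOR")


def report_settings(environ=None):
    """Return the current value of each variable, or '<not set>' if absent.

    `environ` defaults to the process environment; passing a dict makes
    the function easy to try out without touching real settings.
    """
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, "<not set>") for name in NAMES}


if __name__ == "__main__":
    for name, value in report_settings().items():
        print(f"{name}={value}")
```

A variable reported as "<not set>" would suggest the project-level change is not being picked up by the process you are checking.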

Posted: Sun Apr 15, 2012 8:48 pm
by kandyshandy
Try whatever others suggested here. If no luck, contact IBM.

In 2009, when we migrated our jobs from 7.5 to 8.0.1 and eventually 8.1, we got this error. I don't remember exactly how it was fixed, but I think we got a patch from IBM to fix it.

But keep in mind that those were the early days of version 8 ;)

Posted: Mon Apr 16, 2012 3:55 am
by rohitagarwal15
I have increased the value of DSIPC_OPEN_TIMEOUT to 300, and I also added one parameter, DS_IPCPUT_OLD_TIMEOUT_BEHAVIOR, with its value set to 1. After these changes my job ran fine, but the next day it aborted again. When I recompile the job and rerun it, it runs fine. It still aborts nowadays, but not very frequently.

Posted: Mon Apr 16, 2012 2:51 pm
by ray.wurlod
What happens if you reset the aborted job, rather than recompile?

Posted: Mon Apr 16, 2012 7:04 pm
by qt_ky
Which inter-process timeout values have you tried as well? Those are set per project in Administrator.

Have you opened any support case after trying all the setting changes?

Posted: Mon Apr 16, 2012 11:46 pm
by rohitagarwal15
ray.wurlod wrote: What happens if you reset the aborted job, rather than recompile? ...
If I reset the job, it aborts again, but if I recompile it, it runs fine.

Posted: Tue Apr 17, 2012 1:13 am
by kandyshandy
Rohit, even if you revert those two things (the Administrator setting and the environment variable), your job should run fine sometimes and abort sometimes. That's why I asked you initially whether you get this error when you rerun the job. Check the IBM site to see whether this is a known issue.