ds_ipcgetnext() timeout waiting for mutex

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


rohitagarwal15
Participant
Posts: 102
Joined: Thu Sep 17, 2009 1:23 am


Post by rohitagarwal15 »

ArndW wrote: sd_ds - I went to search and typed in "ds_ipcgetnext" and got 38 different threads as a result. Most of those looked quite informative.
I am also facing a similar error: "ds_ipcgetnext() timeout waiting for mutex".
We have already set the environment variable DS_IPCPUT_OLD_TIMEOUT_BEHAVIOR to 1, but we still get this error.
I searched the forum and found something about tuning the uvconfig parameters SPINTRIES and SPINSLEEP.
Can anyone tell me how to tune these parameters so that we don't hit this error again?
For now, we have recompiled all the jobs, and they are running fine.
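For readers unfamiliar with the spin/sleep pattern behind a "timeout waiting for mutex" message, a lock acquire that spins a fixed number of times, sleeps, and gives up after a deadline can be sketched in Python as below. This is a conceptual sketch only, assuming SPINTRIES and SPINSLEEP behave roughly as their names suggest (number of attempts before sleeping, and length of the sleep). The function `acquire_with_spin` and its parameters are hypothetical and are not DataStage or UniVerse code.

```python
import threading
import time


def acquire_with_spin(lock: threading.Lock, spin_tries: int,
                      spin_sleep_s: float, timeout_s: float) -> bool:
    """Hypothetical spin-then-sleep acquisition loop (illustration only).

    Tries a non-blocking acquire up to spin_tries times, then sleeps for
    spin_sleep_s, and repeats until timeout_s has elapsed.
    Returns True if the lock was acquired, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # "Spin": several quick attempts without giving up the CPU.
        for _ in range(spin_tries):
            if lock.acquire(blocking=False):
                return True
        # Still contended: give up once the overall wait has been exceeded.
        if time.monotonic() >= deadline:
            return False
        # Otherwise back off briefly before the next round of spins.
        time.sleep(spin_sleep_s)


if __name__ == "__main__":
    mutex = threading.Lock()
    # Uncontended: acquired on the first spin.
    print(acquire_with_spin(mutex, spin_tries=10, spin_sleep_s=0.001, timeout_s=0.05))
    # Still held from the first call, so every attempt fails until the deadline.
    print(acquire_with_spin(mutex, spin_tries=10, spin_sleep_s=0.001, timeout_s=0.05))
    mutex.release()
```

More spins or longer sleeps change how long a waiter persists and how much CPU it burns, but neither helps if the holder of the mutex is itself stalled.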
Rohit
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Split from this older topic so you can control your own destiny.

First off, confirm for us you are still talking about a Server job on a Windows server as those "SPIN" variables depend entirely on your operating system, from what I recall. Also let us know what DataStage version you are running.
-craig

"You can never have too many knives" -- Logan Nine Fingers
rohitagarwal15
Participant
Posts: 102
Joined: Thu Sep 17, 2009 1:23 am

ds_ipcgetnext() - timeout waiting for mutex

Post by rohitagarwal15 »

DataStage version is 8.1 with Fix Pack 2.
Operating system is AIX 6.1
Rohit
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Server or Parallel job? What does your job design look like?
-craig

rohitagarwal15
Participant
Posts: 102
Joined: Thu Sep 17, 2009 1:23 am

Post by rohitagarwal15 »

chulett wrote: Server or Parallel job? What does your job design look like?
It's a Parallel job. We use a shared container in the job; apart from that, we use Transformer, Copy, Join and Funnel stages.
Rohit
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Ignore everything in the error message after "timeout"; that part only describes the mechanism.

The problem is in the inter-process communication (ipc) function that gets the next buffer-ful of data. It has exceeded its wait interval.

Why? There are a number of possible reasons, but they usually hover around the fact that the server or network between servers is overloaded and/or not fast enough.
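The mechanism Ray describes can be illustrated with a small producer/consumer model in Python: a downstream reader asks for the next buffer and gives up if it does not arrive within its wait interval. This is only an analogy, not DataStage's actual IPC implementation; `producer`, `consume`, `run` and the timing values are hypothetical, with `wait_s` playing roughly the role of the inter-process timeout.

```python
import queue
import threading
import time


def producer(buf: queue.Queue, delay_s: float, n: int) -> None:
    """Upstream process (hypothetical): emits n buffers, pausing delay_s before each."""
    for i in range(n):
        time.sleep(delay_s)  # simulates a slow or overloaded upstream
        buf.put(f"buffer-{i}")
    buf.put(None)  # end-of-data marker


def consume(buf: queue.Queue, wait_s: float) -> list:
    """Downstream process: 'get next buffer', giving up after wait_s seconds."""
    received = []
    while True:
        try:
            item = buf.get(timeout=wait_s)
        except queue.Empty:
            raise TimeoutError(
                f"timed out after {wait_s}s waiting for next buffer") from None
        if item is None:
            return received
        received.append(item)


def run(delay_s: float, wait_s: float, n: int = 3) -> list:
    """Wires one producer thread to one consumer and returns what was received."""
    buf: queue.Queue = queue.Queue()
    t = threading.Thread(target=producer, args=(buf, delay_s, n), daemon=True)
    t.start()
    return consume(buf, wait_s)
```

With a fast upstream, `run(delay_s=0.01, wait_s=0.5)` receives all three buffers; with a slow one, `run(delay_s=0.5, wait_s=0.05)` raises `TimeoutError`. Raising the wait interval masks a slow upstream rather than fixing it.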
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rohitagarwal15
Participant
Posts: 102
Joined: Thu Sep 17, 2009 1:23 am

Post by rohitagarwal15 »

Thanks Ray for the guidance and information.
Meanwhile, I have spoken to the network team, but they say there is no network issue. Also, when the job aborted, we asked our Unix admin to check server utilization, and they also said everything was fine.
Rohit
kandyshandy
Participant
Posts: 597
Joined: Fri Apr 29, 2005 6:19 am
Location: Singapore

Post by kandyshandy »

Do you get this error when you rerun the job?
Kandy
_________________
Try and Try again…You will succeed at last!!
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

In DataStage Administrator:

Have you tried increasing the inter-process timeout setting in the project properties from the default of 10 seconds to the maximum of 600 seconds?

Have you tried increasing the project's Parallel, Operator-specific DSIPC_OPEN_TIMEOUT environment variable from the default of 30?
Choose a job you love, and you will never have to work a day in your life. - Confucius
kandyshandy
Participant
Posts: 597
Joined: Fri Apr 29, 2005 6:19 am
Location: Singapore

Post by kandyshandy »

Try whatever others suggested here. If no luck, contact IBM.

In 2009, when we migrated our jobs from 7.5 to 8.0.1 and eventually 8.1, we got this error. I don't remember exactly how it was fixed, but I think we got a patch from IBM.

But keep in mind that was in the early days of version 8 ;)
Kandy
rohitagarwal15
Participant
Posts: 102
Joined: Thu Sep 17, 2009 1:23 am

Post by rohitagarwal15 »

I have increased DSIPC_OPEN_TIMEOUT to 300 and also added the DS_IPCPUT_OLD_TIMEOUT_BEHAVIOR parameter with a value of 1. After these changes the job ran fine, but the next day it aborted again; when I recompiled and reran it, it ran fine. These days it still aborts, but not very often.
Rohit
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What happens if you reset the aborted job, rather than recompile?
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Also, what inter-process timeout values have you tried? Those are set per project in Administrator.

Have you opened any support case after trying all the setting changes?
rohitagarwal15
Participant
Posts: 102
Joined: Thu Sep 17, 2009 1:23 am

Post by rohitagarwal15 »

ray.wurlod wrote: What happens if you reset the aborted job, rather than recompile? ...
If I reset the job, it aborts again, but if I recompile it, it runs fine.
Rohit
kandyshandy
Participant
Posts: 597
Joined: Fri Apr 29, 2005 6:19 am
Location: Singapore

Post by kandyshandy »

Rohit, even if you revert those two changes (the Administrator setting and the environment parameter), your job may still run fine sometimes and abort other times. That's why I asked you initially whether you get this error when you rerun. Check the IBM site to see if this is a known issue.
Kandy