19 jobs failed with ds_ipcgetnext

PaulS · Post by **PaulS** » Sun Sep 23, 2012 2:48 pm

Hi, sorry about the title - I didn't want this dismissed with the usual "timeout waiting for mutex" answer! I know about row/buffers etc...

Anyway, here's the problem. On Friday, one job, a pull from a SQL Server DB, failed with "timeout waiting for mutex". I reset and reran it, and it failed pretty much straight away, with other jobs running. I reran it on Saturday morning, again with other jobs running, and it completed successfully.

Tonight, the same job, along with 18 others (in the same category), failed with mutex errors. I have other jobs in other categories running to completion without issue. Only the jobs in this particular category failed: all from the same source, and all at pretty much the same time.

As well as the ds_ipcput() error, I'm also getting a ds_ipcgetnext() error thereafter. Most server jobs have this error, and most have it in a CInterProcess stage.

Nothing has changed, no upgrades to DS or its jobs. I can't speak for the source system however.

Any help very much appreciated!!!

Thanks in advance
Paul

SURA · Post by **SURA** » Sun Sep 23, 2012 5:11 pm

Just a couple of general questions:

1) Have these jobs been running for a while without any issues?

2) Though there are no changes on the DS side, have any changes been made to the network / SQL Server / OS?

ray.wurlod · Post by **ray.wurlod** » Sun Sep 23, 2012 6:26 pm

This is an example of a case where you may benefit from slightly increasing the buffer timeout value.

It's still related to total load on the machine, but if you can allow the IPC buffers a bit more grace time, you *should* get fewer timeouts.
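
To illustrate why a loaded machine produces these timeouts, here is a minimal Python sketch. It is not DataStage code and makes no claim about how the IPC stage is implemented internally; it only models the general pattern of a writer putting rows into a small bounded buffer with a timeout while a reader drains it. All names (`run_pipeline`, `buffer_size`, `put_timeout`, `consumer_delay`) are hypothetical.

```python
import queue
import threading
import time


def run_pipeline(rows, buffer_size, put_timeout, consumer_delay):
    """Push `rows` items through a bounded buffer.

    Returns True if every put succeeded, False if a put timed out
    because the reader did not free a slot in time.
    """
    buf = queue.Queue(maxsize=buffer_size)
    stop = threading.Event()

    def consumer():
        # The reader: takes one row, then "works" for consumer_delay seconds.
        # A large delay stands in for a process starved of CPU.
        while not stop.is_set():
            try:
                buf.get(timeout=0.05)
            except queue.Empty:
                continue
            time.sleep(consumer_delay)

    reader = threading.Thread(target=consumer, daemon=True)
    reader.start()
    try:
        for row in range(rows):
            # The writer blocks here when the buffer is full and gives up
            # after put_timeout seconds.
            buf.put(row, timeout=put_timeout)
        return True
    except queue.Full:
        return False
    finally:
        stop.set()
        reader.join()


if __name__ == "__main__":
    # Slow reader, short grace time: the writer times out.
    print(run_pipeline(rows=20, buffer_size=2, put_timeout=0.02, consumer_delay=0.2))
    # Same buffer, fast reader and a longer grace time: all rows get through.
    print(run_pipeline(rows=20, buffer_size=2, put_timeout=2.0, consumer_delay=0.01))
```

In this toy model, raising `put_timeout` only helps while the reader's stalls stay shorter than the timeout, which is why a longer grace time reduces timeouts under load rather than eliminating them.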

Kryt0n · Post by **Kryt0n** » Sun Sep 23, 2012 6:42 pm

What's the timeout setting on the IPC stages? Are your jobs hitting this timeout value?

chulett · Post by **chulett** » Sun Sep 23, 2012 8:15 pm

One of the reasons I very rarely used them darn IPC stages.

PaulS · Post by **PaulS** » Mon Sep 24, 2012 12:24 am

SURA wrote: Just a couple of general questions:

1) Have these jobs been running for a while without any issues?

2) Though there are no changes on the DS side, have any changes been made to the network / SQL Server / OS?

Yes, these jobs have been running in 8.5 since I upgraded in April. No previous issues until Friday's one job failure, and now all these 19.

We've had no network or o/s changes - not sure about the DB. I wasn't informed of any changes.

All the IPC stages are at defaults:
Buffer: 128 KB
Timeout: 10 seconds
Yes, it looks as though they are hitting the 10 seconds and erroring.

I can up all of them - but I don't understand why they're failing now. The category has 195 jobs, and 19 failed with this error.

PaulS · Post by **PaulS** » Mon Sep 24, 2012 5:11 am

One quick question - would this error ever be thrown if the connection to the source database dropped?

Kryt0n · Post by **Kryt0n** » Mon Sep 24, 2012 5:25 pm

You mentioned all 19 were hitting the same DB; are these the only ones to hit that DB? If so, your culprit is almost certainly a change on the DB front. How long do the queries take to run in a DB client? What database is it?

Is there any load on either the DataStage server or the DB server when trying to run the jobs? How many of these are you running at one time?

PaulS · Post by **PaulS** » Tue Sep 25, 2012 6:09 am

I have probably 7 or so jobs running simultaneously into the same database. The only ones which are causing a problem are the IPC jobs.

Sorry, I'm starting to understand this a little more.

I got our Unix administrator to report out the process activity over the period. There was a massive CPU spike at the time the jobs started to go wrong. The data didn't show load (how many processes were waiting), but given the utilisation of the 4 CPUs, I suspect this is where the problem is.

I am going to up the project's timeout parameter to 20 seconds... I have some questions which looked to not be answered here...

jdsmith575210 · Post by **jdsmith575210** » Tue Sep 25, 2012 10:45 am

We saw problems with IPC stages whenever the column metadata (datatype, length, display) defined in the stage didn't match what was coming from the source. Correcting the metadata helped but never resolved all of our problems. In the end, we removed the IPC stages whenever a job would fail with this error.

I don't see any mention of what you upgraded from or what database you're using, but we experienced lots of strange errors when upgrading from 7.5.2 to 8.1. You may want to look at what patches you had installed on your previous version and see if something similar needs to be applied to 8.5. We struggled for a year before discovering a patch that needed to be applied to our new environment.

SURA · Post by **SURA** » Tue Sep 25, 2012 8:48 pm

PaulS wrote: I am going to up the project's timeout parameter to 20 seconds... I have some questions which looked to not be answered here...

Nope; to me, you are pushing the issue back, not solving it. Based on my understanding, if the load is delayed due to network traffic or any other reason, you will face this issue again.

I am not sure about your job design.

1) Writing the data into a file and then using a separate load job could resolve it (95%).
2) Replace the IPC with a file.

PaulS · Post by **PaulS** » Thu Sep 27, 2012 2:11 am

We also upgraded from 7.5.2. I hit mutex errors in 8.5 in a job using a link partitioner/collector. This is the first time I've seen it in an IPC.

From the mass of documents I've read, it appears IPCs are more trouble than they are worth. Unfortunately my category has 195 jobs, each with two IPCs. I'm not about to rewrite them all.

I've been looking at the sequencer and instead of upping the timeouts, I'm going to resequence the calling of the jobs. I have 7 strands running simultaneously. The whole sequence takes 25 mins, and a couple of the strands complete in 10 mins. I'll combine them and take some of the weight off the early period of heavy utilisation. There is some scope to smooth it out further if needed.

Thanks for everyone's help here - very much appreciated!
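
As a rough way to reason about that resequencing, the Python sketch below counts how many strands are active in each minute before and after chaining the two short strands one after the other. Only the 7 strands, the 25-minute total and the two 10-minute strands come from the figures above; the other durations are made-up placeholders, and `concurrency_profile` is a hypothetical helper, not anything DataStage provides.

```python
def concurrency_profile(strands, horizon):
    """Count active strands per minute.

    strands: list of (start_minute, duration_minutes) tuples.
    horizon: total number of minutes to profile.
    """
    profile = [0] * horizon
    for start, duration in strands:
        for minute in range(start, min(start + duration, horizon)):
            profile[minute] += 1
    return profile


# Hypothetical durations: all 7 strands start together at minute 0.
before = [(0, 25), (0, 22), (0, 20), (0, 18), (0, 15), (0, 10), (0, 10)]

# Same work, but the second 10-minute strand now starts when the first ends.
after = [(0, 25), (0, 22), (0, 20), (0, 18), (0, 15), (0, 10), (10, 10)]

if __name__ == "__main__":
    print("peak before:", max(concurrency_profile(before, 25)))
    print("peak after: ", max(concurrency_profile(after, 25)))
```

With these placeholder numbers the peak drops from 7 concurrent strands to 6 while the sequence still finishes within 25 minutes; real strand durations would have to come from the job logs.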

Paul

DSXchange