
Fifo \pipe error

Posted: Thu Oct 02, 2014 10:58 am
by taylor.hermann
Hi,

I've been dealing with an issue recently that I can't find much information on. I have a bunch of delta sequential jobs, each of which runs about 6 other jobs, all with different invocation IDs.

I've been load testing these delta sequential jobs, running only around 8 of them at the same time, whereas in production there will be upwards of 30. However, about 90% of the time one of these 8 jobs will fail, because one of the jobs within the sequence fails. Which sequence fails, and which job within it, is completely random. I've also run into this error running a single sequence. The logs give me the following lines:

OPENSEQ '\\.\pipe\Application-RT_SC274-App_Splunk_Message.CUSTOMER_NATL' called: 10:39:36 02 OCT 2014

It repeats this OPENSEQ message about every second, usually for almost exactly 2 minutes. "Application" is our project name, and "App_Splunk_Message.CUSTOMER_NATL" is the job that failed; I'm not sure what the rest of the string is.

Then afterwards it gives me the following error:

Error setting up internal communications (fifo \\.\pipe\Application-RT_SC274-App_Splunk_Message.CUSTOMER_NATL) STATUS() 2
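For anyone trying to picture what the log shows: it looks like a retry loop that polls for the pipe once a second and gives up after two minutes. Here is a minimal illustrative sketch of that pattern in Python — this is NOT DataStage's actual internals, just the behavior the repeated OPENSEQ messages and the final STATUS() 2 ("file not found") suggest:

```python
import time

# Pipe path copied from the failing job's log.
PIPE_PATH = r"\\.\pipe\Application-RT_SC274-App_Splunk_Message.CUSTOMER_NATL"

def open_fifo_with_retry(path, timeout_s=120, interval_s=1.0):
    """Try to open `path` once per interval until it opens or we time out.

    Mirrors the log: one OPENSEQ attempt per second for about two minutes,
    then a hard failure (the equivalent of DataStage's STATUS() 2).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return open(path, "rb")
        except OSError as exc:
            if time.monotonic() >= deadline:
                # The pipe never became available within the timeout.
                raise FileNotFoundError(
                    f"fifo {path} never became available"
                ) from exc
            time.sleep(interval_s)
```

The point of the sketch is that the error is not raised on the first failed open: the engine keeps polling, so whatever should have created the pipe simply never did within the window.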


The only real resource I found online about this issue is here
http://www-01.ibm.com/support/docview.w ... wg21445893

Our admin has confirmed it's not virus scans, and there is plenty of disk space available while these jobs are running. Any more input or ideas are much appreciated!

Thanks,
Taylor

Posted: Thu Oct 02, 2014 2:00 pm
by ray.wurlod
Could it be that the multiple instances are all (or some of them) trying to access file \\.\pipe\Application-RT_SC274-App_Splunk_Message.CUSTOMER_NATL at the same time? The operating system only allows one writer at a time.

Posted: Fri Oct 03, 2014 6:25 am
by taylor.hermann
Although I can't see all the content, I can guess what you're getting at. But this error has also happened running a single sequential job, so I don't think it's a limitation on accessing the file. The chances of getting this error are just a lot lower when running one job.

Posted: Fri Oct 03, 2014 7:08 am
by chulett
Interesting. I was assuming, much like Ray, that this happened when multiple instances were stepping on each other. But if it can happen while the job runs in isolation that's a whole 'nuther kettle of fish.

I believe that a STATUS of 2 means "file not found". If you were on a UNIX server I'd suggest making sure your open-files limit was high enough, but I have no clue what the equivalent would be on Windows. I'd involve your official support provider on this one.
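For reference, the UNIX-side check mentioned above is a one-liner. This sketch uses Python's `resource` module, which exists only on UNIX-like systems — which is exactly the catch for a Windows server:

```python
import resource

# Query the per-process limit on open file descriptors (UNIX only).
# soft = currently enforced limit, hard = ceiling the soft limit can be raised to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft limit={soft}, hard limit={hard}")
```

If the soft limit is low (e.g. 1024) and many job instances each hold pipes and log files open, raising it is the usual first step on UNIX; on Windows the equivalent tuning lives elsewhere, hence the suggestion to involve support.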

Posted: Fri Oct 03, 2014 7:18 am
by taylor.hermann
Yeah, we are currently working on that now too. There's been some talk that our environment may not be set up properly.

Posted: Fri Oct 03, 2014 8:29 am
by qt_ky
Try going through this Technote too:

http://www-01.ibm.com/support/docview.w ... wg21460111

Posted: Fri Oct 03, 2014 9:21 am
by chulett
Was going to link to that one as well, but it is so UNIX-centric that I decided to mostly stick with this, from the wrap-up paragraph:

"If the above tests do not isolate the cause of file system i/o problem, then it may be necessary to contact Information Server support for assistance in performing a system trace (truss or strace) of the dsapi process launching the failing jobs to track down the actual OS operations which are failing."

Posted: Fri Oct 03, 2014 2:08 pm
by taylor.hermann
We worked with an experienced consultant today, and he narrowed it down to a process on our server that was causing this issue: something called "sh.exe" is randomly breaking and failing jobs with this error. We still have yet to determine why it's happening.

As a side note, we worked with IBM before to fix another timeout issue, and their solution was to set "APT_PM_USE_STANDALONE_EXE = 1". This was supposed to avoid the shell, and it resolved the immediate issue.

However, we assumed that sh.exe would no longer be getting called, yet it's getting called somehow.
My question now is: does anyone know a way to completely avoid calling this "sh.exe" process, or why it randomly breaks jobs when it is called?

Posted: Sat Oct 04, 2014 6:26 am
by qt_ky
The best way to avoid it is to run DataStage on UNIX. :lol:

Seriously though, I do not know.

Posted: Mon Oct 06, 2014 7:20 am
by taylor.hermann
Well, I appreciate the time everyone has taken to try and help!

Posted: Thu Oct 23, 2014 8:21 am
by taylor.hermann
For purposes of updating this post with the solution:

We found that an MKS Toolkit file (mkstk.dll) in system32 was showing up as unregistered by Windows. Now that we have registered that .dll, the errors seem to have vanished. IBM told us this was probably because our servers were not connected to the internet (and still aren't) when DataStage was installed, so the file never got registered.
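For anyone who lands here with the same symptom: registering a DLL is done with the standard Windows regsvr32 tool, run from an elevated command prompt. The path below assumes the stock system32 location mentioned above:

```shell
regsvr32 C:\Windows\System32\mkstk.dll
```

A small confirmation dialog (or an error) tells you whether the DllRegisterServer call succeeded.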