
ETL errors while running batch schedule

Posted: Tue Feb 28, 2006 11:48 am
by lakshya
Hi-

We are getting the following errors while running our batch schedule. Our batch runs group-wise based on dependencies, with a bunch of jobs that kick off at the same time. The jobs run fine when run individually, but as a group they start throwing the different errors listed below. Is there any limit to the number of jobs we can initiate at the same time, or is there some other issue with the jobs?

1 : main_program: Fatal Error: Service table transmission failed for node1

2 : (ps): Broken pipe. This may indicate a network problem. Setting the
environment variable APT_PM_CONDUCTOR_TIMEOUT to a larger value (when
unset, it defaults to 60) may alleviate this problem

3 : Wd. (fatal error from ): Error executing phantom command =>
DSD.OshMonitor record has been created in the '&PH&' file.
Unable to create PHANTOM process.

4 : node1: Fatal Error: Unable to start ORCHESTRATE process on node
node1 (ps): APT_PMPlayer::APT_PMPlayer: fork() failed, Resource
temporarily unavailable

5 : main_program: **** Parallel startup failed ****
This is usually due to a configuration error, such as
not having the Orchestrate install directory properly
mounted on all nodes, rsh permissions not correctly
set (via /etc/hosts.equiv or .rhosts), or running from
a directory that is not mounted on all nodes. Look for
error messages in the preceding output.

6 : main_program: Unable to contact one or more Section Leaders.
Probable configuration problem; contact Orchestrate system administrator.

Please let us know if we need to modify something on our side to help resolve this issue.

Thanks

Posted: Tue Feb 28, 2006 12:10 pm
by ray.wurlod
Looks like your system can't handle the total load, as indicated by the "Unable to create PHANTOM process" message. "PHANTOM" is just DataStage terminology for a background process. Involve your UNIX administrator to check the size of the process table and the per-user limit on the number of processes.
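For example (a rough sketch only; option letters and column layouts vary between shells and UNIX flavours), you can compare each user's process count against the per-user limit with something like:

$ ps -ef | awk '{print $1}' | sort | uniq -c | sort -rn | head   # processes per user
$ ulimit -u                                                      # max user processes (bash)

If the DataStage user's count approaches the limit while the batch runs, that is the likely culprit.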

Posted: Tue Feb 28, 2006 6:35 pm
by DSguru2B
I have run into this problem a lot. It happens when the box configuration is not modified; meaning, the cache space for OSH should be increased from its default (probably 256 MB) to 1 GB, which should be enough. This should be done by the UNIX admin.
Ray, please comment or correct me.
Thanks

Posted: Tue Feb 28, 2006 10:04 pm
by ray.wurlod
I don't believe that insufficient cache space for OSH would lead to the "broken pipe" error that was reported. This looks much more like a timeout, possibly caused by too many processes on the machine (and therefore too long a wait to start another process).

Posted: Wed Mar 01, 2006 6:30 am
by kumar_s
It seems you have already altered APT_PM_CONDUCTOR_TIMEOUT in the Administrator to avoid timeouts, yet you still get the error mentioned in point 4.
It also seems the conductor process cannot reach the section leader process on each processing node. That clearly means your server is overloaded.
Try to reduce the number of jobs called in parallel in the job sequence.
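For anyone who has not set it yet: it can be defined project-wide as an environment variable in the Administrator, or exported in $DSHOME/dsenv before the server starts. A sketch (300 is just an example value; the unit is seconds):

$ export APT_PM_CONDUCTOR_TIMEOUT=300   # default is 60 when unset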

Posted: Fri Mar 03, 2006 8:18 am
by lakshya
Hi All-

Thanks for your responses on the topic. The issue is resolved now.

We increased the number of processes allowed per user on the UNIX box from the existing 500 to a higher limit, which was sufficient to handle all the processes kicked off by the ETL jobs.

The batch finished successfully after the fix.

Thanks

Posted: Sat Mar 04, 2006 4:33 am
by kumar_s
lakshya wrote: We increased the number of processes allowed per user on the UNIX box from the existing 500 to a higher limit, which was sufficient to handle all the processes kicked off by the ETL jobs.
Hi,
May I know the command used to find the maximum number of processes allowed per user, and the command to increase it?

Posted: Sat Mar 04, 2006 7:48 am
by ray.wurlod
It's usually a UNIX kernel parameter named something like NPROC. But the name varies on different UNIXes.
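A sketch of where to look (names and tools differ per platform and release, so treat these as examples, not gospel):

$ lsattr -El sys0 -a maxuproc      # AIX: show the per-user process limit
$ chdev -l sys0 -a maxuproc=2000   # AIX: raise it (as root)
# Solaris: set maxuprc=... in /etc/system, then reboot
# Linux: nproc entries in /etc/security/limits.conf

Your SA will know which one applies to your box.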

Posted: Sat Mar 04, 2006 8:00 am
by chulett
In other words, have a chat with an SA. :wink:

Posted: Sat Mar 04, 2006 9:06 pm
by ray.wurlod
The original error also mentioned that a file had been created in the &PH& directory in your project (on the server). Is there any useful diagnostic information in that file?
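To check, something along these lines (the project path is hypothetical; quote the directory name, since & is special to the shell):

$ cd /path/to/your/project/'&PH&'
$ ls -t | head -1              # most recent phantom log
$ more "`ls -t | head -1`"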

Posted: Mon Mar 06, 2006 8:31 am
by DSguru2B
One small tip to avoid these issues: PLEASE log off gracefully from DataStage or any DB services.
Otherwise processes are left hanging under each user, increasing the process load. :D
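To see what is left hanging, something like this can help (dsapi_slave is the client connection process; phantom processes show up with "phantom" in their command line, though names may vary by release):

$ ps -ef | grep -E 'phantom|dsapi_slave' | grep -v grep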

Posted: Tue Apr 08, 2008 4:29 am
by sunayan_pal
I guess it is purely a resource problem; in my case 100% of the jobs run successfully on a re-run.
But what gets written in &PH&? Please suggest.

Posted: Wed Aug 27, 2008 8:36 am
by Nagaraj
I want to get rid of the phantom log files in the &PH& directory.

This is what is shown in the &PH& files after each job run:

[User@hostname &PH&]$ more DSD.RUN_37693_14850_558136
DataStage Job 337 Phantom 7978
The variable "APT_PERFORMANCE_DATA" is not in the environment.
DataStage Phantom Finished.
[User@hostname &PH&]$

This "APT_PERFORMANCE_DATA variable is there.

Posted: Wed Aug 27, 2008 3:22 pm
by ray.wurlod
This question is not related to the subject of this thread. Please begin a new thread.