ETL errors while running batch schedule
Moderators: chulett, rschirm, roy
Hi-
We are getting the following errors while running our batch schedule. Our batch runs group-wise based on dependencies, with a bunch of jobs that kick off at the same time. The jobs run fine when run individually, but as a group they start throwing the different errors listed below. Is there a limit on the number of jobs we can initiate at the same time, or is there some other issue with the jobs?
1 : main_program: Fatal Error: Service table transmission failed for node1
2 : (ps: Broken pipe. This may indicate a network problem. Setting APT_PM_CONDUCTOR_TIMEOUT to a
larger value (when unset, it defaults to 60) may alleviate this problem
3 : Wd. (fatal error from ): Error executing phantom command =>
DSD.OshMonitor record has been created in the '&PH&' file.
Unable to create PHANTOM process.
4 : node1: Fatal Error: Unable to start ORCHESTRATE process on node
node1 (ps): APT_PMPlayer::APT_PMPlayer: fork() failed, Resource
temporarily unavailable
5 : main_program: **** Parallel startup failed ****
This is usually due to a configuration error, such as
not having the Orchestrate install directory properly
mounted on all nodes, rsh permissions not correctly
set (via /etc/hosts.equiv or .rhosts), or running from
a directory that is not mounted on all nodes. Look for
error messages in the preceding output.
6 : main_program: Unable to contact one or more Section Leaders.
Probable configuration problem; contact Orchestrate system administrator.
7 : node1: Fatal Error: Unable to start ORCHESTRATE process on node
node1 (ps): APT_PMPlayer::APT_PMPlayer: fork() failed, Resource
temporarily unavailable
Please let us know if we need to modify something on our side to help resolve this issue.
Thanks
Looks like your system can't handle the total load, as indicated by the "Unable to create PHANTOM process" message. "PHANTOM" is just DataStage terminology for a background process. Involve your UNIX administrator to check the size of the process table and the per-user limit on the number of processes.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I don't believe that insufficient cache space for OSH would lead to the "broken pipe" error that was reported. This looks much more like a timeout, possibly caused by too many processes on the machine (and therefore too long a wait to start another process).
It seems you have already altered APT_PM_CONDUCTOR_TIMEOUT in Administrator to avoid timeouts.
Still you get the error mentioned in point 4.
It also seems the conductor process cannot reach the section leader process on each processing node. That clearly means your server is overloaded.
Try to reduce the number of jobs called in parallel in the job sequence.
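For anyone wanting to set the timeout outside of Administrator: the variable can also be exported in the engine's environment file. This is only a sketch; the dsenv location and the value 300 are assumptions, not recommendations.

```shell
# Assumed location: the engine's dsenv file (often $DSHOME/dsenv).
# Raise the conductor startup timeout from its 60-second default;
# 300 seconds is an example value, not a recommendation.
APT_PM_CONDUCTOR_TIMEOUT=300
export APT_PM_CONDUCTOR_TIMEOUT
```

Jobs started after the engine picks up the new dsenv will inherit the value.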
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Hi All-
Thanks for your responses on the topic. The issue is resolved now.
We increased the number of processes allowed per user on the UNIX box from the existing 500 to a higher limit, which was sufficient to handle all the processes kicked off by the ETL jobs.
The batch finished successfully after the fix.
Thanks
Hi,
lakshya wrote:
Hi All-
Thanks for your responses on the topic.The issue is resolved now.
We have increased the number of processes allowed per user on the unix box from the existing 500 and increased it to a higher limit which was sufficient to handle all the processes kicked off by the ETL's.
The batch finished successfully after the fix
Thanks
May I know the command used to find the maximum number of processes allowed per user, and the command to increase it?
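On most UNIX/Linux shells the per-user process limit can be inspected with the ulimit builtin; a minimal sketch (the raise command is shown as a comment because the value 4096 is only an example and cannot exceed the hard limit for a non-root user):

```shell
# Show the current soft limit on the number of processes per user
soft=$(ulimit -u)
echo "soft limit: $soft"

# Show the hard limit (the ceiling a non-root user may raise the soft limit to)
hard=$(ulimit -Hu)
echo "hard limit: $hard"

# To raise the soft limit for the current shell session (example value):
#   ulimit -u 4096
```

To persist a higher limit, on Linux this is typically set in /etc/security/limits.conf (e.g. a `nproc` entry for the DataStage user), while on AIX the system-wide maxuproc attribute is changed with `chdev -l sys0 -a maxuproc=<n>`. Check with your UNIX administrator for your platform's mechanism.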
The original error also mentioned that a file had been created in the &PH& directory in your project (on the server). Is there any useful diagnostic information in that file?
Want to get rid of the phantom log files in the &PH& directory.
This is what is shown in these files after each job run:
[User@hostname &PH&]$ more DSD.RUN_37693_14850_558136
DataStage Job 337 Phantom 7978
The variable "APT_PERFORMANCE_DATA" is not in the environment.
DataStage Phantom Finished.
[User@hostname &PH&]$
This "APT_PERFORMANCE_DATA" variable is there, though.
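One common way to keep the &PH& directory from growing is to periodically delete old phantom log files with find. This is only a sketch: the project path in the example is hypothetical, and you should only purge files from runs that are no longer active.

```shell
# Purge phantom log files older than a given number of days from a
# project's &PH& directory. The directory path is site-specific;
# pass your own project path.
purge_ph() {
    dir="$1"
    days="$2"
    # -mtime +N matches files last modified more than N*24 hours ago
    find "$dir" -type f -mtime +"$days" -print -delete
}

# Example (hypothetical project path):
# purge_ph "/opt/IBM/InformationServer/Server/Projects/MyProject/&PH&" 7
```

Note that this deletes only the files inside &PH&, never the directory itself, which DataStage expects to exist.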