ETL errors while running batch schedule

Post questions here related to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

lakshya
Participant
Posts: 19
Joined: Fri Jan 21, 2005 2:39 pm

ETL errors while running batch schedule

Post by lakshya »

Hi-

We are getting the following errors while running our batch schedule. Our batch runs group-wise based on dependencies, where a bunch of jobs kick off at the same time. The jobs run fine when run individually, but as a group they start throwing the assorted errors listed below. Is there a limit on the number of jobs we can initiate at the same time, or is there some other issue with the jobs?

1 : main_program: Fatal Error: Service table transmission failed for node1

2 : (ps): Broken pipe. This may indicate a network problem. Setting the environment
variable APT_PM_CONDUCTOR_TIMEOUT to a larger value (when unset, it defaults to 60)
may alleviate this problem

3 : Wd. (fatal error from ): Error executing phantom command =>
DSD.OshMonitor record has been created in the '&PH&' file.
Unable to create PHANTOM process.

4 : node1: Fatal Error: Unable to start ORCHESTRATE process on node
node1 (ps): APT_PMPlayer::APT_PMPlayer: fork() failed, Resource
temporarily unavailable

5 : main_program: **** Parallel startup failed ****
This is usually due to a configuration error, such as
not having the Orchestrate install directory properly
mounted on all nodes, rsh permissions not correctly
set (via /etc/hosts.equiv or .rhosts), or running from
a directory that is not mounted on all nodes. Look for
error messages in the preceding output.

6 : main_program: Unable to contact one or more Section Leaders.
Probable configuration problem; contact Orchestrate system administrator.

7 : node1: Fatal Error: Unable to start ORCHESTRATE process on node
node1 (ps): APT_PMPlayer::APT_PMPlayer: fork() failed, Resource
temporarily unavailable

Please let us know if we need to modify something on our side to help resolve this issue.

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Looks like your system can't handle the total load, as indicated by the "Unable to create PHANTOM process" message. "PHANTOM" is just DataStage terminology for a background process. Involve your UNIX administrator to check the size of the process table and the per-user limit on the number of processes.
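For example, a quick health check looks something like this (exact flags vary by UNIX flavour, so treat it as a sketch; dsadm is an example user):

ps -ef | wc -l              # how many processes are in the table right now
ps -fu dsadm | wc -l        # how many belong to the DataStage user
ulimit -a                   # per-user limits for the current shell, including max processes

If the second number gets anywhere near the per-user process limit when the batch kicks off, you've found your culprit.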
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

I have run into this problem a lot. It happens when the box configuration has not been tuned; meaning, the cache space for OSH should be increased from its default (probably 256MB) to 1GB, which should be enough. This has to be done by the UNIX admin.
Ray, please comment or correct me.
Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I don't believe that insufficient cache space for OSH would lead to the "broken pipe" error that was reported. This looks much more like a timeout, possibly caused by too many processes on the machine (and therefore too long a wait to start another process).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

It seems you have already raised APT_PM_CONDUCTOR_TIMEOUT in the Administrator to avoid timeouts,
yet you still get the error mentioned in point 4.
It also seems the conductor process cannot reach the section leader process on each processing node. That clearly means your server is overloaded.
Try to reduce the number of jobs called in parallel in the job sequence.
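If you do need to raise the timeout server-wide rather than per project, it can be exported in the dsenv file, something like this ($DSHOME and the value 120 are just examples for your admin to adapt):

# In $DSHOME/dsenv - give the conductor longer to hear from the section leaders
APT_PM_CONDUCTOR_TIMEOUT=120
export APT_PM_CONDUCTOR_TIMEOUT

Jobs pick the new value up once the DataStage server processes have been restarted with the updated dsenv.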
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
lakshya
Participant
Posts: 19
Joined: Fri Jan 21, 2005 2:39 pm

Post by lakshya »

Hi All-

Thanks for your responses on the topic. The issue is resolved now.

We increased the number of processes allowed per user on the UNIX box from the existing 500 to a higher limit, which was sufficient to handle all the processes kicked off by the ETL jobs.

The batch finished successfully after the fix.
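For anyone hitting the same wall, the current per-user limit is easy to check before calling the admin (ksh/bash syntax; the option letter can differ on other shells):

ulimit -u    # maximum number of processes the current user may create

If the batch spawns more processes than that number, you'll see the fork() failures above.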

Thanks
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

lakshya wrote: We increased the number of processes allowed per user on the UNIX box from the existing 500 to a higher limit, which was sufficient to handle all the processes kicked off by the ETL jobs. The batch finished successfully after the fix.
Hi,
May I know the command used to find the maximum number of processes allowed per user, and the command to increase it?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It's usually a UNIX kernel parameter named something like NPROC. But the name varies on different UNIXes.
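A few illustrations of where that knob lives (the names and files are platform-specific, so check your own system's documentation):

# Linux: per-user limit set in /etc/security/limits.conf, e.g.
#   dsadm  hard  nproc  2000
ulimit -u                          # verify the effective limit

# AIX: it's a system attribute
lsattr -El sys0 -a maxuproc        # show the current value
chdev -l sys0 -a maxuproc=2000     # raise it (as root)

# Solaris: set maxuprc in /etc/system, e.g.
#   set maxuprc=2000
# then reboot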
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

In other words, have a chat with an SA. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The original error also mentioned that a file had been created in the &PH& directory in your project (on the server). Is there any useful diagnostic information in that file?
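Something along these lines will get you to the newest one (the project path is just an example):

cd /path/to/project/'&PH&'     # substitute your project directory; quote the & characters
ls -t | head -5                # the five most recently written phantom logs
more DSD.RUN_xxxxx             # view the newest (the name here is a placeholder)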
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

One small tip to avoid these issues: PLEASE log off gracefully from DataStage and any database sessions.
Otherwise the processes are left hanging under each user, increasing the process load. :D
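You can spot the stragglers with something like this:

# Each connected DataStage client runs a dsapi_slave process on the server;
# any that remain after everyone has logged off are the hung sessions
ps -ef | grep dsapi_slave | grep -v grep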
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
sunayan_pal
Participant
Posts: 49
Joined: Fri May 11, 2007 12:24 am
Location: kolkata

Post by sunayan_pal »

I guess it is purely a resource problem; in my case 100% of the jobs run successfully on a re-run.
But what gets written in &PH&? Please suggest.
regards
sunayan
Nagaraj
Premium Member
Premium Member
Posts: 383
Joined: Thu Nov 08, 2007 12:32 am
Location: Bangalore

Post by Nagaraj »

Want to get rid of the phantom files in the &PH& directory.

This is what is shown in the phantom files after each job run:

[User@hostname &PH&]$ more DSD.RUN_37693_14850_558136
DataStage Job 337 Phantom 7978
The variable "APT_PERFORMANCE_DATA" is not in the environment.
DataStage Phantom Finished.
[User@hostname &PH&]$

This "APT_PERFORMANCE_DATA variable is there.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

This question is not related to the subject of this thread. Please begin a new thread.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.