Fatal Error: Fork failed - Previous posts didn't help much

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

thamark
Premium Member
Posts: 43
Joined: Thu Jan 29, 2004 12:12 am
Location: US

Fatal Error: Fork failed - Previous posts didn't help much

Post by thamark »

Hi All,

I have tried all the available posts regarding the error I am getting, but none of them solved this problem.

We get this error when we have a job with 300 or more operators in it.

node_node7: Fatal Error: Unable to start ORCHESTRATE process on node node7 (etld01): APT_PMPlayer::APT_PMPlayer: fork() failed, Not enough space
node_node4: Fatal Error: Unable to start ORCHESTRATE process on node node4 (etld01): APT_PMPlayer::APT_PMPlayer: fork() failed, Not enough space

I verified that we have enough physical disk space to run the job, and we have enough memory (32 GB) as well.

Here is the output from ulimit:

time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 8192
coredump(blocks) unlimited
nofiles(descriptors) 200000
vmemory(kbytes) unlimited
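
For reference, the figures above come from ulimit -a run as the user that executes the DataStage jobs (dsadm below is only a placeholder for that user name):

Code: Select all

su - dsadm -c "ulimit -a"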

This error happens as part of the startup process; we get it before the job even starts processing the first record.

We have not changed our uvconfig file or any other environment variables. Any suggestions from your prior experience with this issue are welcome.

We do get an error message at the end indicating that a network error could be a possible cause, but I am not sure what details to ask our network admin for.

Any help on this issue is appreciated.

Thanks & Regards
Kannan
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

300+ operators in a single job? Are you sure you wouldn't be better served by multiple jobs? Division of labor is not a bad thing. Sometimes the bigger the job, the harder it is to maintain. And the performance may suffer to a point where 1 job is not necessarily running better than multiple. Just a thought to consider.

Now, as I am sure you are aware from searching the other postings, this error is usually related to memory or scratch space. How do you know how much your job needs vs. what has been allocated?

I am guessing that with 300+ operators you must have a significant number of sorts, aggregations, partitioners, joins, merges, etc. These all use scratch space. The more operators you have that use scratch space, the more space you need. 100 MB of data going in may translate into hundreds of MB of required scratch space, depending on how many operators you have.
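
One way to see whether scratch space is the culprit is to watch the scratch file systems while the job starts up. A rough sketch (/ds/scratch is only a placeholder; use whatever paths your APT configuration file lists under resource scratchdisk):

Code: Select all

# Watch free space on the scratch disk every 10 seconds while the job runs
# /ds/scratch is a placeholder for your resource scratchdisk path
while true
do
    df -k /ds/scratch
    sleep 10
done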

Are there joins that would be better as lookups? Lookups use memory instead of scratch space and help avoid a partitioner (also reducing scratch space). I had a job that had to look up 5-6 different sets of codes and was originally written with joins. It worked in production, but would fail with space issues (like yours) in our dev environment. I converted them to lookups and not only got rid of the space issues, but it ran in a fraction of the time.

Brad.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Solid piece of advice from Brad there. Split up your jobs; your system can only support so many operations at any given time.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
kris
Participant
Posts: 160
Joined: Tue Dec 09, 2003 2:45 pm
Location: virginia, usa

Re: Fatal Error: Fork failed - Previous posts didn't help much

Post by kris »

We get this error when we have a job with 300 or more operators in it.

node_node7: Fatal Error: Unable to start ORCHESTRATE process on node node7 (etld01): APT_PMPlayer::APT_PMPlayer: fork() failed, Not enough space
node_node4: Fatal Error: Unable to start ORCHESTRATE process on node node4 (etld01): APT_PMPlayer::APT_PMPlayer: fork() failed, Not enough space


300 operators seems like a huge number to me for a single job. However, a fork() failure can occur either when the server is overloaded or when the number of processes per user (MAXUPROC) has exceeded its limit.
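
If you want to check this yourself, compare the kernel's per-user process limit against what the job actually spawns. A rough sketch for a Solaris box (the commands and the dsadm user name are only examples; AIX and Linux expose the same limit differently):

Code: Select all

# Show the configured per-user process limit (Solaris)
sysdef | grep v_maxup

# Count the processes currently owned by the DataStage user (dsadm is a placeholder)
ps -u dsadm | wc -l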

What has always helped me in the past is to get to know the capabilities of the servers (DEV/PROD) and design jobs accordingly.

Best regards,
Kris~
thamark
Premium Member
Posts: 43
Joined: Thu Jan 29, 2004 12:12 am
Location: US

Thanks

Post by thamark »

Thank you for coming up with various solutions.

I am sorry, it is not 300 or more operators; it is actually the number of processes.

We will go down the path of splitting the job if we don't find a solution for this.

All we are trying to do here is the following:

We get a file which is the source data for 6 tables, and it comes with complete data, so we do CDC. These 6 tables have dependencies between them, so we merge them back into the main flow to get all the tables loaded.

The following is what I have done so far.

I used

Code: Select all

 lsof 
and

Code: Select all

ps -eu
to identify the number of processes and open files during execution, and it never went past the limits set on the server or at the user level.
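
A simple way to capture those counts during startup (dsadm and the log path below are only placeholders) is something like:

Code: Select all

# Sample process and open-file counts every 5 seconds while the job starts up
# dsadm and /tmp/px_counts.log are placeholders
while true
do
    echo "`date` procs=`ps -u dsadm | wc -l` files=`lsof -u dsadm | wc -l`" >> /tmp/px_counts.log
    sleep 5
done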

I was watching

Code: Select all

top
as well, and it showed the system more than 80% idle:
CPU states: 93.2% idle, 3.2% user, 3.7% kernel, 0.0% iowait, 0.0% swap
Memory: 32G phys mem, 30G free mem, 10G swap, 10G free swap

The figures above don't change much.

This job also contains two parallel shared containers and three local shared containers.

After going through all this data, I am not sure whether this is simply a problem of one big job or whether we are missing some configuration on our side.

Please do send your thoughts and any other info.
ds_user78
Participant
Posts: 23
Joined: Thu Nov 11, 2004 5:39 pm

Post by ds_user78 »

Does the same job run with a single-node configuration file?

We also had a similar error, and when we tried a config file with fewer nodes it started working. Something to do with the number of UNIX processes, I think.
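
For anyone who has not tried it before, a single-node configuration file is just the usual APT_CONFIG_FILE layout cut down to one node, something like the sketch below (the fastname and the disk/scratchdisk paths are placeholders to replace with your own):

Code: Select all

{
    node "node1"
    {
        fastname "etld01"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
}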
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

The 'not enough space' message can be somewhat ambiguous at times, making this problem a bit more difficult to debug.

The advice to test it with a 1-node configuration file is sound; if the error remains, then you possibly have causes other than actual physical disk or memory space.

You should look at iostat or glance (or your tool of choice) to monitor disk activity; this is most likely going to be the bottleneck, since your CPU and memory aren't being exercised much.
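
For example, something along these lines gives a rolling view of disk activity while the job starts (the -x extended-statistics flag is available on Solaris and Linux; adjust for your platform):

Code: Select all

# Extended disk statistics every 5 seconds, 60 samples
iostat -x 5 60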

The buffering mechanism in PX will also start landing interim data to disk between stages when there are differences in speed and the buffers start filling up. So even in a job with many operators but little or no repartitioning or sorting, you might still end up using your temp and scratch space; please try to monitor that while executing this job to see if you might be reaching limits there.

The space message occurs during the fork() call, which points towards memory allocation rather than disk. Could you put in APT_DUMP_SCORE to see how many actual processes get started (again, best with a 1-node configuration to start with)?
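
APT_DUMP_SCORE is just an environment variable; you can add it as a job parameter or, for a quick test when launching from the command line, export it before the run. Counting the osh player processes while the job starts gives the same information from the operating system side (dsadm is a placeholder user name):

Code: Select all

# Make the PX score (operators, players, datasets) appear in the job log
export APT_DUMP_SCORE=True

# Count the osh player processes the job actually forks (dsadm is a placeholder)
ps -fu dsadm | grep -c "[o]sh"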

If you don't make any progress, it might also help if you could post your configuration file. The .uvconfig settings are most likely not germane to this problem, but it would be helpful to know the APT_CONFIG file layout and perhaps know which platform you are running on.

Also, what sort of a network is involved here (referring to your mention of a possible network error)? Are the disks and database local to the PX server?
thamark
Premium Member
Posts: 43
Joined: Thu Jan 29, 2004 12:12 am
Location: US

I am sorry for the delayed response

Post by thamark »

I am sorry for the delayed response.

I was trying to pinpoint this issue, so I created a job with 128 stages, and:

1) The job ran successfully with a 4-node configuration.

2) The job failed when I ran it with an 8-node configuration.

While monitoring, the job used practically nothing except for the number of processes it created.

IBM suggested that we increase the swap from 10 GB to 50 GB. I thought it was not going to help, since the successful run took no memory, but I was wrong.

The job ran successfully after increasing the swap to 50 GB.

As per IBM, the job estimates how much memory it needs while initiating, even though it is not actually going to use it. Anyway, this issue is resolved now.
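
For anyone hitting the same thing: this is consistent with how Solaris (which our top output suggests we are on) handles fork(); it does not overcommit memory, so every fork() has to reserve swap for the child's address space up front, and a large parent process forking many players can exhaust swap reservations even while physical memory sits idle. The commands below are only a sketch of how swap can be checked and extended on Solaris (the swap file path and size are placeholders):

Code: Select all

# Show configured swap devices and overall swap usage (Solaris)
swap -l
swap -s

# One way to add swap: create a swap file and activate it (path and size are placeholders)
mkfile 10240m /export/swapfile1
swap -a /export/swapfile1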

Please let me know if you guys have more information on this. I am confused here...
harshada
Premium Member
Posts: 92
Joined: Tue May 29, 2007 8:40 am

Post by harshada »

We once got a fork failed error similar to what you have mentioned. The problem was that the MAXUPROC value had reached its limit. The system engineers had to reset this value on the UNIX box and restart it.
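
How that value gets raised depends on the platform. As one hedged example, on Solaris it is the maxuprc kernel tunable set in /etc/system, and the box needs a reboot for the change to take effect (the number below is only illustrative):

Code: Select all

* /etc/system entry to raise the per-user process limit on Solaris
* (the value 8192 is only an example; a reboot is required)
set maxuprc = 8192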