File descriptor out of range in fd_set

Bill_G
Premium Member
Posts: 74
Joined: Thu Oct 20, 2005 9:34 am

File descriptor out of range in fd_set

Post by Bill_G »

I have a job that was running just fine on a single 4-way machine. Then, last week, I decided to add a second 4-way machine and built a clustered environment (not sure whether the issue is related, but I thought it worth mentioning).

I received the following errors on both the master and second node:

Item #: 64
Event ID: 281
Timestamp: 2007-06-28 04:15:14
Type: Fatal
User Name: dsadm
Message: node_node2: Fatal Error: File descriptor out of range in fd_set (requested 1,025, limit 1,023)

Item #: 65
Event ID: 282
Timestamp: 2007-06-28 04:15:14
Type: Fatal
User Name: dsadm
Message: main_program: The Section Leader on node node2 has terminated unexpectedly.

Item #: 66
Event ID: 283
Timestamp: 2007-06-28 04:15:14
Type: Fatal
User Name: dsadm
Message: node_node0: Fatal Error: File descriptor out of range in fd_set (requested 1,024, limit 1,023)


I have switched back to the single-server configuration file and attempted to run, but received the same error. I have interrogated the SA, and he claims not to have changed anything on the master server.

Any ideas? I am sure this is a parameter, but I am at a loss right now.

Thanks in advance
Bill_G
Premium Member
Posts: 74
Joined: Thu Oct 20, 2005 9:34 am

Post by Bill_G »

Some more information about the job...

It is a PX job: it reads from Oracle, performs approximately 15 normal and sparse lookups directly against Oracle, uses a Merge, and then writes to a data set.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I wonder if you are hitting a configured UNIX limit on file descriptors - what platform are you running on? If you run it with fewer nodes on a single server, does the error persist?
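
For reference, here is a minimal C sketch (not DataStage-specific, just standard POSIX getrlimit) that prints the per-process open-file limits a process actually inherits, which can differ from what ulimit reports in an interactive shell:

Code: Select all

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_NOFILE is the per-process cap on open file descriptors;
       the soft limit (rlim_cur) is the one that normally bites. */
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("soft limit on open files: %llu\n", (unsigned long long)rl.rlim_cur);
    printf("hard limit on open files: %llu\n", (unsigned long long)rl.rlim_max);
    return 0;
}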
philolaos
Premium Member
Posts: 11
Joined: Wed Sep 10, 2003 7:38 am
Location: Montreal, Canada

Post by philolaos »

I'm running into a similar "file descriptor" problem.

Code: Select all

main_program: Fatal Error: File descriptor out of range in fd_set (requested 1312, limit 1023)
I'm running DataStage 8.1 on a RedHat Linux Advanced Server 4 using glibc 2.3.4.

My job showing the problem is pretty simple: it reads data from a dataset and loads it directly into Oracle through the PX operator orawrite with RCP enabled.

The problem only arises when I use a sufficiently big data structure layout (more columns, not more data). My problem is not related to the dataset size or the number of nodes.

I guess it is because my Oracle library (liborchoracle10gi686.so), which I am using through the orawrite operator, was compiled when the GNU C Library's arbitrary limit on the size of an fd_set object was 1024 (set through the directive FD_SETSIZE=1024).
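
To illustrate in C what that compile-time limit means (a minimal sketch assuming glibc's usual FD_SETSIZE of 1024; the helper name and the numbers are only illustrative, not the engine's actual code):

Code: Select all

#include <stdio.h>
#include <sys/select.h>

/* FD_SET() on a descriptor >= FD_SETSIZE is undefined behaviour, so a
   defensively written program has to check the bound first; an error
   like the one in this thread corresponds to such a check failing. */
static int checked_fd_set(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE) {
        fprintf(stderr,
                "File descriptor out of range in fd_set (requested %d, limit %d)\n",
                fd, FD_SETSIZE - 1);
        return -1;
    }
    FD_SET(fd, set);
    return 0;
}

int main(void)
{
    fd_set fds;
    FD_ZERO(&fds);

    checked_fd_set(100, &fds);   /* fine: well below FD_SETSIZE */
    checked_fd_set(1312, &fds);  /* rejected: beyond the 1023 ceiling */
    return 0;
}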

The only thing that I still need to clarify is...

is this liborchoracle10gi686.so file the result of a compilation run behind the scenes during the installation of Information Server, picking up my system's FD_SETSIZE at that moment,

OR

was this file copied as-is from the installation package?

Bill_G, how did you solve your problem?

Any new ideas?
Stéphane Poirier
SHP Consult SA
philolaos
Premium Member
Posts: 11
Joined: Wed Sep 10, 2003 7:38 am
Location: Montreal, Canada

Post by philolaos »

I forgot to say that this problem is not related to the limit on the number of open files for the system or for my user. In fact, I was able to open more than 1024 files at the same time on this system (and monitored that with lsof), but without relying on the GNU C Library.

Code: Select all

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 10000
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 212992
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
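
For what it's worth, a minimal C sketch of that distinction (assuming a Linux box whose open-files limit is well above 1024, as in the listing above): plain open() happily hands out descriptor numbers past 1023, yet none of those can be placed in an fd_set:

Code: Select all

#include <stdio.h>
#include <fcntl.h>
#include <sys/select.h>

int main(void)
{
    int fd = -1;

    /* Open /dev/null repeatedly until the descriptor number passes the
       fd_set ceiling; with "open files" at 65535 this succeeds easily. */
    for (int i = 0; i < 1500; i++) {
        fd = open("/dev/null", O_RDONLY);
        if (fd < 0) {
            perror("open");          /* only if the ulimit really is low */
            return 1;
        }
        if (fd >= FD_SETSIZE)
            break;
    }

    printf("open() returned descriptor %d without complaint\n", fd);
    if (fd >= FD_SETSIZE)
        printf("...but %d cannot go into an fd_set (limit %d)\n",
               fd, FD_SETSIZE - 1);
    return 0;
}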
Stéphane Poirier
SHP Consult SA
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Interesting... thanks for posting that.
-craig

"You can never have too many knives" -- Logan Nine Fingers
philolaos
Premium Member
Posts: 11
Joined: Wed Sep 10, 2003 7:38 am
Location: Montreal, Canada

Post by philolaos »

Someone asked me to clarify my previous post because he was not able to use it to get around that limit in his setup. Let me try to explain a bit more about the usefulness of my suggestion.


It is just a way to mitigate the risk of crossing that 1023 limit. I mean, if your job, solely by itself, consumes more than 1023 of that counter during the course of its execution, you're out of luck!

It only reduces the share of that counter already consumed when you start the job itself. If you move the start point from a multi-level nested Job Activity to a fairly simple "dsjob -run" command, you will start with a bigger share of that counter still available.

In my work environment at that time, we had a couple of jobs consuming around 1000 file descriptors overall that failed nonetheless, at first, because the way we used to launch them already consumed more than 200! By launching them differently, we were able to squeeze in the "biggest" possible jobs under that limited counter.

But the limit is still there...
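
To see how much of that head start a given launch path eats up, here is a minimal sketch (Linux-specific, /proc assumed) that counts the descriptors a process has already inherited at startup:

Code: Select all

#include <stdio.h>
#include <dirent.h>

int main(void)
{
    /* Each entry under /proc/self/fd is one descriptor currently held
       by this process, including anything inherited from the launcher. */
    DIR *d = opendir("/proc/self/fd");
    if (d == NULL) {
        perror("opendir /proc/self/fd");
        return 1;
    }

    int count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] != '.')     /* skip "." and ".." */
            count++;
    }
    closedir(d);

    /* The count includes the descriptor opendir() itself is holding. */
    printf("descriptors already in use at startup: %d\n", count);
    return 0;
}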
Stéphane Poirier
SHP Consult SA
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So... would this help at all? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
philolaos
Premium Member
Posts: 11
Joined: Wed Sep 10, 2003 7:38 am
Location: Montreal, Canada

Post by philolaos »

Nope! We already checked that extensively.

Although, we may look at it this way: the problem is not with the values in the "limits.conf" file on the DataStage server that is running the code, but with the "limits.conf" on the system that compiled the shared object (*.so) that our DataStage code relies on, through the call to the previously mentioned PX operator, to load into Oracle.

It's at the Build Time of the product that the problem lies, IMHO.
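
A tiny sketch of why no run-time setting can widen it: FD_SETSIZE and the size of fd_set are frozen into the code (and into a shared object such as liborchoracle10gi686.so) when it is compiled, not read from any configuration when it runs:

Code: Select all

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    /* Both values are fixed by the headers used at compile time; with
       glibc, FD_SETSIZE is 1024 regardless of any ulimit or limits.conf
       setting in effect when the program later runs. */
    printf("FD_SETSIZE     : %d\n", FD_SETSIZE);
    printf("sizeof(fd_set) : %zu bytes\n", sizeof(fd_set));
    return 0;
}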
Stéphane Poirier
SHP Consult SA
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Ah... gotcha.
-craig

"You can never have too many knives" -- Logan Nine Fingers