[torqueusers] pbs_mom: Bad file descriptor (9) in TMomFinalizeJob3

Yaroslav Halchenko maui at onerussian.com
Thu Oct 27 09:39:37 MDT 2005


Dear Developers,

I am using torque 1.2.0p6. Everything was working smoothly before I
upgraded all the nodes (running Debian unstable).
Now the moms on some of the nodes refuse to run tasks, reporting:

node14 pbs_mom: Bad file descriptor (9) in TMomFinalizeJob3, read of
pipe for sid failed for job 224817.node2 (0 of 8 bytes)
<more of log below>

I can't detect any dependence between the failing nodes and their
configuration or state -- all the nodes seem to be uniformly
configured. On one of the failing nodes, restarting pbs_mom
helped. On another, neither restarting pbs_mom nor rebooting the node
helped, so I placed a system reservation back on that node so that no
tasks get scheduled on it.

Please advise where to look to solve the problem.
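
For reference, error 9 in the message is EBADF on Linux. Below is a
minimal Python sketch, not taken from the TORQUE source, of two pipe
conditions that look similar from the outside: reading a descriptor that
was already closed (which raises EBADF) and reading a pipe whose writer
went away without sending anything (which returns 0 bytes and raises no
error). The function names are hypothetical, and whether pbs_mom hits
either condition is only an assumption.

```python
import errno
import os


def read_closed_pipe():
    """Read from a pipe whose read end was already closed; return the errno."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)            # the descriptor is invalid from here on
    try:
        os.read(read_fd, 8)      # 8 bytes, matching the size in the log message
    except OSError as exc:
        return exc.errno         # EBADF on a closed descriptor
    finally:
        os.close(write_fd)
    return None


def read_after_writer_exit():
    """Read from a pipe whose write end closed without writing; return byte count."""
    read_fd, write_fd = os.pipe()
    os.close(write_fd)           # the writer disappears without sending data
    data = os.read(read_fd, 8)   # end of file: returns b"" and raises nothing
    os.close(read_fd)
    return len(data)


if __name__ == "__main__":
    print(read_closed_pipe() == errno.EBADF)
    print(read_after_writer_exit())
```

A read that returns 0 bytes at end of file does not set errno by itself,
so the "(0 of 8 bytes)" part of the message and the error code may not
describe the same event.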


10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at node2, sock=10
10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at node2, sock=10
10/27/2005 11:30:45;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in TMomFinalizeJob3, read of pipe for sid failed for job 224817.node2 (0 of 8 bytes)
10/27/2005 11:30:45;0001;   pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid
10/27/2005 11:30:45;0001;   pbs_mom;Job;224817.node2;ALERT:  job failed phase 3 start, server will retry
10/27/2005 11:30:45;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at node2, sock=13
10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type ModifyJob request received from PBS_Server at node2, sock=10
10/27/2005 11:30:45;0008;   pbs_mom;Job;224817.node2;Job Modified at request of PBS_Server at node2
10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server at node2, sock=12
10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at node2, sock=10
10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type JobScript request received from PBS_Server at node2, sock=10
10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at node2, sock=10
10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at node2, sock=10
10/27/2005 11:30:45;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in TMomFinalizeJob3, read of pipe for sid failed for job 224818.node2 (0 of 8 bytes)
10/27/2005 11:30:45;0001;   pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid
10/27/2005 11:30:45;0001;   pbs_mom;Job;224818.node2;ALERT:  job failed phase 3 start, server will retry
10/27/2005 11:30:45;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at node2, sock=13
10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type ModifyJob request received from PBS_Server at node2, sock=10
10/27/2005 11:30:45;0008;   pbs_mom;Job;224818.node2;Job Modified at request of PBS_Server at node2
10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server at node2, sock=12
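
To compare nodes, the log lines above can be summarized per job. The
following is a rough Python sketch that assumes each line has six
semicolon-separated fields, as in the excerpt; the field names
(timestamp, code, daemon, kind, object, message) and the function names
are my own labels, not official TORQUE terminology.

```python
from collections import Counter

# Three lines copied from the excerpt above, used as sample input.
SAMPLE = """\
10/27/2005 11:30:45;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in TMomFinalizeJob3, read of pipe for sid failed for job 224817.node2 (0 of 8 bytes)
10/27/2005 11:30:45;0001;   pbs_mom;Job;224817.node2;ALERT:  job failed phase 3 start, server will retry
10/27/2005 11:30:45;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server at node2, sock=12
"""


def parse_line(line):
    """Split one log line into six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)  # the message itself may contain ';'
    if len(parts) != 6:
        return None
    timestamp, code, daemon, kind, obj, message = parts
    return {
        "timestamp": timestamp,
        "code": code,
        "daemon": daemon.strip(),
        "kind": kind,
        "object": obj,
        "message": message,
    }


def failed_starts(lines):
    """Count 'job failed phase 3 start' alerts per job id."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["kind"] == "Job" and "job failed phase 3 start" in rec["message"]:
            counts[rec["object"]] += 1
    return counts


if __name__ == "__main__":
    for job, n in failed_starts(SAMPLE.splitlines()).items():
        print(job, n)
```

Running the same count over the mom logs of a working node and a failing
node would show whether every job start fails on the bad node or only
some of them.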

-- 
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers-Newark
Office: (973) 353-5440x263 | FWD: 82823 | Fax: (973) 353-1171
        101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07105
Student  Ph.D. @ CS Dept. NJIT

