[torqueusers] mom's error: Bad file descriptor (9) in finish_exec...

Yaroslav Halchenko maui at onerussian.com
Tue Apr 12 11:04:21 MDT 2005


Dear Torquers,

Please advise on how to resolve the following problem. I have a torque (1.1.0p4)
+ maui setup running under Debian unstable. Everything ran smoothly until
I decided to tidy up the cluster and move all the nodes into a 4th-level
domain, because originally, for some reason, they all just had fake
names in the upper-level domain. The move seemed to go fine, but now
some moms refuse to run jobs and have the following in their logs (node2
serves as the server):


04/12/2005 12:40:16;0002;   pbs_mom;Svr;pbs_mom;Is up
04/12/2005 12:40:16;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
04/12/2005 12:40:40;0100;   pbs_mom;Req;;Type queuejob request received from PBS_Server at node2.cluster.xxx.edu, sock=10
04/12/2005 12:40:40;0100;   pbs_mom;Req;;Type jobscript request received from PBS_Server at node2.cluster.xxx.edu, sock=10
04/12/2005 12:40:40;0100;   pbs_mom;Req;;Type readytocommit request received from PBS_Server at node2.cluster.xxx.edu, sock=10
04/12/2005 12:40:40;0100;   pbs_mom;Req;;Type commit request received from PBS_Server at node2.cluster.xxx.edu, sock=10
04/12/2005 12:40:41;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in finish_exec, read of pipe for sid failed for job 67405.node2.cluster.xxx.edu (0 of 8 bytes)
04/12/2005 12:40:41;0008;   pbs_mom;Req; ;sending ABORT to sisters
04/12/2005 12:40:41;0008;   pbs_mom;Job;67405.node2.cluster.xxx.edu;start failed, improper sid
04/12/2005 12:40:41;0100;   pbs_mom;Req;;Type statusjob request received from PBS_Server at node2.cluster.xxx.edu, sock=13
04/12/2005 12:40:41;0080;   pbs_mom;Job;67405.node2.cluster.xxx.edu;Obit sent
04/12/2005 12:40:41;0100;   pbs_mom;Req;;Type deletejob request received from PBS_Server at node2.cluster.xxx.edu, sock=10

The server logs show almost nothing for the same period:
04/12/2005 12:40:16;0004;PBS_Server;Svr;is_request;HELLO received from node19
04/12/2005 12:40:40;0008;PBS_Server;Job;67405.node2.cluster.xxx.edu;Job Run at request of root at node2.cluster.xxx.edu
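
In case it helps anyone reading these excerpts: each log line appears to be
six semicolon-separated fields. Below is a minimal Python sketch, assuming
that layout, which splits a line and keeps only the entries with a given
event code (0001 is the code on the "Bad file descriptor" line above). The
field names and the helper names parse_line and entries_with_event are my
own labels for illustration, not anything defined by torque.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    # Field names are illustrative labels, not official torque terms.
    timestamp: str
    event: str
    daemon: str
    kind: str
    obj: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into six fields; return None if it does not fit."""
    # maxsplit=5 keeps any ';' inside the free-text message intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogEntry(*(p.strip() for p in parts))


def entries_with_event(lines: Iterable[str], event: str = "0001") -> List[LogEntry]:
    """Return the parsed entries whose event-code field equals `event`."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e.event == event]


sample = [
    "04/12/2005 12:40:16;0002;   pbs_mom;Svr;pbs_mom;Is up",
    "04/12/2005 12:40:41;0008;   pbs_mom;Job;67405.node2.cluster.xxx.edu;start failed, improper sid",
]

for entry in entries_with_event(sample, event="0008"):
    print(entry.timestamp, entry.obj, entry.message)
```

Run against a whole mom log file (for example by passing an open file object
as `lines`), this should make it easier to pull out just the failing entries
on each node.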

Can you please advise on what is wrong?


-- 
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers-Newark
Office: (973) 353-5440x263 | FWD: 82823 | Fax: (973) 353-1171
        101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07105
Student  Ph.D. @ CS Dept. NJIT

