[torqueusers] Job exits without running.

Ted Sume Nzuonkwelle Ted.Nzuonkwelle at Colorado.Edu
Mon Jan 22 10:12:35 MST 2007


I am running torque-2.1.6 on Fedora Core 5 on a 7-node cluster. The
master node reports all nodes as online. I ran momctl -d -3 on every
node, and each one confirmed communication with the master node.

However, when I submit a job, it is dispatched to the nodes but does not
run and exits almost immediately. Error and output files are created in
my home directory, but they are all empty.

Before submitting a job I ran pbsnodes -a and got the following, showing
that node2 is available. (I disabled the other nodes so I can
troubleshoot with one node. The problem is identical on all the nodes.)

node2
     state = free
     np = 2
     ntype = cluster
     status = opsys=linux,uname=Linux node2.cl.company.com 2.6.15-1.2054_FC5smp #1 SMP Tue Mar 14 16:05:46 EST 2006 i686,sessions=1832,nsessions=1,nusers=1,idletime=377618,totmem=3080744kb,availmem=3002164kb,physmem=1032496kb,ncpus=2,loadave=0.00,netload=843525186,state=free,jobs=? 0,rectime=1169484327


After running echo "sleep 100" | qsub and then pbsnodes -a, I get the
following node info, along with an error in /var/log/messages.

node2
     state = free
     np = 2
     ntype = cluster
     status = opsys=linux,uname=Linux node2.cl.company.com 2.6.15-1.2054_FC5smp #1 SMP Tue Mar 14 16:05:46 EST 2006 i686,sessions=1832,nsessions=1,nusers=1,idletime=378113,totmem=3080744kb,availmem=3002204kb,physmem=1032496kb,ncpus=2,loadave=0.00,netload=843902828,state=free,jobs=? 15201,rectime=1169484813
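
To see which fields changed between the two pbsnodes listings, the status strings can be compared field by field. This is only a minimal sketch in Python: parse_status and diff_status are hypothetical helper names, not part of TORQUE, and the sample strings are shortened copies of the status values above. It assumes the status value is a comma-separated list of key=value pairs with no commas inside any value.

```python
def parse_status(status):
    """Split a pbsnodes 'status' value into a dict of key -> value.

    Assumption: fields are separated by commas and no value contains a comma.
    """
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip fragments that carry no '='
            fields[key.strip()] = value.strip()
    return fields


def diff_status(before, after):
    """Return {key: (old, new)} for every field whose value differs."""
    b, a = parse_status(before), parse_status(after)
    return {
        key: (b.get(key), a.get(key))
        for key in sorted(set(b) | set(a))
        if b.get(key) != a.get(key)
    }


# Shortened copies of the two status values from the pbsnodes output.
before = "idletime=377618,availmem=3002164kb,state=free,jobs=? 0,rectime=1169484327"
after = "idletime=378113,availmem=3002204kb,state=free,jobs=? 15201,rectime=1169484813"

for key, (old, new) in diff_status(before, after).items():
    print(f"{key}: {old} -> {new}")
```

On these samples the state field is unchanged (free in both), while the jobs field goes from "? 0" to "? 15201".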

From /var/log/messages on the head node:

Jan 22 09:45:25 multipole PBS_Server: stream_eof, connection to node2 is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (Premature end of message).  setting node state to down

From node2's mom log file:

01/22/2007 09:42:38;0002;   pbs_mom;Svr;pbs_mom;Is up
01/22/2007 09:42:38;0002;   pbs_mom;Svr;mom_main;MOM executable path and mtime at launch: /usr/local/torque-2.1.6/sbin/pbs_mom 1168385329
01/22/2007 09:42:38;0002;   pbs_mom;n/a;mom_main;hello sent to server multipole.company.com
01/22/2007 09:50:46;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry
01/22/2007 09:50:46;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
01/22/2007 09:50:47;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry
01/22/2007 09:50:47;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
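
To pull the failure entries out of a longer mom log, the semicolon-separated records can be split into fields. The following is a rough sketch in Python under several assumptions: each entry sits on a single line, the six fields are timestamp, event code, daemon, object class, object name and message (the field names are my own labels), and event code 0001 marks the entries of interest, as it does in the excerpt above. MomLogEntry, parse_mom_line and failures are hypothetical names, not TORQUE tools.

```python
from collections import namedtuple

# Field names are illustrative labels, not taken from TORQUE documentation.
MomLogEntry = namedtuple(
    "MomLogEntry", "timestamp event daemon obj_class obj_name message"
)


def parse_mom_line(line):
    """Split one log line on its first five semicolons; None if malformed."""
    parts = line.split(";", 5)
    if len(parts) != 6:
        return None
    return MomLogEntry(*(p.strip() for p in parts))


def failures(lines, event_code="0001"):
    """Return the parsed entries whose event code equals event_code."""
    entries = (parse_mom_line(line) for line in lines)
    return [e for e in entries if e is not None and e.event == event_code]


# Entries copied from the mom log above, each joined onto one line.
sample = [
    "01/22/2007 09:42:38;0002;   pbs_mom;Svr;pbs_mom;Is up",
    "01/22/2007 09:50:46;0001;   pbs_mom;Job;TMomFinalizeJob3;job not "
    "started, Failure job exec failure, after files staged, no retry",
    "01/22/2007 09:50:46;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters",
]

for entry in failures(sample):
    print(entry.timestamp, entry.obj_name, entry.message)
```

Splitting with a maximum of five cuts keeps any semicolons inside the message text intact.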

Any help will be greatly appreciated.

- Ted




