[torqueusers] Job bounces from status R to status Q

Magee, Fred (MRC) fred.magee at atk.com
Mon Oct 31 13:54:47 MST 2005


Good afternoon.
 
I've just installed torque-2.0.0p0 on an Athlon-based cluster running
Red Hat EL 3 U4 with mpich-1.2.6 and mpiexec-0.77. I also just
reinstalled the OS on the master node due to a SCSI controller failure.
I have confirmed that root can ssh, scp, rsh and rcp from any node to
any node without a password; however, rcp prints the following messages:
 
Trying krb4 rcp...
trying normal rcp (/usr/bin/rcp)
 
and ssh from master to the compute nodes yields:
 
Warning: No xauth data; using fake authentication data for X11
forwarding.
 
In both cases the command itself succeeds.
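
A minimal sketch of how the password-less check described above could be
scripted, assuming Python 3 and an OpenSSH client on the master. The
function names are hypothetical, and node2/node3 are simply the node names
that appear elsewhere in this message. BatchMode=yes makes ssh fail instead
of prompting for a password, and -x disables X11 forwarding, which should
also silence the xauth warning. Since the check above was done as root, it
may be worth running this as the account that submits the job as well.

```python
import subprocess


def build_ssh_check(host, remote_cmd="true"):
    """Return the argv for a non-interactive ssh test against one host."""
    # -x: disable X11 forwarding; BatchMode=yes: never prompt for a password.
    return ["ssh", "-x", "-o", "BatchMode=yes", host, remote_cmd]


def can_ssh(host, timeout=15):
    """Return True if the remote command ran and exited with status 0."""
    try:
        result = subprocess.run(
            build_ssh_check(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh missing from PATH, or the connection hung past the timeout.
        return False
    return result.returncode == 0


def check_hosts(hosts):
    """Map each host name to the result of can_ssh()."""
    return {host: can_ssh(host) for host in hosts}
```

For example, check_hosts(["node2", "node3"]) run on the master returns a
dictionary showing which nodes accepted the connection without a prompt.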
 
If I submit a simple test job to one node, it works. The job just
prints out the node/processor it is running on, as in "Greetings from
head process 0 on node2...". If I submit the job to two or more nodes,
it bounces between status R and status Q, and the server log shows the
following:
 
10/31/2005 13:35:47;0100;PBS_Server;Job;6.master.cl.abq.mrc;enqueuing
into workq, state 1 hop 1
10/31/2005 13:35:47;0008;PBS_Server;Job;6.master.cl.abq.mrc;Job Queued
at request of fmagee at master.cl.abq.mrc, owner =
fmagee at master.cl.abq.mrc, job name = mine.pbs, queue = workq
10/31/2005 13:35:47;0040;PBS_Server;Svr;master.cl.abq.mrc;Scheduler sent
command new
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type StatusServer request
received from Scheduler at master.cl.abq.mrc, sock=10
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type StatusNode request
received from Scheduler at master.cl.abq.mrc, sock=10
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type StatusQueue request
received from Scheduler at master.cl.abq.mrc, sock=10
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type SelStat request received
from Scheduler at master.cl.abq.mrc, sock=10
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type SelStat request received
from Scheduler at master.cl.abq.mrc, sock=10
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type ResourceQuery request
received from Scheduler at master.cl.abq.mrc, sock=10
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type ModifyJob request received
from Scheduler at master.cl.abq.mrc, sock=10
10/31/2005 13:35:47;0008;PBS_Server;Job;6.master.cl.abq.mrc;Job Modified
at request of Scheduler at master.cl.abq.mrc
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type RunJob request received
from Scheduler at master.cl.abq.mrc, sock=10
10/31/2005 13:35:47;0008;PBS_Server;Job;6.master.cl.abq.mrc;Job Run at
request of Scheduler at master.cl.abq.mrc
10/31/2005 13:35:47;0040;PBS_Server;Svr;master.cl.abq.mrc;Scheduler sent
command recyc
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type JobObituary request
received from pbs_mom at node3.cl.abq.mrc, sock=10
10/31/2005 13:35:47;0040;PBS_Server;Svr;master.cl.abq.mrc;Scheduler sent
command new
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type JobObituary request
received from pbs_mom at node3.cl.abq.mrc, sock=11
10/31/2005 13:35:47;0009;PBS_Server;Job;6.master.cl.abq.mrc;obit
received for job 6.master.cl.abq.mrc from host node3.cl.abq.mrc with bad
state (state: QUEUED)
10/31/2005 13:35:47;0080;PBS_Server;Req;req_reject;Reject reply
code=15016(Request invalid for state of job), aux=0, type=JobObituary,
from pbs_mom at node3.cl.abq.mrc
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type StatusServer request
received from Scheduler at master.cl.abq.mrc, sock=12
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type StatusNode request
received from Scheduler at master.cl.abq.mrc, sock=12
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type StatusQueue request
received from Scheduler at master.cl.abq.mrc, sock=12
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type SelStat request received
from Scheduler at master.cl.abq.mrc, sock=12
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type SelStat request received
from Scheduler at master.cl.abq.mrc, sock=12
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type ResourceQuery request
received from Scheduler at master.cl.abq.mrc, sock=12
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type RunJob request received
from Scheduler at master.cl.abq.mrc, sock=12
10/31/2005 13:35:47;0008;PBS_Server;Job;6.master.cl.abq.mrc;Job Run at
request of Scheduler at master.cl.abq.mrc
10/31/2005 13:35:47;0040;PBS_Server;Svr;master.cl.abq.mrc;Scheduler sent
command recyc
10/31/2005 13:35:47;0100;PBS_Server;Req;;Type JobObituary request
received from pbs_mom at node3.cl.abq.mrc, sock=10
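
A rough sketch, assuming Python 3, of how the excerpt above could be
summarized per job. It relies only on what is visible in the excerpt: each
record starts with a timestamp and has six semicolon-separated fields, and
the mail client has wrapped long records onto continuation lines. The
function names and the SAMPLE string (three records copied from the log
above) are illustrative, not part of TORQUE.

```python
import re
from collections import Counter

# A record starts with "MM/DD/YYYY HH:MM:SS;"; other lines are wrapped text.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")

SAMPLE = """\
10/31/2005 13:35:47;0008;PBS_Server;Job;6.master.cl.abq.mrc;Job Run at
request of Scheduler at master.cl.abq.mrc
10/31/2005 13:35:47;0009;PBS_Server;Job;6.master.cl.abq.mrc;obit
received for job 6.master.cl.abq.mrc from host node3.cl.abq.mrc with bad
state (state: QUEUED)
10/31/2005 13:35:47;0008;PBS_Server;Job;6.master.cl.abq.mrc;Job Run at
request of Scheduler at master.cl.abq.mrc
"""


def parse_records(text):
    """Rejoin wrapped lines and split each record into its six fields."""
    joined = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            joined.append(line)
        elif joined:
            joined[-1] += " " + line  # continuation of the previous record
    parsed = []
    for record in joined:
        fields = record.split(";", 5)
        if len(fields) == 6:
            parsed.append(tuple(fields))
    return parsed


def summarize(parsed):
    """Count 'Job Run' events and bad-state obits for each job id."""
    runs, bad_obits = Counter(), Counter()
    for _stamp, _code, _daemon, kind, obj, msg in parsed:
        if kind != "Job":
            continue
        if msg.startswith("Job Run"):
            runs[obj] += 1
        elif msg.startswith("obit received") and "bad state" in msg:
            bad_obits[obj] += 1
    return runs, bad_obits


runs, bad_obits = summarize(parse_records(SAMPLE))
```

A job id with more than one "Job Run" event and at least one bad-state obit
matches the R to Q bounce described in this message.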
 
I have found no documentation for reply code 15016 beyond the message
text itself, "Request invalid for state of job".
 
Can anyone out there point me to a quick solution for this problem?
 
Thanks for your help and have a great day.
 
Fred