[torquedev] qrerun gets blocked

Jacques Normand jnormand at nerim.net
Fri Dec 23 17:09:31 MST 2005


Hi,

I have a weird problem with the 2.0.0 series (up to p4): I cannot rerun
jobs on my Debian Opteron cluster. If I submit a job:

echo "hostname ; sleep 30" | qsub -l "nodes=1:ppn=4"

everything goes well but if I want to requeue it as root:

qrerun $n

the job gets stuck in the queued state. The mom complains about
an unknown job id in req_rdytocommit, and the job stays in the Q state;
qstat still shows the node it previously ran on.


12/23/2005 17:43:58;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at janeway.rice.edu, sock=10
12/23/2005 17:43:58;0100;   pbs_mom;Req;;Type JobScript request received from PBS_Server at janeway.rice.edu, sock=10
12/23/2005 17:43:58;0100;   pbs_mom;Req;;Type MoveJobFile request received from PBS_Server at janeway.rice.edu, sock=10
12/23/2005 17:43:58;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at janeway.rice.edu, sock=10
12/23/2005 17:43:58;0001;   pbs_mom;Svr;pbs_mom;Success (0) in req_rdytocommit, unknown job id
12/23/2005 17:43:58;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=ReadyToCommit, from PBS_Server at janeway.rice.edu
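
For anyone trying to reproduce this, the steps above can be collected into one small shell function. This is only a sketch: the function name repro_qrerun and the REPRO_DELAY variable are made up for illustration, the delay before the requeue is an arbitrary choice, and it assumes qsub, qrerun and qstat are on the PATH and that the caller is allowed to run qrerun (root in my case).

```shell
#!/bin/sh
# Reproduction sketch: submit a short job, requeue it, then show its state.
# repro_qrerun and REPRO_DELAY are hypothetical names used only here.
repro_qrerun() {
    # qsub prints the new job id on stdout; keep it for the later commands.
    n=$(echo "hostname ; sleep 30" | qsub -l "nodes=1:ppn=4") || return 1
    echo "submitted: $n"

    # Give the job a moment to start running (arbitrary default of 5 seconds).
    sleep "${REPRO_DELAY:-5}"

    # Requeue the job, then display its state as reported by the server.
    qrerun "$n" || return 1
    qstat "$n"
}
```

Calling repro_qrerun on the submit host should end with the qstat line for the job; on an affected setup that line is expected to stay in the Q state.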


I have looked at the code, and it seems that the "socket test" in
src/server/req_quejob.c is at fault. If I comment out this line:

(pj->ji_qs.ji_un.ji_newt.ji_fromaddr == get_connectaddr(sock))))

then I can rerun, but I don't really understand where the problem comes
from, or whether it is safe to disable this test.

thanks for your help

jacques