[torquedev] qrerun gets blocked

Dave Jackson jacksond at clusterresources.com
Fri Dec 23 17:34:56 MST 2005


Jacques,

  This may be fixed in the latest pre-patch 5 snapshot.  We saw and
corrected a similar problem, but it only affected 64-bit systems.  Please
let us know if the latest snapshot addresses the issue (the bug was in
the pbs_mom daemon).  If not, we will get to work.

Dave

On Fri, 2005-12-23 at 18:09 -0600, Jacques Normand wrote:
> Hi,
> 
> I have a weird problem with the 2.0.0 series (up to p4). I cannot rerun
> jobs on my Debian Opteron cluster. If I submit a job:
> 
> echo "hostname ; sleep 30" | qsub -l "nodes=1:ppn=4"
> 
> everything goes well but if I want to requeue it as root:
> 
> qrerun $n
> 
> the job gets blocked in the queued state. The mom complains about
> an unknown job id in req_rdytocommit, and the job stays in the Q state,
> with the node it previously ran on still showing when I run qstat.
> 
> 
> 12/23/2005 17:43:58;0100;   pbs_mom;Req;;Type QueueJob request received from PBS_Server at janeway.rice.edu, sock=10
> 12/23/2005 17:43:58;0100;   pbs_mom;Req;;Type JobScript request received from PBS_Server at janeway.rice.edu, sock=10
> 12/23/2005 17:43:58;0100;   pbs_mom;Req;;Type MoveJobFile request received from PBS_Server at janeway.rice.edu, sock=10
> 12/23/2005 17:43:58;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at janeway.rice.edu, sock=10
> 12/23/2005 17:43:58;0001;   pbs_mom;Svr;pbs_mom;Success (0) in req_rdytocommit, unknown job id
> 12/23/2005 17:43:58;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=ReadyToCommit, from PBS_Server at janeway.rice.edu
> 
> 
> I have looked at the code and it seems that the "socket test" in
> src/server/req_quejob.c is not good. If I comment out this line:
> 
> (pj->ji_qs.ji_un.ji_newt.ji_fromaddr == get_connectaddr(sock))))
> 
> then I can rerun, but I don't really understand where the problem comes
> from, or whether it is safe to disable this test.
> 
> thanks for your help
> 
> jacques
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev
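
When checking whether a newer snapshot still shows the problem, it may help
to pull only the rejected requests out of the mom log. The following is a
minimal sketch, not part of TORQUE: the function name mom_rejects is
hypothetical, and it assumes the semicolon-delimited log layout seen in the
quoted excerpt (timestamp;event;daemon;class;object;message).

```shell
#!/bin/sh
# mom_rejects: hypothetical helper, not shipped with TORQUE.
# Prints "timestamp reply-code request-type" for each request that pbs_mom
# rejected. Assumes the log layout from the quoted excerpt:
#   timestamp;event;daemon;class;object;message
# Reads the files given as arguments, or standard input if there are none.
mom_rejects() {
    awk -F';' '$5 == "req_reject" {
        # Message looks like:
        #   Reject reply code=15001(Unknown Job Id), aux=0, type=ReadyToCommit, from ...
        code = $6
        sub(/^Reject reply code=/, "", code)   # drop the leading label
        sub(/,.*/, "", code)                   # keep text up to the first comma
        type = $6
        sub(/.*type=/, "", type)               # drop everything before the type
        sub(/,.*/, "", type)                   # keep text up to the next comma
        print $1 " " code " " type
    }' "$@"
}
```

Feeding it the log lines quoted above (without the "> " quote prefix) should
report the single ReadyToCommit rejection with code 15001. An empty result
after a qrerun on the new snapshot would suggest the mom no longer rejects
the commit.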


