[torquedev] qrerun gets blocked
Dave Jackson
jacksond at clusterresources.com
Fri Dec 23 17:34:56 MST 2005
Jacques,
This may be fixed in the latest pre-patch 5 snapshot. We saw and
corrected a similar problem, but it only affected 64-bit systems. Please
let us know whether the latest snapshot addresses the issue (the bug was
in the pbs_mom daemon). If not, we will get to work.
Dave
On Fri, 2005-12-23 at 18:09 -0600, Jacques Normand wrote:
> Hi,
>
> I have a weird problem with the 2.0.0 series (up to p4). I cannot rerun
> jobs on my Debian Opteron cluster. If I submit a job with:
>
> echo "hostname ; sleep 30" | qsub -l "nodes=1:ppn=4"
>
> everything goes well, but if I then requeue it as root with:
>
> qrerun $n
>
> the job gets stuck in the queued state. The mom complains about an
> unknown job id in req_rdytocommit, and the job stays in the Q state;
> qstat still shows the node it was previously executed on.
>
>
> 12/23/2005 17:43:58;0100; pbs_mom;Req;;Type QueueJob request received from PBS_Server at janeway.rice.edu, sock=10
> 12/23/2005 17:43:58;0100; pbs_mom;Req;;Type JobScript request received from PBS_Server at janeway.rice.edu, sock=10
> 12/23/2005 17:43:58;0100; pbs_mom;Req;;Type MoveJobFile request received from PBS_Server at janeway.rice.edu, sock=10
> 12/23/2005 17:43:58;0100; pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at janeway.rice.edu, sock=10
> 12/23/2005 17:43:58;0001; pbs_mom;Svr;pbs_mom;Success (0) in req_rdytocommit, unknown job id
> 12/23/2005 17:43:58;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=ReadyToCommit, from PBS_Server at janeway.rice.edu
>
>
> I have looked at the code, and it seems that the "socket test" in
> src/server/req_quejob.c does not work correctly. If I comment out this line:
>
> (pj->ji_qs.ji_un.ji_newt.ji_fromaddr == get_connectaddr(sock))))
>
> then I can rerun, but I don't really understand where the problem comes
> from, or whether it is safe to disable this test.
>
> thanks for your help
>
> jacques