[torqueusers] qhold not functional

Al Taufer ataufer at clusterresources.com
Thu Jun 12 11:21:55 MDT 2008


I have tested this with the 2.2, 2.3, and 2.4 branches.

In 2.2 it acts the same as in 2.4: qhold sets the hold and returns no error.

In 2.3, qhold still sets the hold but returns the 15029 error.

The documentation I looked at was the online Torque docs as well as the
man pages from 2.2, 2.3 & 2.4.  In all of them there is a paragraph that
starts with "If the job is in running state".  The next paragraph is the
one that says qhold will set the hold attribute if checkpoint/restart is
not supported.
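
To check whether the hold was really set even when qhold prints the 15029
error, something along these lines could be used. It is only a sketch: the
helper name hold_types is made up for this example, and it assumes the
usual "    Hold_Types = <value>" line layout of 'qstat -f' output.

```shell
# hold_types: hypothetical helper, not part of Torque.
# Reads `qstat -f` output on stdin and prints the value of the
# Hold_Types attribute (for example "u" for a user hold, "n" for none).
# Assumes attributes are printed as "    Name = value".
hold_types() {
    awk -F' = ' '/^[[:space:]]*Hold_Types = / { print $2 }'
}
```

Typical use would be "qstat -f 901 | hold_types" right after running
"qhold 901". If it prints something other than n, the hold attribute is
on the job, and per the documentation it only takes effect once the job
is rerun with qrerun.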


Glen Beane wrote:
>
>
> On Thu, Jun 12, 2008 at 12:49 PM, Al Taufer
> <ataufer at clusterresources.com> wrote:
>
>     It seems that the code is returning an error message when it
>     should not be returning one.
>
>     The documentation says  that for a running job if checkpoint /
>     restart is not supported, qhold will only set the requested hold
>     attribute. This will have no effect unless the job is rerun with
>     the qrerun command.
>
>
> I know this is the case in the 2.4 snapshots.  The hold does get set,
> and there is no error message displayed by qhold.  Pre-2.4 Torque
> versions complain that the job can't be checkpointed and don't set the
> hold.  Which version of the documentation says the hold will be set
> even if the job can't be checkpointed?
>
>     You should be able to verify that the hold is still being placed
>     on the job by using 'qstat -f' and checking the Hold_Types value.
>
>     Al
>
>     Walid wrote:
>
>         Hi All,
>
>         I have installed Torque 2.3.0 with Maui, but I see different
>         behaviour when I try to hold jobs: qhold complains that the
>         request is rejected, and the MOM logs say that checkpointing is
>         not supported. I am not interested in checkpointing, but I would
>         like to be able to restart the jobs. Any pointers would be
>         appreciated.
>
>         regards
>
>         Walid
>
>         [root at lnx ~]# qstat -an
>         lnx:
>                                                                            Req'd  Req'd   Elap
>         Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
>         -------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
>         901.lnx              luser    parallel STDIN        5270     1  --     --    -- R    --
>           lnx512/0
>         [root at lnx ~]# qhold 901
>         qhold: No support for requested service MSG=MOM rejected hold request: 15029 901.lnx
>         pbs_mom;Req;req_reject;Reject reply code=15029(No support for requested service REJHOST=lnx512 MSG=checkpointing not supported), aux=0, type=HoldJob, from PBS_Server at lnx
>
>         ------------------------------------------------------------------------
>
>         _______________________________________________
>         torqueusers mailing list
>         torqueusers at supercluster.org
>         http://www.supercluster.org/mailman/listinfo/torqueusers
>          
>

