[torqueusers] checkpoint process relocation

Glen Beane glen.beane at gmail.com
Mon Aug 4 10:43:12 MDT 2008


On Mon, Aug 4, 2008 at 11:50 AM, Stijn De Weirdt <Stijn.DeWeirdt at ugent.be> wrote:

> hi all,
>
> we are playing around with torque and blcr, and one of the things we are
> trying is placing a job on hold and restarting it on another node.
> (yes, i know it's not officially supported ;)
>
> we are using blcr 071 with torque 2.4.0-snap.200807091010
> a "simple" qhold and qrls work, but when we flag the processing node
> offline, a qrls keeps the job in state queued (and i don't find an obvious
> qalter option).
>
> checkjob says
> ...
> job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc:
> 15057, msg: 'Cannot execute at specified host because of checkpoint or
> stagein files REJHOST=node11-2.somedomain MSG=cannot allocate node
> 'node11-2.somedomain' to job - node not currently available (state:
> offline)') Holds:    Defer  (hold reason:  RMFailure)
> ...
>
>
> so my question is:
> is this supposed to be working (and if not, is it planned)?


Nope, it is not supposed to work yet, and yes, it is planned.


>
> and is this possible for mpi jobs (i.e. relocation of the processes)? (i'm
> going to guess not, but i kind of hope i'm wrong ;)



Not right now. I'd like to work with the OpenMPI folks to make TORQUE aware
of BLCR-enabled OpenMPI.
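
As a side note on the checkjob output quoted above: the deferral reason
carries the return code, the rejecting host, and that host's state, and it
says the job is tied to that host because of its checkpoint or stagein files.
Below is a minimal Python sketch for pulling those fields out of the message.
It assumes the message keeps the format quoted in this thread, and
parse_deferral is a hypothetical helper for illustration, not part of TORQUE
or the scheduler.

```python
import re

# Deferral message copied from the checkjob output quoted in this thread,
# with the mail line wraps joined back into one string.
MESSAGE = (
    "job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc: "
    "15057, msg: 'Cannot execute at specified host because of checkpoint or "
    "stagein files REJHOST=node11-2.somedomain MSG=cannot allocate node "
    "'node11-2.somedomain' to job - node not currently available (state: "
    "offline)') Holds:    Defer  (hold reason:  RMFailure)"
)


def parse_deferral(text):
    """Hypothetical helper: extract rc, rejecting host and node state.

    Each field is None when its pattern is not found, since the message
    format is an assumption based on the single sample above.
    """
    patterns = {
        "rc": r"rc:\s*(\d+)",
        "rejhost": r"REJHOST=(\S+)",
        "state": r"\(state:\s*(\w+)\)",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        result[key] = match.group(1) if match else None
    return result


info = parse_deferral(MESSAGE)
print(info)
```

A script like this could be used to spot held jobs whose checkpoint host is
offline, but it only reads the message; it does not change where the job is
allowed to restart.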