[torqueusers] checkpoint process relocation
glen.beane at gmail.com
Mon Aug 4 10:43:12 MDT 2008
On Mon, Aug 4, 2008 at 11:50 AM, Stijn De Weirdt <Stijn.DeWeirdt at ugent.be> wrote:
> hi all,
> we are playing around with torque and blcr, and one of the things we are
> trying is placing a job on hold, and trying to restart it on another node.
> (yes, i know it's not officially supported ;)
> we are using blcr 071 with torque 2.4.0-snap.200807091010
> a "simple" qhold and qrls work, but when we flag the processing node
> offline, a qrls keeps the job in state queued (and i don't find an obvious
> qalter option).
> checkjob says
> job is deferred. Reason: RMFailure (cannot start job - RM failure, rc:
> 15057, msg: 'Cannot execute at specified host because of checkpoint or
> stagein files REJHOST=node11-2.somedomain MSG=cannot allocate node
> 'node11-2.somedomain' to job - node not currently available (state:
> offline)') Holds: Defer (hold reason: RMFailure)
> so my question is:
> is this supposed to be working (and if not, is it planned)?
> and is this possible for mpi jobs (i.e. relocation of the processes) (i'm
> going to guess not, but i kind of hope i'm wrong ;)
Not right now. I'd like to work with the OpenMPI folks to make TORQUE aware
of BLCR-enabled OpenMPI.