[torqueusers] checkpoint process relocation
Stijn De Weirdt
Stijn.DeWeirdt at ugent.be
Mon Aug 4 09:50:09 MDT 2008
hi all,
we are playing around with torque and blcr, and one of the things we are
trying is placing a job on hold and then restarting it on another
node. (yes, i know it's not officially supported ;)
we are using blcr 071 with torque 2.4.0-snap.200807091010
a "simple" qhold followed by qrls works, but when we mark the node the job
was running on as offline, qrls leaves the job in the queued state (and i
can't find an obvious qalter option to change that).
checkjob says
...
job is deferred. Reason: RMFailure (cannot start job - RM failure,
rc: 15057, msg: 'Cannot execute at specified host because of checkpoint
or stagein files REJHOST=node11-2.somedomain MSG=cannot allocate node
'node11-2.somedomain' to job - node not currently available (state:
offline)') Holds: Defer (hold reason: RMFailure)
...
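for reference, the sequence described above can be sketched roughly as follows. this is only an illustration: the job id is a made-up placeholder, and using pbsnodes -o to flag the node offline is an assumption about how that step was done. the run wrapper and the DRY_RUN variable are hypothetical helpers so that the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the hold / offline / release sequence (placeholders, not real values).
JOBID="${JOBID:-1234.server}"          # hypothetical job id
NODE="${NODE:-node11-2.somedomain}"    # node named in the checkjob output
DRY_RUN="${DRY_RUN:-1}"                # 1 = only print the commands

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

run qhold "$JOBID"        # put the running job on hold
run pbsnodes -o "$NODE"   # mark the node offline (assumed method)
run qrls "$JOBID"         # release the hold; the job then stays queued
run checkjob "$JOBID"     # ask the scheduler why the job is deferred
```

with DRY_RUN left at its default, the script just echoes the four commands, which is a safe way to check the order before trying it on a real job.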
so my questions are:
is this supposed to work (and if not, is it planned)?
and is this possible for mpi jobs (i.e. relocation of the processes)? (i'm
going to guess not, but i kind of hope i'm wrong ;)
thanks a lot,
stijn