[torqueusers] checkpoint process relocation

Stijn De Weirdt Stijn.DeWeirdt at ugent.be
Mon Aug 4 09:50:09 MDT 2008


hi all,

we are experimenting with torque and blcr, and one of the things we are 
trying is placing a job on hold and then restarting it on another 
node. (yes, i know it's not officially supported ;)

we are using blcr 071 with torque 2.4.0-snap.200807091010.
a "simple" qhold followed by a qrls works, but when we mark the node the 
job ran on as offline, the job stays in state queued after the qrls (and 
i can't find an obvious qalter option that would let it run elsewhere).

checkjob says
...
job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, 
rc: 15057, msg: 'Cannot execute at specified host because of checkpoint 
or stagein files REJHOST=node11-2.somedomain MSG=cannot allocate node 
'node11-2.somedomain' to job - node not currently available (state: 
offline)') Holds:    Defer  (hold reason:  RMFailure)
...
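
the sequence described above can be sketched roughly as follows. this is 
only an illustration: the helper names (run, hold_offline_release), the 
DRY_RUN switch and the job id are made up, and it assumes the node is 
flagged offline with pbsnodes -o, which the description above does not 
spell out.

```shell
#!/bin/sh
# sketch only: job id and node name are placeholders, not taken from a real run.

# print each command; when DRY_RUN is set, do not execute it.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

# hold_offline_release JOBID NODE   (hypothetical helper)
hold_offline_release() {
    jobid=$1
    node=$2
    run qhold "$jobid"        # put the running job on hold
    run pbsnodes -o "$node"   # assumption: how the node gets flagged offline
    run qrls "$jobid"         # release the hold; in our case the job stays queued
    run checkjob "$jobid"     # ask the scheduler why the job is deferred
}
```

with DRY_RUN=1 set, a call such as 
hold_offline_release 1234 node11-2.somedomain only prints the commands, 
which is handy for checking the order before touching a real queue.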


so my questions are:
is this supposed to work (and if not, is it planned)?
and is it possible for mpi jobs (i.e. relocating the processes)? (i'm 
going to guess not, but i kind of hope i'm wrong ;)


thanks a lot,

stijn


