[torqueusers] torque maui blcr problem [SEC=UNCLASSIFIED]

DOHERTY, Greg gdz at ansto.gov.au
Thu Nov 18 21:27:17 MST 2010


Please see the error message below from maui when running checkjob 281.
We have installed torque 2.5.3 with BLCR enabled.

The context file has been returned to the server following the qhold
command and is sitting in the /var/spool/torque/checkpoint/281*CK
directory, owned by the user.

The .JB and .SC files are on the server in the
/var/spool/torque/server_priv/jobs directory, both owned by root.

The cr_run, cr_checkpoint, and cr_restart commands run fine by
themselves for this task on a compute node.

qrls 281 leaves the job queued but not restarted. 

maui checkjob message:
job is deferred.  Reason:  RMFailure  (cannot start job - RM failure,
rc: 15059, msg: 'Cannot execute at specified host because of checkpoint
or stagein files MSG=allocated nodes must match checkpoint location')
Holds:    Defer  (hold reason:  RMFailure)
PE:  8.00  StartPriority:  23
cannot select job 281 for partition DEFAULT (job hold active)

Any help on where to start looking would be greatly appreciated.
Greg Doherty

