[torqueusers] torque maui blcr problem [SEC=UNCLASSIFIED]
gdz at ansto.gov.au
Thu Nov 18 21:27:17 MST 2010
Please see the error message below, from maui when running checkjob 281.
We have installed torque 2.5.3 with BLCR enabled.
The context file has been returned to the server following the qhold
and is sitting in /var/spool/torque/checkpoint/281*CK directory, owned
The .JB and .SC files are on the server in directory
/var/spool/torque/server_priv/jobs, both owned by root.
The cr_run, cr_checkpoint, and cr_restart commands run fine by
themselves for this task on a compute node.
qrls 281 leaves the job queued but not restarted.
maui checkjob message:
job is deferred. Reason: RMFailure (cannot start job - RM failure,
rc: 15059, msg: 'Cannot execute at specified host because of checkpoint
or stagein files MSG=allocated nodes must match checkpoint location')
Holds: Defer (hold reason: RMFailure)
PE: 8.00 StartPriority: 23
cannot select job 281 for partition DEFAULT (job hold active)
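
For anyone who wants to script around this output, here is a minimal sketch that pulls the return code and message out of the RMFailure text quoted above. The function name parse_rm_failure and the regular expression are illustrative assumptions based only on the one message shown here, not on any documented maui output format.

```python
import re

# Text copied from the checkjob output quoted above, with line breaks joined.
CHECKJOB_TEXT = (
    "job is deferred. Reason: RMFailure (cannot start job - RM failure, "
    "rc: 15059, msg: 'Cannot execute at specified host because of checkpoint "
    "or stagein files MSG=allocated nodes must match checkpoint location')"
)


def parse_rm_failure(text):
    """Return (rc, msg, detail) from an RMFailure line, or None if absent.

    Hypothetical helper: the pattern is inferred from the single
    message quoted in this post.
    """
    match = re.search(r"rc:\s*(\d+),\s*msg:\s*'([^']*)'", text)
    if match is None:
        return None
    rc = int(match.group(1))
    msg = match.group(2)
    # The part after " MSG=" is the more specific reason, when present.
    head, sep, detail = msg.partition(" MSG=")
    return rc, head, (detail if sep else None)


if __name__ == "__main__":
    print(parse_rm_failure(CHECKJOB_TEXT))
```

The detail field ("allocated nodes must match checkpoint location") is the part that narrows down the failure; the rc value is returned unchanged for lookup.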
Any help on where to start looking would be greatly appreciated.