[torqueusers] torque3.0.0 checkpoint question

bigheadwen001 bigheadwen001 at 163.com
Mon Dec 13 20:44:17 MST 2010


Hi,
  I have a problem with checkpoint/restart under Torque 3.0.0. I have installed BLCR 0.8.2 and Open MPI 1.4.2, and checkpoint/restart (Open MPI + BLCR) works without Torque. Sequential jobs also work well with Torque + BLCR. After I changed the checkpoint_script and restart_script in order to checkpoint/restart an MPI job, checkpointing the job with qhold works and the checkpoint file is copied to /var/spool/torque/checkpoint/. But when I restart the job with qrls, the job always stays queued and the log shows the following messages:


PBS_Server;LOG_ERROR::Unknown node (15064) in set_nodes, request failed, corrupt request

PBS_Server;Req;req_reject;Reject reply code=15059(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from Scheduler at node7

Has anyone run into the same problem? Any reply or suggestion would be appreciated. Thanks.

