[torqueusers] Issue copying OU file

Kit Menlove kit at byu.net
Tue Jan 29 11:07:58 MST 2013


Hi all,

 

I'm using a cluster that uses Torque as the batch system.  About half of the
time, checkpointing with DMTCP fails while copying the temporary output
buffer/file:

 

cp -f /var/spool/torque/spool/jobid.myserver.OU \
    /checkpoint_dir/ckpt_myprog_52b886013bb1c112-27763-51060104_files/jobid.myserver.OU_99001

 

I'm using dmtcp_checkpoint (v1.2.6) with the --checkpoint-open-files option.
All I know is that the copy command fails, not why. The destination
directory exists, and the same copy succeeds about half the time. Can
anyone explain why the OU file might not exist at the time of checkpointing,
or what else might cause the failure?
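
For anyone trying to reproduce this, here is a minimal sketch of how the
failure could be narrowed down. It assumes the copy can be re-run by hand or
from a wrapper script on the compute node. The helper name diagnose_copy is
hypothetical and is not part of Torque or DMTCP. It checks the source file
and the destination directory before running cp, and it prints cp's own
error message and exit status if the copy fails.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque or DMTCP): copy a file and
# report which precondition failed, or what cp itself complained about.
diagnose_copy() {
    src=$1
    dst=$2

    # A missing source is reported separately from a cp error.
    if [ ! -e "$src" ]; then
        echo "source missing: $src"
        return 2
    fi

    # The source exists but the current user cannot read it.
    if [ ! -r "$src" ]; then
        echo "source not readable: $src"
        return 3
    fi

    # The directory that should receive the copy does not exist.
    dstdir=$(dirname "$dst")
    if [ ! -d "$dstdir" ]; then
        echo "destination directory missing: $dstdir"
        return 4
    fi

    # Run the copy, capturing cp's stderr and exit status.
    status=0
    err=$(cp -f "$src" "$dst" 2>&1) || status=$?
    if [ "$status" -ne 0 ]; then
        echo "cp failed (exit $status): $err"
        return 5
    fi

    echo "copy ok"
    return 0
}
```

Calling it with the same two paths as the cp command above would show
whether the OU file was missing when the copy ran or whether cp failed for
some other reason.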

 

Thanks,

Kit

 
