[torqueusers] Testing checkpoint/restart in version 2.4.16
brianm at usc.edu
Mon Aug 29 17:20:54 MDT 2011
I have built a very small three-node cluster to test the latest version of TORQUE and have run into some very strange issues.
I followed the documentation ( http://www.adaptivecomputing.com/resources/docs/torque/2.6jobcheckpoint.php ) to the letter, only to find that the 'blcr_checkpoint_script' script has errors around the '$depth' argument, and that the script is invoked with different arguments than the example script accepts.
I added '$depth' to the variable initialization line near the top of the script and fixed the missing ',' in the ARGV check, but it still failed: the script expects the checkpoint directory to be the argument right after the UID, whereas the actual invocation passes the GID right after the UID, as you can see in the logs.
example log entries:
Aug 29 15:58:40 hpcjr0003-e0 checkpoint_script: Invoked: /var/spool/torque/mom_priv/blcr_checkpoint_script 17240 3274.hpc-dns-l.usc.edu hpccadm super /var/spool/torque/checkpoint/3274.hpc-dns-l.usc.edu.CK ckpt.3274.hpc-dns-l.usc.edu.1314658720 0 -
Aug 29 15:58:40 hpcjr0003-e0 checkpoint_script: Unable to cd to checkpoint dir (super): No such file or directory
Aug 29 15:58:40 hpcjr0003-e0 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 255
Aug 29 15:58:40 hpcjr0003-e0 pbs_mom: LOG_ERROR::blcr_checkpoint_job, pbs_alterjob requested on job 3274.hpc-dns-l.usc.edu failed (15007)
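To make the argument-order mismatch concrete, here is a minimal sketch in Python (not the Perl script itself) that maps the eight values from the first "Invoked:" log line above onto named fields. The field names are assumptions for illustration: the log confirms only the values and their order, the user/group reading follows the description above, and the meanings given to the last two fields ('0' as a signal number, '-' as the depth) are guesses.

```python
from typing import List, NamedTuple


class CheckpointArgs(NamedTuple):
    """Positional arguments as they appear in the logged invocation.

    Field names are hypothetical; only the order is taken from the log.
    """
    session_id: str
    job_id: str
    user: str
    group: str            # arrives right after the user
    checkpoint_dir: str   # fifth argument, not fourth
    checkpoint_name: str
    signal_num: str       # assumed meaning of the logged '0'
    depth: str            # assumed meaning of the logged '-'


def parse_invocation(argv: List[str]) -> CheckpointArgs:
    """Fail loudly on an unexpected argument count instead of shifting fields."""
    if len(argv) != len(CheckpointArgs._fields):
        raise ValueError(
            f"expected {len(CheckpointArgs._fields)} arguments, got {len(argv)}"
        )
    return CheckpointArgs(*argv)


# Arguments copied from the first "Invoked:" log line.
logged = (
    "17240 3274.hpc-dns-l.usc.edu hpccadm super "
    "/var/spool/torque/checkpoint/3274.hpc-dns-l.usc.edu.CK "
    "ckpt.3274.hpc-dns-l.usc.edu.1314658720 0 -"
).split()

args = parse_invocation(logged)
print(args.group)           # the value the unmodified script tried to cd into
print(args.checkpoint_dir)  # the directory it should have used
```

Reading the fourth argument as the checkpoint directory yields 'super', which matches the "Unable to cd to checkpoint dir (super)" error in the log.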
After modifying the script (I can provide my current copy to anyone who is interested), I still don't seem to have a working checkpoint:
Aug 29 16:02:17 hpcjr0007-e0 checkpoint_script: Invoked: /var/spool/torque/mom_priv/blcr_checkpoint_script 19165 3275.hpc-dns-l.usc.edu hpccadm super /var/spool/torque/checkpoint/3275.hpc-dns-l.usc.edu.CK ckpt.3275.hpc-dns-l.usc.edu.1314658937 0 -
Aug 29 16:02:17 hpcjr0007-e0 kernel: blcr: warning: skipped a socket.
Aug 29 16:02:17 hpcjr0007-e0 pbs_mom: LOG_ERROR::blcr_checkpoint_job, pbs_alterjob requested on job 3275.hpc-dns-l.usc.edu failed (15007)
That looks like a BLCR error, but I *DO* have files in the checkpoint directory. The problem is that 'qhold' does not stop the job from running, and no directories are cleaned up after jobs end.
Has anyone ever been able to get a working checkpointed job held and then restarted?
How do I accomplish this if the documentation is not accurate?
University of Southern California