[torqueusers] Failure on restart job on different node

TingtingYang ytt515 at yahoo.cn
Fri Aug 17 02:58:01 MDT 2012


Hi all,

I want to restart my job on a different node, but something goes wrong. I follow these steps to checkpoint/restart:

1. qsub -c enabled ctest.sh (ctest.sh is a serial job)
2. qhold JobID (I can successfully hold an MPI or serial job and restart it on the same node)
3. qalter -l nodes=<another node> JobID (a node other than the one that was running job JobID)
4. qrls JobID

The job exits with exit_code=139, and /var/log/messages says:
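The hold, move, and release steps above can be sketched as a small script. This is only an illustration: the `run` helper and the `DRY_RUN`, `JOBID`, and `TARGET_NODE` variables are names made up for the sketch, and the default values (job 486.node8, target node6) are taken from the log in this message. With `DRY_RUN=1`, the default here, the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch of the hold / move / release sequence described in this post.
# Assumptions: DRY_RUN, JOBID and TARGET_NODE are illustrative names;
# the job must have been submitted with "qsub -c enabled".
DRY_RUN="${DRY_RUN:-1}"
JOBID="${JOBID:-486.node8}"
TARGET_NODE="${TARGET_NODE:-node6}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_on_other_node() {
    run qhold "$JOBID"                            # checkpoint and hold the job
    run qalter -l nodes="$TARGET_NODE" "$JOBID"   # request a different node
    run qrls "$JOBID"                             # release; restart happens on the new node
}

restart_on_other_node
```

Running it with `DRY_RUN=0 JOBID=<job id> TARGET_NODE=<node>` would execute the same three commands against a real TORQUE server.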
Aug 17 16:47:13 node6 restart_script:
 Invoked: /var/spool/torque/mom_priv/blcr_restart_script 15858 486.node8 ytt ytt /home/share/pbs/486.node8.CK ckpt.486.node8.1345193231 
Aug 17 16:48:35 node6 kernel: crtest[15860]: segfault at 0000003bcf633600 rip 0000003bcf633600 rsp 00007fff2e6cb218 error 14
Aug 17 16:48:35 node6 kernel: 486.node8.SC[15859]: segfault at 0000003bcf6e71a0 rip 0000003bcf6e71a0 rsp 00007fffe0da4e18 error 14
Aug 17 16:48:35 node6 kernel: bash[15858]: segfault at 0000003bcf6e71a0 rip 0000003bcf6e71a0 rsp 00007fff4196c7b8 error 14
Aug 17 16:48:35 node6 restart_script: Subcommand (cr_restart --run-on-success='qalter -W checkpoint_restart_status="Successfully restarted job from checkpoint" 486.node8' --run-on-fail-perm='qalter -W checkpoint_restart_status="Permanent failure restarting job from checkpoint" 486.node8' --run-on-fail-temp='qalter -W
 checkpoint_restart_status="Temporary failure restarting job from checkpoint" 486.node8' --run-on-fail-args='qalter -W checkpoint_restart_status="Argument failure restarting job from checkpoint" 486.node8' --run-on-fail-env='qalter -W checkpoint_restart_status="Environment failure restarting job from checkpoint" 486.node8' --run-on-failure='qalter -W checkpoint_restart_status="General failure restarting job from checkpoint" 486.node8' ckpt.486.node8.1345193231) failed with rc=139: 
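
A note on rc=139: if the value follows the usual shell convention of 128 plus the signal number (an assumption, not something the log states), then 139 - 128 = 11, which is SIGSEGV on Linux and would match the segfault lines from the kernel above. A minimal sketch for decoding such a status, where `decode_status` is a hypothetical helper name:

```shell
#!/bin/sh
# Decode an exit status, assuming the shell convention that a status
# above 128 means "terminated by signal (status - 128)".
decode_status() {
    status="$1"
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name for that number
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with status $status"
    fi
}

decode_status 139
```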

I use blcr-0.8.4, openmpi-1.6 and torque-2.5.8, and I installed BLCR, Open MPI and TORQUE on my shared file system, as https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink says.

Can anyone help me, please? Thank you.

Tingting Yang, Beihang University