[torqueusers] problem with checkpoint and restart with
torque+LAM-MPI?
Liu Xuezhao
lxz at ncic.ac.cn
Wed Jun 21 20:34:59 MDT 2006
Hi,
I am using LAM + TORQUE + BLCR, and I failed to restart the LAM task
under PBS (torque).
I tested it like this:
(1) Use torque to submit a task. The script is:
#PBS -S /bin/bash
#PBS -N Linpack
#PBS -l nodes=2:ppn=1
lamboot
lamnodes
echo "==================="
cd /home/lxz/src/HPL/hpl/bin/lxz
mpirun -np 2 xhpl
echo "==================="
lamhalt
I use "qsub linpack.sh" to submit it to torque.
(2) checkpoint it manually:
cr_checkpoint *** (PID of mpirun)
After it runs, I can find the checkpoint files (3 files here).
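For reference, the PID lookup in this step could be scripted roughly as follows. This is only a sketch: the helper names mpirun_pid, context_name and checkpoint_mpirun are made up for illustration, and it assumes that pgrep is available and that BLCR uses its default context.<PID> file name, as the context.11468 file used in step (4) suggests.

```shell
#!/bin/bash
# Hypothetical helper: PID of the newest mpirun owned by the current user.
mpirun_pid() {
    pgrep -n -u "$(id -un)" -x mpirun
}

# Hypothetical helper: the context file name expected for a given PID
# (assumes BLCR's default naming, context.<PID>).
context_name() {
    printf 'context.%s\n' "$1"
}

# Hypothetical helper: checkpoint the running mpirun and report the file.
checkpoint_mpirun() {
    local pid
    pid=$(mpirun_pid) || { echo "no mpirun process found" >&2; return 1; }
    cr_checkpoint "$pid" && echo "wrote $(context_name "$pid")"
}
```

Calling checkpoint_mpirun on the node where mpirun runs would then replace typing the PID by hand.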
(3) kill the task:
killall xhpl
(4) restart the task.
The script:
#PBS -S /bin/bash
#PBS -N Mm5
#PBS -l nodes=2:ppn=1
lamboot
lamnodes
echo "==================="
cd /home/lxz/tmp
cr_restart context.11468
echo "==================="
lamhalt
I type "qsub restartlinpack.sh" at the command line.
But the task is not restarted correctly. I traced the source code of
BLCR and found that the reason is that "cr_restore_all_files" fails
because it can't find the file "/usr/spool/PBS/spool/69.ganode00.OU",
and so the restart of the task fails.
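One untested idea for the missing file is to make sure the path exists before calling cr_restart. The sketch below assumes that the restart only needs the path to be present, which has not been verified against BLCR; ensure_file is a hypothetical helper name, and creating files under /usr/spool/PBS/spool needs suitable permissions.

```shell
#!/bin/bash
# Hypothetical helper: make sure a file the checkpoint still refers to
# exists, creating an empty placeholder (and its directory) if it does not.
# An existing file is left untouched.
ensure_file() {
    local path=$1
    [ -e "$path" ] && return 0
    mkdir -p "$(dirname "$path")" && : > "$path"
}

# Possible use in the restart script (path taken from the error above):
#   ensure_file /usr/spool/PBS/spool/69.ganode00.OU && cr_restart context.11468
```

Whether the restarted job then writes its output somewhere useful is a separate question that this sketch does not address.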
I am using lam-7.1.2b30, torque-2.0.0p8 and blcr-0.4.1_b4.
Am I doing something wrong? How can I checkpoint and restart a
task under torque (or OpenPBS) and LAM?
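Given the spool-file failure described above, one direction that may be worth trying is to keep mpirun's standard streams away from the TORQUE spool directory, so that the checkpoint refers only to files that still exist at restart time. This is a sketch under that assumption, not a confirmed fix; run_with_log is a hypothetical wrapper name, and PBS_O_WORKDIR is the submission directory that TORQUE exports to the job.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command with stdin taken from /dev/null and
# stdout/stderr appended to a log file of our choosing.
run_with_log() {
    local log=$1
    shift
    "$@" < /dev/null >> "$log" 2>&1
}

# Possible use in the job script from step (1):
#   run_with_log "$PBS_O_WORKDIR/xhpl.out" mpirun -np 2 xhpl
```

The log file lives in the submission directory, which is normally still present when the restart job runs.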
Thanks!
Liu xuezhao
2006-06-21