[torqueusers] problem with checkpoint and restart with torque+LAM-MPI?

Liu Xuezhao lxz at ncic.ac.cn
Wed Jun 21 20:34:59 MDT 2006


Hi,

   I am using LAM+TORQUE+BLCR, and I failed to restart the LAM task
 under PBS (torque).
   I tested it like this:
   (1). Use torque to submit a task, the script is:
    #PBS -S /bin/bash
    #PBS -N Linpack
    #PBS -l nodes=2:ppn=1
    lamboot
    lamnodes
    echo "==================="
    cd /home/lxz/src/HPL/hpl/bin/lxz
    mpirun -np 2 xhpl
    echo "==================="
    lamhalt
        I use "qsub linpack.sh" to submit it to torque.
   (2)  checkpoint it manually:
    cr_checkpoint *** (PID of mpirun)
        After it runs, I can find the checkpoint files (3 files here).
   (3)  kill the task:
    killall xhpl
   (4)  restart the task.
        the script:
    #PBS -S /bin/bash
    #PBS -N Mm5
    #PBS -l nodes=2:ppn=1
    lamboot
    lamnodes
    echo "==================="
    cd /home/lxz/tmp
    cr_restart context.11468
    echo "==================="
    lamhalt
        I type "qsub restartlinpack.sh" on the command line.
    
    But the task is not restarted correctly. I traced the source code
 of BLCR and found the reason: "cr_restore_all_files" fails because it
 can't find the file "/usr/spool/PBS/spool/69.ganode00.OU", and so the
 restart of the task fails.
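
    A minimal sketch of how the restart script could confirm this before
 calling cr_restart. It only tests whether a path exists on the node it
 runs on. check_saved_file is a hypothetical helper, not part of TORQUE,
 LAM or BLCR, and it is an assumption here that the missing path is the
 spooled stdout file of the original job (job 69).

```shell
#!/bin/bash
# Hypothetical helper (not part of TORQUE, LAM or BLCR): report whether a
# file that the checkpointed process had open still exists on this node.
check_saved_file() {
    local path="$1"
    if [ -e "$path" ]; then
        echo "present: $path"
    else
        echo "missing: $path"
        return 1
    fi
}

# Path taken from the failure described above. Assumption: this is the
# stdout spool file of the original job, which is gone once that job ends.
check_saved_file "/usr/spool/PBS/spool/69.ganode00.OU" || true
```

    If this prints "missing", the failure in "cr_restore_all_files" is
 consistent with the file having been removed, not with a damaged
 context file.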

    I am using lam-7.1.2b30, torque-2.0.0p8 and blcr-0.4.1_b4.

    Am I doing something wrong? How can I checkpoint and restart a
 task under torque (or openPBS) and LAM?
    Thanks!

Liu xuezhao
2006-06-21



