[torqueusers] problem with checkpoint and restart
lxz at ncic.ac.cn
Thu Jun 22 19:34:16 MDT 2006
>On Thu, Jun 22, 2006 at 10:34:59AM +0800, Liu Xuezhao alleged:
>> But the task is not restarted correctly. I traced the source code of BLCR and found that the restart fails because "cr_restore_all_files" cannot find the file "/usr/spool/PBS/spool/69.ganode00.OU".
>> I am using lam-7.1.2b30, torque-2.0.0p8 and blcr-0.4.1_b4.
>> Am I doing something wrong? How can I checkpoint and restart a task under TORQUE (or OpenPBS) and LAM?
>TORQUE doesn't (yet) have BLCR support, so there isn't a lot we can do
>at this point. The BLCR people can probably answer this better since
>I'm just guessing...
>But I suppose failing because of the output file makes sense. Can you
>specify an output file in your homedir with mpirun? If so, then it
>would still exist for the restart.
Thank you for the reply.
I think the problem is not only that the BLCR module can't find the file "/usr/spool/PBS/spool/69.ganode00.OU": even when I copy the file back before running cr_restart, the task still cannot be restarted.
I am not familiar with TORQUE, but I understand it to be a task management system. Checkpoint/restart is supported by LAM/MPI and BLCR. I can checkpoint and restart an MPI task outside TORQUE, but when I submit the MPI task inside TORQUE and checkpoint it, I can't restart it later. Perhaps the cause is the tm boot module?