[torqueusers] problem with checkpoint and restart with torque+LAM-MPI?

Liu Xuezhao lxz at ncic.ac.cn
Thu Jun 22 19:34:16 MDT 2006


>On Thu, Jun 22, 2006 at 10:34:59AM +0800, Liu Xuezhao alleged:
>>     But the task is not restarted correctly. I traced the source code of BLCR and found the reason: "cr_restore_all_files" fails because it can't find the file "/usr/spool/PBS/spool/69.ganode00.OU", and so the task fails to restart.
>> 
>>     I am using lam-7.1.2b30, torque-2.0.0p8 and blcr-0.4.1_b4.
>> 
>>     Am I doing something wrong? How can I checkpoint and restart a task under torque (or openPBS) and LAM?
>>     Thanks!
>
>TORQUE doesn't (yet) have BLCR support, so there isn't a lot we can do
>at this point.  The BLCR people can probably answer this better since
>I'm just guessing...
>
>But I suppose failing because of the output file makes sense.  Can you
>specify an output file in your homedir with mpirun?  If so, then it
>would still exist for the restart.
>
Thank you for the reply.
I think the problem is not only that the BLCR module can't find the file "/usr/spool/PBS/spool/69.ganode00.OU", because when I copy the file back before running cr_restart, the task still cannot be restarted.
I am not familiar with torque, but I think it is a task management system. Checkpoint/restart is supported by LAM-MPI and BLCR. I can checkpoint and restart an MPI task outside torque, but when I submit the MPI task inside torque and checkpoint it, I can't restart it later. Perhaps the reason is the tm boot module?



