[torqueusers] problem with checkpoint and restart with torque+LAM-MPI?

Garrick Staples garrick at clusterresources.com
Thu Jun 22 15:57:29 MDT 2006


On Thu, Jun 22, 2006 at 10:34:59AM +0800, Liu Xuezhao alleged:
>     But the task is not restarted correctly. I traced the source code of BLCR and found the reason: "cr_restore_all_files" fails because it can't find the file "/usr/spool/PBS/spool/69.ganode00.OU", and so the task fails to restart.
> 
>     I am using lam-7.1.2b30, torque-2.0.0p8 and blcr-0.4.1_b4.
> 
>     Am I doing something wrong? How can I checkpoint and restart a task under TORQUE (or OpenPBS) and LAM?
>     Thanks!

TORQUE doesn't (yet) have BLCR support, so there isn't a lot we can do
at this point.  The BLCR people can probably answer this better since
I'm just guessing...

But I suppose failing because of the output file makes sense.  Can you
specify an output file in your homedir with mpirun?  If so, then it
would still exist for the restart.
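
As a rough, untested sketch of that idea: the job script could send the
program's stdout/stderr to a file under $HOME instead of relying on the
spool file.  The helper name run_with_home_output and the program name
./my_app are made up for illustration; PBS_JOBID is the job id variable
TORQUE sets inside a job.  Whether BLCR then restarts cleanly is only a
guess on my part.

```shell
#!/bin/sh
# Sketch only: keep job output in the home directory so the file still
# exists at restart time. run_with_home_output is a hypothetical helper.

run_with_home_output() {
    # Use the TORQUE job id when set, otherwise fall back to the shell PID.
    out="$HOME/job_${PBS_JOBID:-$$}.out"
    # Run the given command with stdout and stderr sent to that file.
    "$@" > "$out" 2>&1
    status=$?
    # Report where the output went, and pass the command's exit status on.
    echo "$out"
    return $status
}

# Inside a job script this might be called as (program name is a placeholder):
#   run_with_home_output mpirun -np 4 ./my_app
```

The file name is derived from the job id only so that concurrent jobs do
not overwrite each other's output; any stable path in the homedir should
serve the same purpose.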



