[torquedev] torque+blcr+openmpi

Danny Sternkopf dsternkopf at hpce.nec.com
Tue Jul 6 07:10:32 MDT 2010


Hi,

OK, I can imagine this use case:

qsub -c enabled myjob.sh
-> pbsid
qchkpt <pbsid>
qdel <pbsid>
qsub -c enabled myjob.sh
-> calls ql-restart-torque-ompi-job


One could also enable periodic checkpoints that the user can resume from
later. That would also be useful in case of node failures, since the job's
progress is saved.

However, another use case for checkpoint/restart is to let high-priority
jobs run immediately by checkpointing other jobs when too few resources
are available.

I don't understand the use case for your blcr_restart_script. When
should PBS execute it? As far as I can see, qrls doesn't work.

Best regards,

Danny

On 7/6/2010 11:54 AM, Danny Sternkopf wrote:
> Hi,
>
> Ah, I see. But that would mean this concept can't be used for qhold/qrls,
> right? You always have to reschedule the job.
>
> What exactly does your use case look like? You submit a job, then you run
> qhold, then you run qdel, then you resubmit the job?
>
> Regards,
>
>
> Danny
>
>
> On 7/6/2010 11:54 AM, Peter Kruse wrote:
>> Hi Rishi,
>>
>> rishi pathak wrote:
>>> Hi Danny,
>>> Is there a need for checkpointing the mpirun/mpiexec processes? (Please
>>> correct me if I am wrong.) They only spawn the MPI program on the
>>> defined nodes. For restarting a checkpointed MPI program, a fresh
>>> instance of mpirun, mpiexec or pbsdsh can be used.
>>
>> Exactly, this is how we use and see it. You submit a new job with
>> the same node geometry and can then restart the same job.
>>
>> Regards,
>>
>> Peter

