[torquedev] torque+blcr+openmpi
Danny Sternkopf
dsternkopf at hpce.nec.com
Tue Jul 6 07:10:32 MDT 2010
Hi,
OK, I can imagine this use case:
qsub -c enabled myjob.sh
-> pbsid
qchkpt <pbsid>
qdel <pbsid>
qsub -c enabled myjob.sh
-> calls ql-restart-torque-ompi-job
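The sequence above could be wrapped in a small script, roughly like the sketch below. It uses only the commands already shown (qsub -c enabled, qchkpt, qdel). The helper names run, checkpoint_and_delete and resubmit and the DRY_RUN switch are made up for illustration and are not part of TORQUE. The restart itself is assumed to happen inside the job script.

```shell
#!/bin/sh
# Sketch of the checkpoint / delete / resubmit cycle.
# Helper names and DRY_RUN are illustrative, not part of TORQUE.

# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Checkpoint the job, then remove it from the queue.
# qdel is skipped if the checkpoint request fails.
checkpoint_and_delete() {
    jobid=$1
    run qchkpt "$jobid" || return 1
    run qdel "$jobid"
}

# Submit the job script again with checkpointing enabled.
resubmit() {
    run qsub -c enabled "$1"
}
```

With DRY_RUN=1 set, calling checkpoint_and_delete <pbsid> followed by resubmit myjob.sh only prints the commands, which is handy for checking the order before touching a real queue.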
One could also enable regular checkpoints so that the user can resume
from them later. That would also be useful in case of node failures, to
save job progress, so to speak.
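As a rough sketch, regular checkpoints might be requested at submission time through the -c option. This assumes the installed qsub accepts the periodic and interval=<minutes> checkpoint options alongside enabled; please check the qsub man page of your TORQUE version. The helper name checkpoint_opts and the 30-minute value are only examples.

```shell
#!/bin/sh
# Build the value for "qsub -c" that asks for periodic checkpoints.
# Assumption: this qsub supports "periodic" and "interval=<minutes>".
# checkpoint_opts is an illustrative helper, not a TORQUE command.
checkpoint_opts() {
    echo "enabled,periodic,interval=$1"
}

# Example (interval in minutes, value chosen arbitrarily):
#   qsub -c "$(checkpoint_opts 30)" myjob.sh
```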
However, another use case for checkpoint/restart is to let high-priority
jobs run immediately, allowing them to checkpoint other jobs if required
when there are too few resources.
I don't understand the use case for your blcr_restart_script. When
should PBS execute it? As far as I can see, qrls doesn't work.
Best regards,
Danny
On 7/6/2010 11:54 AM, Danny Sternkopf wrote:
> Hi,
>
> ah I see. But that would mean this concept can't be used for qhold/qrls,
> right? You always have to reschedule the job.
>
> What exactly does your use case look like? You submit a job, then you run
> qhold, then you run qdel, then you resubmit the job?
>
> Regards,
>
>
> Danny
>
>
> On 7/6/2010 11:54 AM, Peter Kruse wrote:
>> Hi Rishi,
>>
>> rishi pathak wrote:
>>> Hi Danny,
>>> Is there a need for checkpointing mpirun/mpiexec processes (Please
>>> correct me if I am wrong). They are spawning MPI program on defined
>>> nodes. For restarting a checkpointed MPI program, a fresh instance
>>> of mpirun, mpiexec or pbsdsh can be used.
>>
>> exactly, this is how we use and see it. You submit a new job but with
>> the same node geometry and can then restart the same job.
>>
>> Regards,
>>
>> Peter
>> _______________________________________________
>> torquedev mailing list
>> torquedev at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torquedev
>>