[torquedev] torque+blcr+openmpi

Peter Kruse pk at q-leap.com
Tue Jul 6 08:09:43 MDT 2010


Hi,

Currently I see no way to checkpoint the complete PBS job
when it runs an OpenMPI job.  It is, however, possible to checkpoint
the MPI job itself with ompi-checkpoint.  If you also want to
terminate the job, you could give $signalNum the default
value of "15"; the job will then terminate and free its resources.
In that case the restart_script can resume the job.
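
A minimal sketch of that checkpoint-then-restart sequence, assuming an
Open MPI build with BLCR checkpoint/restart support. The helper names,
the PID and the snapshot name are made up for illustration, and the
--term option of ompi-checkpoint stands in for the $signalNum handling,
which is not shown in this thread. By default the commands are only
printed, not executed.

```shell
#!/bin/sh
# Illustrative sketch only: helper names, PID and snapshot name are hypothetical.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Checkpoint the MPI job via the PID of its mpirun and terminate it afterwards,
# so that the PBS job ends and its resources are freed.
checkpoint_and_stop() {
    run ompi-checkpoint --term "$1"
}

# Inside a newly submitted job with the same node geometry:
# resume the MPI job from the global snapshot written by the checkpoint.
restart_from_snapshot() {
    run ompi-restart "$1"
}

# Example with a made-up mpirun PID and the matching snapshot reference.
checkpoint_and_stop 12345
restart_from_snapshot ompi_global_snapshot_12345.ckpt
```

Setting DRY_RUN=0 would execute the commands for real; the MPI job must
then have been started with checkpointing enabled.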

Regards,

   Peter

Danny Sternkopf wrote:
> Hi,
> 
> OK, I can imagine this use case:
> 
> qsub -c enabled myjob.sh
> -> pbsid
> qchkpt <pbsid>
> qdel <pbsid>
> qsub -c enabled myjob.sh
> -> calls ql-restart-torque-ompi-job
> 
> 
> One could also enable regular checkpoints that the user can resume 
> from later. That would also be useful in case of node failures, to 
> save job progress.
> 
> However, another use case for checkpoint/restart is to let high-priority 
> jobs run immediately, checkpointing other jobs if required when there 
> are too few resources.
> 
> I don't understand the use case for your blcr_restart_script. When 
> should PBS execute it? As far as I can see, qrls doesn't work.
> 
> Best regards,
> 
> Danny
> 
> On 7/6/2010 11:54 AM, Danny Sternkopf wrote:
>> Hi,
>>
>> Ah, I see. But that would mean this concept can't be used for qhold/qrls,
>> right? You always have to reschedule the job.
>>
>> What exactly does your use case look like? You submit a job, then you run
>> qhold, then you run qdel, then you resubmit the job?
>>
>> Regards,
>>
>>
>> Danny
>>
>>
>> On 7/6/2010 11:54 AM, Peter Kruse wrote:
>>> Hi Rishi,
>>>
>>> rishi pathak wrote:
>>>> Hi Danny,
>>>> Is there a need to checkpoint the mpirun/mpiexec processes? (Please
>>>> correct me if I am wrong.) They spawn the MPI program on the
>>>> defined nodes. To restart a checkpointed MPI program, a fresh
>>>> instance of mpirun, mpiexec or pbsdsh can be used.
>>> Exactly, this is how we see and use it. You submit a new job with
>>> the same node geometry and can then restart the same job.
>>>
>>> Regards,
>>>
>>> Peter
>>> _______________________________________________
>>> torquedev mailing list
>>> torquedev at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torquedev
>>>


