[torqueusers] Question about checkpoint for MPI
levin108 at gmail.com
Wed Dec 5 19:20:51 MST 2012
On 2012-12-06 01:38, Michael Jennings wrote:
> On Tuesday, 04 December 2012, at 16:29:39 (+0800),
> levin li wrote:
>> We're trying to use TORQUE to checkpoint MPI jobs, but it seems that
>> TORQUE can only handle jobs running on a single node. I checked the
>> code and found that when qhold is used to checkpoint a job, it
>> sends a PBS_BATCH_HoldJob request to the pbs server, which relays
>> the request to the master host. The master host then checkpoints
>> the job processes running on itself with BLCR, but does not send the
>> request to its sister nodes, so it seems that MPI jobs can not be
>> checkpointed.
>>
>> I'm not sure whether I'm right. Can anybody tell me whether this is
>> true of TORQUE? And if I'm right, do you have any plans to make
>> checkpointing of MPI jobs available?
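
The hold path described above can be sketched as a toy Python model.
This is only an illustration of the flow as reported in the question:
Node, hold_job and the forward_to_sisters flag are hypothetical names,
not TORQUE or BLCR APIs, and checkpoint_local merely stands in for a
BLCR checkpoint of the processes on one node.

```python
# Toy model of the hold/checkpoint path described above. All class and
# function names are hypothetical; this is not TORQUE or BLCR code.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    pids: list
    checkpointed: list = field(default_factory=list)

    def checkpoint_local(self):
        # Stand-in for checkpointing this node's job processes with BLCR.
        self.checkpointed = list(self.pids)


def hold_job(master, sisters, forward_to_sisters=False):
    """Simulate the server relaying a hold request to the master host.

    Returns the names of nodes whose processes were NOT checkpointed.
    """
    master.checkpoint_local()
    if forward_to_sisters:
        # The step that appears to be missing for multi-node jobs.
        for sister in sisters:
            sister.checkpoint_local()
    nodes = [master] + sisters
    return [n.name for n in nodes if n.checkpointed != n.pids]


def demo():
    def make_job():
        master = Node("master", [101, 102])
        sisters = [Node("sister1", [201]), Node("sister2", [301])]
        return master, sisters

    master, sisters = make_job()
    observed = hold_job(master, sisters)  # behaviour as reported
    master, sisters = make_job()
    wanted = hold_job(master, sisters, forward_to_sisters=True)
    return observed, wanted
```

In this model, calling demo() shows the sister nodes left
uncheckpointed when the request is not forwarded, and no node left
out when it is.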
> We work with the BLCR team here at LBNL and have collaborated with
> them extensively on trying to get MPI job checkpointing to work across
> nodes. We were able to get it working fairly reliably on a small set
> of nodes, but the greater the number of nodes in the job, the greater
> the chance of failure. (I'll skip the details unless someone's
> specifically interested.)
>
> As the project is no longer being funded (AFAIK), the odds of this
> being resolved by the BLCR team in the immediate future are slim.
> While I don't claim to speak for them, I can say that we chose to
> abandon our plans to deploy BLCR as a job preemption solution for our
> TORQUE/Moab systems as a direct result of the prognosis given to us by
> the team. The bottom line is, they're about 90-95% of the way there,
> but the remaining 5-10% is a significant challenge. Without funding,
> it doesn't seem likely the work will be completed any time soon.
>
> If you are interested in working with them or have any ideas for
> getting the project going again, I'm sure they'd be happy to hear from
> you at checkpoint at lbl.gov.
That's really bad news. I haven't found a reliable way to checkpoint
MPI jobs so far, and I'd like to contribute to any project that aims to
make this possible.

As far as I know, BLCR is an open source project, and there are many
people who want to contribute their effort to make things better. I'm
one of them, but I cannot find where BLCR is hosted; I can only find
the source package. I think hosting the project on GitHub might make
it easier for us to work