[torqueusers] Question about checkpoint for MPI

Michael Jennings mej at lbl.gov
Wed Dec 5 10:38:01 MST 2012


On Tuesday, 04 December 2012, at 16:29:39 (+0800),
levin li wrote:

> We're trying to use torque to checkpoint MPI jobs, but it seems that
> torque can only handle jobs running on a single node. I checked the
> code and found that when qhold is used to checkpoint a job, it
> sends a PBS_BATCH_HoldJob request to the pbs server, which relays
> the request to the master host. The master host then checkpoints
> the job processes running on itself with BLCR, but does not send the
> request to its sister nodes, so it seems that MPI jobs cannot be
> checkpointed.
> 
> I'm not sure whether I'm right. Can anybody tell me whether this
> is true of torque? And if I'm right, do you have any plans to make
> checkpointing available for MPI jobs?
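
The request path described in the question can be sketched as a toy
model. This is only an illustration of the poster's reading of the
code, not TORQUE source: every class, function, and node name below is
hypothetical, and checkpoint_local() merely stands in for a BLCR
checkpoint of the processes on one node.

```python
# Toy model of the hold/checkpoint relay described in the question.
# All names are hypothetical; none of this is TORQUE or BLCR code.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    checkpointed: bool = False

    def checkpoint_local(self) -> None:
        # Stands in for a BLCR checkpoint of the job processes on this node.
        self.checkpointed = True


@dataclass
class Job:
    master: Node
    sisters: list = field(default_factory=list)


def hold_job(job: Job, relay_to_sisters: bool = False) -> list:
    """Return the names of the nodes whose processes were checkpointed."""
    # The server relays the hold request to the master host only.
    job.master.checkpoint_local()
    if relay_to_sisters:
        # The step the poster reports as missing.
        for sister in job.sisters:
            sister.checkpoint_local()
    return [n.name for n in [job.master, *job.sisters] if n.checkpointed]


job = Job(Node("node01"), [Node("node02"), Node("node03")])
print(hold_job(job))  # lists only the master host
```

With relay_to_sisters left at False, the MPI ranks on node02 and node03
are never checkpointed, which is the behavior the question describes.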

We work with the BLCR team here at LBNL and have collaborated with
them extensively on trying to get MPI job checkpointing to work across
nodes.  We were able to get it working fairly reliably on a small set
of nodes, but the greater the number of nodes in the job, the greater
the chance of failure.  (I'll skip the details unless someone's
specifically interested.)
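
As a rough illustration of why node count matters, suppose (purely as
an assumption, not something we measured) that each node's checkpoint
succeeds independently with the same probability p. The whole job then
checkpoints cleanly with probability p ** nodes, which shrinks quickly
as the job grows. The value 0.99 below is a made-up figure.

```python
# Illustrative only: assumes each node's checkpoint succeeds independently
# with the same probability p. Neither the model nor the 0.99 figure comes
# from measurements.
def job_checkpoint_success(p: float, nodes: int) -> float:
    """Probability that every node in the job checkpoints successfully."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if nodes < 1:
        raise ValueError("a job needs at least one node")
    return p ** nodes


for n in (1, 8, 64, 512):
    print(n, round(job_checkpoint_success(0.99, n), 3))
```

Real failures are unlikely to be independent or uniform across nodes,
so treat this as intuition for the trend rather than a prediction.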

As the project is no longer being funded (AFAIK), the odds of this
being resolved by the BLCR team in the immediate future are slim.
While I don't claim to speak for them, I can say that we chose to
abandon our plans to deploy BLCR as a job preemption solution for our
TORQUE/Moab systems as a direct result of the prognosis given to us by
the team.  The bottom line is, they're about 90-95% of the way there,
but the remaining 5-10% is a significant challenge.  Without funding,
it doesn't seem likely the work will be completed any time soon.

If you are interested in working with them or have any ideas for
getting the project going again, I'm sure they'd be happy to hear from
you at checkpoint at lbl.gov.

Michael

-- 
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

