[torqueusers] Question about checkpoint for MPI
Andrus, Brian Contractor
bdandrus at nps.edu
Wed Dec 5 14:31:19 MST 2012
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Michael Jennings
> Sent: Wednesday, December 05, 2012 9:38 AM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Question about checkpoint for MPI
> On Tuesday, 04 December 2012, at 16:29:39 (+0800), levin li wrote:
> > We're trying to use TORQUE to checkpoint MPI jobs, but it seems that
> > TORQUE can only handle jobs running on a single node. I checked the
> > code and found that when qhold is used to checkpoint a job, it sends
> > a PBS_BATCH_HoldJob request to the pbs server, which relays the
> > request to the master host. The master host then checkpoints the job
> > processes running on itself with BLCR, but does not send the request
> > to its sister nodes, so it seems that MPI jobs cannot be checkpointed.
> > I'm not sure whether I'm right. Can anybody tell me whether this is
> > true of TORQUE and, if so, whether there are any plans to make
> > checkpointing available for MPI jobs?
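To make the flow reported in the question concrete, here is a minimal toy
model in Python. It is only a sketch of the behaviour as described above, not
TORQUE source code: Node, Job, hold_job and relay_to_sisters are hypothetical
names, and checkpoint_local() merely stands in for a BLCR checkpoint.

```python
# Toy model of the hold/checkpoint relay described in the question.
# All names are hypothetical; this is not how TORQUE is implemented.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    checkpointed: bool = False

    def checkpoint_local(self) -> None:
        # Stand-in for checkpointing this node's job processes (e.g. with BLCR).
        self.checkpointed = True


@dataclass
class Job:
    master: Node
    sisters: list = field(default_factory=list)

    def nodes(self) -> list:
        return [self.master] + self.sisters


def hold_job(job: Job, relay_to_sisters: bool = False) -> list:
    """Model the server relaying a hold request to the job's master host.

    Returns the names of the nodes whose processes were NOT checkpointed.
    """
    job.master.checkpoint_local()
    if relay_to_sisters:
        # The step the question reports as missing.
        for sister in job.sisters:
            sister.checkpoint_local()
    return [n.name for n in job.nodes() if not n.checkpointed]


job = Job(master=Node("node01"), sisters=[Node("node02"), Node("node03")])
missed = hold_job(job)
print(missed)  # ['node02', 'node03']
```

With relay_to_sisters left at False, only the master host's processes are
captured, which is why a multi-node MPI job could not be restarted from such a
checkpoint.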
> We work with the BLCR team here at LBNL and have collaborated with them
> extensively on trying to get MPI job checkpointing to work across nodes.
> We were able to get it working fairly reliably on a small set of nodes, but the
> greater the number of nodes in the job, the greater the chance of failure.
> (I'll skip the details unless someone's specifically interested.)
> As the project is no longer being funded (AFAIK), the odds of this being
> resolved by the BLCR team in the immediate future are slim.
> While I don't claim to speak for them, I can say that we chose to abandon our
> plans to deploy BLCR as a job preemption solution for our TORQUE/Moab
> systems as a direct result of the prognosis given to us by the team. The
> bottom line is, they're about 90-95% of the way there, but the remaining
> 5-10% is a significant challenge. Without funding, it doesn't seem likely the
> work will be completed any time soon.
> If you are interested in working with them or have any ideas for getting the
> project going again, I'm sure they'd be happy to hear from you at
> checkpoint at lbl.gov.
> Michael Jennings <mej at lbl.gov>
> Senior HPC Systems Engineer
> High-Performance Computing Services
> Lawrence Berkeley National Laboratory
> Bldg 50B-3209E W: 510-495-2687
> MS 050B-3209 F: 510-486-8615
Well, that is sad news.
What are the options out there for checkpoint/restart of a job then?
Naval Postgraduate School