[torqueusers] Question about checkpoint for MPI
chenry at ittc.ku.edu
Wed Dec 5 14:51:16 MST 2012
----- Original Message -----
> From: "Brian Contractor Andrus" <bdandrus at nps.edu>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Sent: Wednesday, December 5, 2012 3:31:19 PM
> Subject: Re: [torqueusers] Question about checkpoint for MPI
> Well, that is sad news.
> What are the options out there for checkpoint/restart of a job then?
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
BLCR still works for many jobs. We are using Torque+Maui+BLCR, but we are not finished with our configurations.
We see this as our solution to the throughput vs. availability vs. turnaround-time dilemma. We have a mixture of jobs: some researchers need immediate access for interactive jobs, some run MPI jobs, and some run lots of small single-core jobs. The solution here is to organize queues (Torque) with different QOS definitions (Maui), which lets the interactive jobs preempt some other jobs. Preemption throws away useful cycles unless we have a working checkpoint/restart scheme (BLCR).
The main caveat I see is that we need to create another queue+QOS for large MPI jobs so that they cannot be preempted.
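As a rough illustration of the queue+QOS layout described above, here is a minimal maui.cfg sketch. It is not taken from an actual site configuration: the queue and QOS names (interactive, small, bigmpi, fast, normal, nopreempt) are made up for the example, and it assumes the Torque queues of the same names already exist and that the mom-side BLCR checkpoint/restart scripts are set up separately.

```
# maui.cfg fragment (sketch only; all queue and QOS names are hypothetical)

# Ask Maui to checkpoint preempted jobs rather than requeue or cancel them
PREEMPTPOLICY          CHECKPOINT

# Interactive work may preempt; small batch work may be preempted
QOSCFG[fast]           QFLAGS=PREEMPTOR
QOSCFG[normal]         QFLAGS=PREEMPTEE

# No preemption flags: large MPI jobs are neither preemptor nor preemptee
QOSCFG[nopreempt]

# Tie each Torque queue (a "class" in Maui) to its QOS
CLASSCFG[interactive]  QLIST=fast       QDEF=fast
CLASSCFG[small]        QLIST=normal     QDEF=normal
CLASSCFG[bigmpi]       QLIST=nopreempt  QDEF=nopreempt
```

The point of the third QOS is simply that jobs in the large-MPI queue never carry the PREEMPTEE flag, so the scheduler will not select them as preemption targets.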
I'd still be interested in knowing other alternatives, but at least for now, we are moving forward with this combination.