[torqueusers] Question about checkpoint for MPI
levin108 at gmail.com
Tue Dec 4 01:29:39 MST 2012
We're trying to use torque to checkpoint MPI jobs, but it seems that
torque can only checkpoint jobs running on a single node. I checked the
code and found that when qhold is used to checkpoint a job, it sends
a PBS_BATCH_HoldJob request to the pbs server. The server relays this
request to the master host, which then checkpoints the job processes
running on itself with BLCR but does not forward the request to its
sister nodes. So it seems that MPI jobs cannot be checkpointed.

I'm not sure whether I've read the code correctly. Can anybody tell me
whether this is really how torque behaves? And if it is, are there any
plans to support checkpointing for MPI jobs?
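
To make the request path I traced above concrete, here is a toy Python sketch of it. This is only a model of my reading of the code, not torque source: the names Node, Job, hold_job and checkpoint_local are made up for illustration, and checkpoint_local merely stands in for the BLCR checkpoint.

```python
# Toy model of the hold/checkpoint path described in this post.
# All names here are hypothetical; none of this is torque or BLCR code.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    checkpointed: bool = False

    def checkpoint_local(self) -> None:
        # Stands in for a BLCR checkpoint of the job processes on this node.
        self.checkpointed = True


@dataclass
class Job:
    master: Node
    sisters: list = field(default_factory=list)


def hold_job(job: Job) -> list:
    """Model a hold request that only the master host acts on."""
    job.master.checkpoint_local()
    # The request is not forwarded to the sister nodes, so they stay untouched.
    return [n.name for n in [job.master, *job.sisters] if n.checkpointed]


job = Job(master=Node("node01"), sisters=[Node("node02"), Node("node03")])
print(hold_job(job))  # only the master host is listed
```

If this model matches the real behaviour, the processes on node02 and node03 are never checkpointed, which is why a multi-node MPI job could not be restarted from the saved state.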