[torquedev] Adding checkpoint/restart support to TORQUE

Garrick Staples garrick at clusterresources.com
Wed Aug 9 20:00:07 MDT 2006

On Tue, Aug 08, 2006 at 02:59:49PM -0700, Lawrence M. Pezzaglia, Jr. alleged:
> Hi all,
> I'm Larry Pezzaglia, a summer student from Lawrence Berkeley National
> Laboratory.  I've been working to add Linux checkpoint/restart
> functionality to TORQUE through Berkeley Lab Checkpoint Restart:
> http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml
> I have already added rudimentary BLCR support, but a more complete
> implementation requires better integration with TORQUE's session system.
> Since BLCR does not currently support checkpointing sessions, my current
> code works by determining the PID owned by the session and checkpointing
> it directly.  Of course, it would be best to checkpoint the session
> itself, but until session checkpoint support is added to BLCR, this is
> probably the best option.
> How difficult would it be to modify the job structure to store the
> PID(s) of the job as well as the session ID?  Would this be easier than
> iterating through all the processes at checkpoint time to find those
> owned by the right session?  If so, what would be the best way to add
> this functionality?

There are 2 parts in the job structure: the "ji_qs" part, and everything
else.  The ji_qs struct is statically sized and holds all of the quick bits
that you want to store on disk... changing it is an ABI change that
breaks further upgrades.  The rest of the structure is fine to change, but
it won't be stored on disk.  You can also add a job attribute, which will
be stored on disk without breaking the ABI.

The only PIDs that pbs_mom knows about are the top-level user shell
(which isn't the interesting process) and children from the TM interface.

I'd imagine the only real solution is to search for PIDs in the session,
which is exactly what pbs_mom does when killing processes and counting
cput and mem usage.
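
A minimal sketch of that kind of search, assuming a Linux /proc filesystem.
The helper name pids_in_session() is hypothetical and is not taken from the
pbs_mom source; it only illustrates walking /proc and comparing each
process's session ID with getsid():

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Collect up to `max` PIDs whose session ID equals `sid` by walking /proc.
 * Returns the number of PIDs stored in `out`, or -1 if /proc cannot be
 * opened.  Hypothetical helper for illustration; not a pbs_mom function.
 */
int pids_in_session(pid_t sid, pid_t *out, int max)
{
    DIR *proc;
    struct dirent *ent;
    int n = 0;

    assert(out != NULL && max >= 0);

    proc = opendir("/proc");
    if (proc == NULL)
        return -1;

    while (n < max && (ent = readdir(proc)) != NULL) {
        char *end;
        long pid;
        pid_t s;

        /* Process directories in /proc are named with digits only. */
        if (!isdigit((unsigned char)ent->d_name[0]))
            continue;
        pid = strtol(ent->d_name, &end, 10);
        if (*end != '\0' || pid <= 0)
            continue;

        /* getsid() returns -1 if the process has already exited; skip it. */
        s = getsid((pid_t)pid);
        if (s != (pid_t)-1 && s == sid)
            out[n++] = (pid_t)pid;
    }

    closedir(proc);
    return n;
}
```

The result is only a snapshot: processes can start or exit between the scan
and the checkpoint, so a caller would have to tolerate stale or missing PIDs.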

> Currently, I am trying to figure out how to properly reconstruct the session
> around the restarted process.  I am having trouble finding where and
> how a TORQUE job script is turned into a job structure and sent to the
> Mother Superior for execution.  My (incomplete) picture of the process is:
> 1) User submits a job script with qsub.
> 2) pbs_server reads the job script, stores information in a job
> structure, and submits the job to Mother Superior.  The job itself is a
> session created in set_job() in resmom/linux/mom_start.c
> 3) Mother Superior instructs sisters to run the job.
> I'm missing details regarding the process between (1) and (2).  What
> happens to the original shell script?  Where in the source tree should I
> look to find the code that handles these steps?

qsub reads the job script and parses out the #PBS directives, merges
them with command-line args, and presents all of this information as a
list of job attributes, with the job script, to pbs_submit(), which sends
the data over the wire to pbs_server.  pbs_server allocs a job struct and
populates it from the job attributes in req_quejob().
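
As a rough illustration of the directive-parsing step only, here is a sketch
that recognizes a "#PBS" line and returns its arguments.  The function
pbs_directive_args() is hypothetical and is not the parser qsub uses; it
also assumes the directive prefix is the literal string "#PBS":

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * If `line` is a "#PBS" directive, return a pointer to the first
 * non-blank character of its arguments; otherwise return NULL.
 * Hypothetical helper for illustration; not taken from qsub.
 */
const char *pbs_directive_args(const char *line)
{
    static const char prefix[] = "#PBS";
    size_t plen = sizeof(prefix) - 1;

    assert(line != NULL);

    if (strncmp(line, prefix, plen) != 0)
        return NULL;
    line += plen;

    /* Require whitespace after the prefix so "#PBSfoo" is not matched. */
    if (*line != ' ' && *line != '\t')
        return NULL;
    while (*line == ' ' || *line == '\t')
        line++;

    return line;
}
```

For a script line such as "#PBS -l nodes=2" this returns a pointer to
"-l nodes=2", which could then be handled like a command-line argument.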

When pbs_server is given a jobrun request (which comes from qrun or a
scheduler), pbs_server sends a movejob request to pbs_mom that has the
job attributes and the job script.

> I'm also trying to figure out the exact conditions that cause a call to
> mom_restart_job.  I can see it is called within TMomFinalizeJob1, but
> I'm not sure as to when and how often the mom runs through the code in
> TMomFinalizeJob1.

TMomFinalizeJob1() is part of the process that starts jobs.  If the job
has the JOB_SVFLG_CHKPT flag set (which would have been added when the
job was checkpointed), then mom_restart_job() would be called.
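
The decision amounts to a bit test on the job's saved flags.  A sketch with
made-up names: EXAMPLE_SVFLG_CHKPT and its value are placeholders, not the
real JOB_SVFLG_CHKPT definition, and choose_start_action() is hypothetical:

```c
/* Placeholder bit; the real JOB_SVFLG_CHKPT value may differ. */
#define EXAMPLE_SVFLG_CHKPT 0x10

enum start_action {
    START_FRESH,            /* normal job start */
    START_FROM_CHECKPOINT   /* where mom_restart_job() would be called */
};

/* Pick the start path from the job's saved server flags. */
enum start_action choose_start_action(int svrflags)
{
    if (svrflags & EXAMPLE_SVFLG_CHKPT)
        return START_FROM_CHECKPOINT;
    return START_FRESH;
}
```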
