[torquedev] Adding checkpoint/restart support to TORQUE

Lawrence M. Pezzaglia, Jr. larry at hpcrd.lbl.gov
Tue Aug 8 15:59:49 MDT 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

I'm Larry Pezzaglia, a summer student from Lawrence Berkeley National
Laboratory.  I've been working to add Linux checkpoint/restart
functionality to TORQUE through Berkeley Lab Checkpoint Restart:

http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml


I have already added rudimentary BLCR support, but a more complete
implementation requires better integration with TORQUE's session system.

Since BLCR does not currently support checkpointing sessions, my current
code works by determining the PID owned by the session and checkpointing
it directly.  Of course, it would be best to checkpoint the session
itself, but until session checkpoint support is added to BLCR, this is
probably the best option.

How difficult would it be to modify the job structure to store the
PID(s) of the job as well as the session ID?  Would this be easier than
iterating through all the processes at checkpoint time to find those
owned by the right session?  If so, what would be the best way to add
this functionality?
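
For reference, the "iterate through all the processes" option can be
sketched as below. This is only an illustration, assuming a Linux /proc
filesystem; it is written in Python for brevity, it is not TORQUE or
BLCR code, and the helper name pids_in_session is hypothetical.

```python
import os


def pids_in_session(sid):
    """Return a sorted list of PIDs whose session ID equals sid.

    Assumes a Linux-style /proc where each running process appears
    as a directory named after its numeric PID.
    """
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        pid = int(entry)
        try:
            if os.getsid(pid) == sid:
                pids.append(pid)
        except (ProcessLookupError, PermissionError):
            # The process exited during the scan, or we may not query it.
            continue
    return sorted(pids)


if __name__ == "__main__":
    # Example: list every process in the caller's own session.
    print(pids_in_session(os.getsid(0)))
```

The scan costs one getsid() call per process on the node at each
checkpoint, whereas storing the PIDs in the job structure would avoid
the scan at the cost of keeping that list up to date as the job forks
and its children exit.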


Currently, I am trying to figure out how to properly reconstruct the
session around the restarted process.  I am having trouble finding where
and how a TORQUE job script is turned into a job structure and sent to
the Mother Superior for execution.  My (incomplete) picture of the
process is:

1) User submits a job script with qsub.
2) pbs_server reads the job script, stores information in a job
structure, and submits the job to Mother Superior.  The job itself is a
session created in set_job() in resmom/linux/mom_start.c
3) Mother Superior instructs sisters to run the job.

I'm missing details regarding the process between (1) and (2).  What
happens to the original shell script?  Where in the source tree should I
look to find the code that handles these steps?


I'm also trying to figure out the exact conditions that cause a call to
mom_restart_job.  I can see it is called within TMomFinalizeJob1, but
I'm not sure when or how often the mom runs through the code in
TMomFinalizeJob1.



Thanks for the help,

Larry
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org

iD8DBQFE2QlVf8VSbJrnbUERAhsIAJ4poBXfkw2mZhJHpbnZFAK1HJn4SACfZM6m
hrKIlLfegCq4CGfZ4yreUDc=
=T0sV
-----END PGP SIGNATURE-----

