[torquedev] BLCR changes to be put in Torque 2.5.3
ERoman at lbl.gov
Tue Oct 5 15:06:37 MDT 2010
I'm having a problem where I can't issue checkpoints to multi-node jobs. I'm
seeing ENOSUPPORT errors back from BLCR when I try to run a job with the nodes
attribute set to more than 1. I'm not sure where the error is coming from. It
looks like cr_run isn't being invoked to start the first job script, or perhaps
cr_run is being invoked, but the linker variables (LD_LIBRARY_PATH, LD_PRELOAD)
are being lost because a later exec() somewhere clears the environment.
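One way to picture the second hypothesis is to compare what a child process sees when it inherits the environment with what it sees when the environment is rebuilt without the linker variables. The following is a minimal Python sketch, not Torque or BLCR code; the helper names `child_env_value` and `scrubbed` are made up for illustration.

```python
import subprocess
import sys

# The two linker variables named in the message above.
LINKER_VARS = ("LD_LIBRARY_PATH", "LD_PRELOAD")


def child_env_value(name, env):
    """Start a child Python process with `env`; return its value for `name`.

    An unset variable is reported as an empty string.
    """
    code = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], ''))"
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def scrubbed(env, names=LINKER_VARS):
    """Return a copy of `env` without `names`.

    This mimics a later exec() that rebuilds the environment and drops
    the linker variables (hypothetical behaviour, for illustration only).
    """
    return {key: value for key, value in env.items() if key not in names}
```

With the unmodified environment, `child_env_value("LD_PRELOAD", dict(os.environ))` returns whatever the parent had set; with `scrubbed(dict(os.environ))` it returns an empty string, which is the situation in which a preloaded library would no longer be loaded into the job script.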
There's a second problem as well that I need to see addressed. This has to
do with the way qrls reallocates the nodes allocated to the jobs. There's a
bugzilla entry about this from a ways back:
qrls tries to use the exec_host field of a checkpointed job as a node spec. I
think this is so that the restarted job can run on the same nodes as the
original. The problem is that the exec_host field is not a valid node spec,
so torque cannot reallocate the nodes, and the qrls fails.
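To illustrate the mismatch: assuming exec_host has the form `host/slot+host/slot+...` (for example `n1/0+n1/1+n2/0`) and that a usable node spec has the form `host:ppn=N+host:ppn=N`, a conversion could look like the Python sketch below. Both formats are assumptions based on common Torque 2.x usage, the function name is hypothetical, and this is not what qrls currently does.

```python
def exec_host_to_nodespec(exec_host):
    """Convert an exec_host string into a node spec string.

    Assumed input format:  "n1/0+n1/1+n2/0"      (host/slot entries)
    Assumed output format: "n1:ppn=2+n2:ppn=1"   (one entry per host)
    """
    counts = {}  # host -> number of slots; dicts keep insertion order
    for entry in exec_host.split("+"):
        entry = entry.strip()
        if not entry:
            continue
        # Everything before the first "/" is taken to be the host name.
        host = entry.split("/", 1)[0]
        counts[host] = counts.get(host, 0) + 1
    if not counts:
        raise ValueError("exec_host contains no host entries")
    return "+".join(f"{host}:ppn={n}" for host, n in counts.items())
```

Under those assumptions, `exec_host_to_nodespec("n1/0+n1/1+n2/0")` returns `"n1:ppn=2+n2:ppn=1"`, which keeps the restarted job on the same hosts with the same number of slots per host.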
On Tue, Oct 05, 2010 at 01:40:03PM -0600, Al Taufer wrote:
> We would like to put the following changes into the Torque 2.5.3 release.
> 1) Add a --with-servchkptdir configure option, which allows specifying a different path for the server's checkpoint files. To do this we need to change the current behaviour of the pbs_mom. Currently, when the pbs_mom creates checkpoint images in the default location, it creates them in a subdirectory based on the job ID (e.g. 200216.molo.CK). But when the job has a checkpoint_dir specified, the checkpoint images are created directly in the checkpoint_dir path without any job ID subdirectory. The pbs_mom will now always create checkpoint images in a job ID subdirectory.
> 2) Change the code so that all checkpoint file transfers occur as the user instead of as root. This also changes the permissions on the $TORQUEHOME/checkpoint directory to be world-writable with the sticky bit set.
> There have been a few requests to have the pbs_mom invoke the restart_script as the user instead of as root, which is how it currently works. We don't think this is needed since the restart_script in the /contrib/blcr directory already runs the actual cr_restart command as the user so there should not be any access issues due to filesystems with root squash turned on.
> Please let me know if you have any concerns.
> Al Taufer
> Adaptive Computing