[torquedev] BLCR changes to be put in Torque 2.5.3

Eric Roman ERoman at lbl.gov
Tue Oct 5 15:06:37 MDT 2010


I'm having a problem where I can't issue checkpoints to multi-node jobs.  I'm
seeing ENOSUPPORT errors back from BLCR when I try to run a job with the nodes
attribute set to more than 1.  I'm not sure where the error is coming from.  It
looks like cr_run isn't being invoked to start the first job script, or perhaps
cr_run is being set, but the linker variables (LD_LIBRARY_PATH, LD_PRELOAD) are
being lost due to a later exec() somewhere clearing the environment.
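
One way to test the second hypothesis is to look at the environment a job
process was actually started with. The following is only a minimal sketch,
assuming a Linux node where /proc/<pid>/environ is readable for the job's
processes; the helper names (parse_environ, missing_linker_vars, check_pid)
are made up for illustration and are not part of Torque or BLCR.

```python
# Hypothetical diagnostic helpers; nothing here is a Torque or BLCR API.
LINKER_VARS = ("LD_LIBRARY_PATH", "LD_PRELOAD")


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE blob found in /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue
        key, sep, value = entry.partition(b"=")
        if sep:  # skip malformed entries that have no '='
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def missing_linker_vars(env: dict) -> list:
    """Return the linker variables that are unset or empty in env."""
    return [name for name in LINKER_VARS if not env.get(name)]


def check_pid(pid: int) -> list:
    """Report which linker variables were absent when process pid was exec'd."""
    with open(f"/proc/{pid}/environ", "rb") as fh:
        return missing_linker_vars(parse_environ(fh.read()))
```

Running check_pid() against the job script's PID on the first node would show
whether the variables were already gone at the time of the final exec().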

There's a second problem as well that I need to see addressed.  This has to
do with the way qrls reallocates the nodes allocated to the jobs.  There's a
bugzilla entry about this from a ways back:

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=68

qrls tries to use the exec_host field of a checkpointed job as a node spec.  I
think this is so that the restarted job can run on the same nodes as the
original.  The problem is that the exec_host field is not a valid node spec,
so Torque cannot reallocate the nodes, and the qrls fails.
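
To illustrate the mismatch, here is a rough sketch of the kind of conversion
that would be needed. It assumes exec_host looks like "host/slot+host/slot"
and that a node spec looks like "host:ppn=N+host:ppn=N"; both formats are
assumptions on my part rather than something confirmed in this thread, and
exec_host_to_nodespec is a hypothetical helper, not what qrls does today.

```python
def exec_host_to_nodespec(exec_host: str) -> str:
    """Collapse 'host/slot+host/slot' entries into 'host:ppn=N+...'.

    Assumed formats only; hosts keep their order of first appearance.
    """
    counts = {}
    for slot in exec_host.split("+"):
        host = slot.split("/", 1)[0].strip()
        if host:
            counts[host] = counts.get(host, 0) + 1
    return "+".join(f"{host}:ppn={n}" for host, n in counts.items())
```

Under these assumptions, "n01/1+n01/0+n02/1+n02/0" would become
"n01:ppn=2+n02:ppn=2", which pins the restarted job to the original hosts.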

Eric

On Tue, Oct 05, 2010 at 01:40:03PM -0600, Al Taufer wrote:
> We would like to put the following changes into the Torque 2.5.3 release.
> 
> 1) Add a --with-servchkptdir configure option, which allows specifying a different path for the server's checkpoint files. To do this we need to change the current behaviour of the pbs_mom. Currently, when the pbs_mom creates checkpoint images in the default location, it creates them in a subdirectory based on the job ID (e.g. 200216.molo.CK).  But when the job has a checkpoint_dir specified, the checkpoint images are created directly in the checkpoint_dir path without any job ID subdirectory.  The pbs_mom will now always create checkpoint images in a job ID subdirectory.
> 
> 2) Change checkpoint file transfers so that they all occur as the user instead of as root.  This also changes the permissions on the $TORQUEHOME/checkpoint directory to be world-writable with the sticky bit set.
> 
> There have been a few requests to have the pbs_mom invoke the restart_script as the user instead of as root, which is how it currently works.  We don't think this is needed since the restart_script in the /contrib/blcr directory already runs the actual cr_restart command as the user so there should not be any access issues due to filesystems with root squash turned on.
> 
> Please let me know if you have any concerns.
> 
> Al Taufer
> Adaptive Computing
> 
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev
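
For reference, the two quoted changes can be pictured roughly as follows.
This is only a sketch under stated assumptions: the ".CK" suffix is inferred
from the single example name in the quoted message, mode 1777 is the usual
meaning of "world-writable with the sticky bit set", and both function names
are hypothetical rather than taken from the pbs_mom source.

```python
import os


def checkpoint_subdir(base_dir: str, job_id: str) -> str:
    """Return the per-job checkpoint directory under base_dir.

    The '<job_id>.CK' naming is inferred from the example '200216.molo.CK'.
    """
    return os.path.join(base_dir, f"{job_id}.CK")


def make_world_writable_sticky(path: str) -> None:
    """Create path if needed and set mode 1777 (rwxrwxrwt), like /tmp."""
    os.makedirs(path, exist_ok=True)
    # chmod is not affected by the umask, unlike the mode passed to makedirs.
    os.chmod(path, 0o1777)
```

With the sticky bit set, any user can create their own job subdirectory in
the shared checkpoint directory, but cannot remove or rename another user's.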

