[torquedev] TORQUE checkpointing

Al Taufer ataufer at adaptivecomputing.com
Tue Sep 7 15:05:01 MDT 2010


----- Original Message -----
> Hi,
> 
> I've been working with TORQUE's BLCR support, but have noticed some
> issues in my testing and would like feedback to see whether I'm
> misunderstanding the behavior I'm seeing.
> 
> 1. qhold copies checkpoints back from the node to the PBS server. The
> scp command is run as root on the node but the target of the copy is
> the user running the job; the equivalent of:
> 
> root# scp checkpointfile user@head-node:/var/spool/torque/checkpoint
> 
> Unless root has a key for the user or /var/spool/torque is set as
> usecp, this copy command will fail.
> 

This seems to be a design flaw: the checkpoint files should be both transferred and owned by the same account, either root or the user. They could be owned and transferred by root, since they are not useful outside of TORQUE. Is there any reason why a user should own these files while they reside on the server?

> 2. The server's checkpoint directory is hardcoded to
> PBS_HOME/checkpoint, the permissions of which are set by default so
> that only root can write to it (which breaks the behavior as
> described in #1).
> 

It should be straightforward to make the server's checkpoint directory configurable at configure time. The only problem I see is that if the specified path is an NFS share with root_squash turned on, root cannot access it. Whether that matters depends on whether the files end up being owned by root or by the user, as discussed under item 1.

> 3. Checkpoint files are not deleted on the child after qhold migration
> completes.

This is a bug and I will look at getting rid of the residual files.

> 
> 4. BLCR supports image migration between nodes, and the qhold command
> migrates the checkpoint file back to the server, but TORQUE
> specifically checks and only allows held jobs with checkpoint files
> to run on the node the checkpoint was taken on. Is there a specific
> reason TORQUE disallows node migration of checkpoints?

I don't know the history on this, but it appears that this check (in req_runjob.c) has been there since the start of the svn repository. If someone has a test system that supports image migration, it would be worth removing the check and seeing what issues show up when trying to migrate a BLCR-checkpointed job.

> 
> 5. cr_restart appears to run as root, so BLCR cannot reopen files that
> root cannot access (e.g. files on an NFS share with root_squash on)
> and is unable to restart the checkpoint file.

cr_restart should probably be run as the user, and the BLCR docs say that it can be. I will look at changing it to run as the user.
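
As a rough illustration of the idea only (this is not TORQUE code, and the helper name run_as_user is made up for the example), dropping privileges before launching a restart command could look something like the sketch below. It assumes the caller already knows the job owner's uid and gid, and it switches the group before the user so that the setgid call is still permitted.

```python
import os
import subprocess


def run_as_user(argv, uid, gid):
    """Run argv under the given uid/gid and return its exit status.

    Hypothetical helper for illustration; not part of TORQUE or BLCR.
    """
    def drop_privileges():
        # Runs in the child process, after fork() and before exec().
        if os.geteuid() == 0:
            # Only root may change supplementary groups; clear root's
            # groups so the child does not inherit them.
            os.setgroups([gid])
        os.setgid(gid)  # group first, while we still have the right
        os.setuid(uid)  # after this the child can no longer go back

    return subprocess.run(argv, preexec_fn=drop_privileges).returncode
```

With something like this, the restart would be launched as run_as_user(["cr_restart", context_file], job_uid, job_gid), where context_file, job_uid and job_gid are placeholders for values the caller would have to supply. Files on a root_squash NFS share would then be opened with the user's credentials rather than root's.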

> 
> We're experimenting with handling checkpointing within the job, but
> because TERM is sent to all processes rather than just the parent
> shell, we can't simply trap the signal in the shell and run a
> checkpoint-and-terminate command on receipt of a KILL. Is it possible
> for Moab to send a signal (USR1?) to a job before
> terminating/preempting it?
> 
> Does anyone have any experience using BLCR in production?
> 
> As an aside, what tools do people recommend for working with the
> TORQUE source code? My generic system administrator vim configuration
> is not holding up. Are there any particular IDEs or tools that are
> useful?
> 
> -Andy


Al Taufer
Adaptive Computing
