[torquedev] TORQUE checkpointing
keenandr at msu.edu
Thu Sep 9 00:37:50 MDT 2010
>> > 1. qhold copies checkpoints back from the node to the PBS server. The
>> > scp command is run as root on the node but the target of the copy is
>> > the
>> > user running the job; the equivalent of:
>> > root# scp checkpointfile user@head-node:/var/spool/torque/checkpoint
>> > Unless root has a key for user or /var/spool/torque is set as usecp,
>> > then this copy command will fail.
> This seems to be a design flaw: the checkpoint files should be transferred and owned either by root or by the user. It seems that the checkpoint files could be owned and transferred by root, since they are not useful outside of TORQUE. Is there any reason why a user should own these files while they reside on the server?
Off the top of my head:
1. Simpler to account for the user's disk usage on the checkpoint target
directory and to enforce file system quotas for users (to prevent an
inadvertent DoS from one user sending many or large checkpoint files to
the server and filling the file system where checkpoints are stored).
It may be better to put this logic in the server, as I don't know off
the top of my head whether the checkpoint process is robust enough to
handle a file-system-full or over-quota state. Does the copy routine
check that there is adequate space before copying the checkpoint file?
Should it?
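For illustration, such a pre-copy check could be sketched roughly like
this. This is not what TORQUE currently does; the function name, the
destination path, and the KB rounding are my own:

```shell
# check_space FILE DESTDIR: succeed only if DESTDIR's file system has
# more free space than FILE needs (sketch, not TORQUE's actual logic).
check_space() {
    # size of the checkpoint file, rounded up to whole KB
    need_kb=$(( ( $(stat -c %s "$1") + 1023 ) / 1024 ))
    # available KB on the destination file system (POSIX df output)
    avail_kb=$(df -Pk "$2" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -gt "$need_kb" ]
}

# e.g. guard the server-side copy:
#   check_space "$ckpt" /var/spool/torque/checkpoint &&
#       scp "$ckpt" user@head-node:/var/spool/torque/checkpoint/
```

Note that a free-space check like this says nothing about per-user
quotas on the target file system; an over-quota write would still have
to be detected and handled after the fact.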
2. User -> user copy has fewer potential security issues when using scp.
Our site doesn't allow root to ssh from the compute nodes into our head
node. Does usecp apply here?
3. If the user owns the file, what happens if the user has access to the
checkpoint directory and removes the checkpoint file after the job is
checkpointed successfully but not yet restarted?

Conceptually, I think these are similar to spool files and should be
handled the same way.
>> > 2. The server's checkpoint directory is hardcoded to
>> > PBS_HOME/checkpoint, the permissions of which are set by default so
>> > that only root can write to it (which breaks the behavior described
>> > in #1).
> It should be straightforward to specify the server's checkpoint directory at configure time. The only problem I see is that if the specified path is an NFS share with root_squash turned on, then root cannot access it. I guess this depends on whether the files end up being owned by root or by the user, as in question 1.
Of course, a root_squash file system could be mounted on the
preconfigured checkpoint directory even without that option. A
configure-time option would be a good start.
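To make the proposal concrete, such an option might look something like
the following; the flag name is hypothetical (no such switch exists in
TORQUE's configure script today):

```
# hypothetical configure-time option -- this flag does not exist yet:
./configure --with-checkpoint-path=/fastnfsstorage/torque/checkpoint
```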
I'm concerned that the node running pbs_server would become the I/O
bottleneck when staging checkpoint files. Ideally, it would be possible
to configure the mom to read and write checkpoints directly on a shared
filesystem (is it possible to approximate this behavior with usecp?),
but what happens if /fastnfsstorage isn't mounted on all nodes?
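A mom-side usecp mapping along these lines might approximate direct
shared-storage staging, assuming the server's checkpoint path is
exported over NFS and mounted on the nodes; the host name and paths
below are placeholders, not a tested configuration:

```
# mom_priv/config on each compute node (host and paths are examples):
# tell pbs_mom the server's checkpoint path is reachable locally, so it
# copies with cp to the NFS mount instead of running scp as root
$usecp head-node:/var/spool/torque/checkpoint /fastnfsstorage/torque/checkpoint
```

Of course, this only helps on nodes where the mount actually exists,
which is exactly the concern above.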
>> > 4. BLCR supports image migration between nodes, and TORQUE's qhold
>> > command migrates the checkpoint file back to the server, but TORQUE
>> > specifically checks and only allows held jobs with checkpoint files
>> > to run on the node where the checkpoint was taken. Is there a
>> > specific reason TORQUE disallows node migration of checkpoints?
> I don't know the history on this, but it appears that this check (in req_runjob.c) has been in place since the start of the svn repository. If someone has a test system that supports image migration, it would be worth removing the check and seeing what issues show up when trying to migrate a BLCR checkpoint job.
It may be a while before I can dedicate some time to check this, but I
will add this to my queue.