[torqueusers] MOM job launch redesign

Dave Jackson jacksond at supercluster.org
Fri Nov 19 11:02:01 MST 2004


  There is currently a problem in how pbs_mom and pbs_server handle
jobs whose start is delayed.  This affects the start of very large
jobs, as well as jobs on systems that use a prolog or that suffer
frequent network, memory, filesystem, or kernel delays.  Currently,
the pbs job start design works as follows:

- job is submitted
- job run request is issued by scheduler/admin
- pbs_server issues runjob request to mother superior pbs_mom
- pbs_mom sends join request to sister pbs_moms
- pbs_mom forks
  - child launches prolog, execs job
  - parent blocks waiting for child to report success/failure
- pbs_server issues commit runjob request to mother superior pbs_mom

  The problem occurs if the pbs_mom job child takes a while to run the
prolog or exec the job.  By default, the mom prologalarm is set to 5
minutes.  By default, the pbs_server commit request times out after 20
seconds.  If the job prolog takes longer than 20 seconds (which may
occur due to stale file handles, low memory conditions, kernel delays,
or other system-level issues), the pbs_server commit will time out and,
at the server level, the job will be marked as idle.  On the mom side,
however, once the prolog completes, the job will be marked as running.
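
  To make the mismatch concrete, here is a toy model of the two views
of the job.  It is only an illustration, not TORQUE code: the function
name and the state strings are made up for this example, and the two
constants are simply the defaults quoted above.

```python
# Toy model of the timing mismatch described above; not TORQUE code.
PROLOG_ALARM_S = 300    # mom prologalarm default (5 minutes)
COMMIT_TIMEOUT_S = 20   # pbs_server commit timeout default


def job_states(prolog_seconds, commit_timeout=COMMIT_TIMEOUT_S):
    """Return (server_view, mom_view) for a blocking job launch.

    Only prolog run times up to PROLOG_ALARM_S are modeled here.
    """
    if prolog_seconds > commit_timeout:
        # The server stops waiting for the commit reply and marks the
        # job idle, but the mom still starts the job once the prolog ends.
        return ("idle", "running")
    return ("running", "running")


for seconds in (5, 45):
    server_view, mom_view = job_states(seconds)
    print(seconds, server_view, mom_view, server_view == mom_view)
```

  Any prolog run time between the two defaults leaves the server and
the mom disagreeing about the job, which is the root of the issues
listed next.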

  Several issues may now occur:

1 - job runs but runs on 'N - 1' cpus
2 - the scheduler sees the job as idle, attempts to restart it
elsewhere, and the job state becomes corrupted because the job is
already at least partially running on the original nodes
3 - the job completes on the original nodes but the pbs_server daemon
rejects the completion because it thinks the job is idle and does not
allow idle jobs to complete
4 - scheduler/pbs_server appear to hang while the job commit times out

  The easy short-term solution is to make certain that the mom
prologalarm config parameter is less than the pbs_server tcp_timeout
setting (see torque docs for more info).
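
  A quick way to sanity-check this on a node is sketched below.  It
assumes the mom config uses a line of the form "$prologalarm <seconds>"
and that you supply the server tcp_timeout value yourself (the script
does not query pbs_server).  The helper names and the "headnode" host
are hypothetical, and 300 is the 5 minute default mentioned above.

```python
import re


def prologalarm_seconds(mom_config_text, default=300):
    """Return the $prologalarm value from mom config text, or the default."""
    match = re.search(r"^\s*\$prologalarm\s+(\d+)", mom_config_text,
                      re.MULTILINE)
    return int(match.group(1)) if match else default


def prolog_fits_timeout(mom_config_text, server_tcp_timeout):
    """True if the prolog alarm fires before the server gives up."""
    return prologalarm_seconds(mom_config_text) < server_tcp_timeout


# Hypothetical config contents; "headnode" is a placeholder host name.
tuned = "$clienthost headnode\n$prologalarm 15\n"
untuned = "$clienthost headnode\n"

print(prolog_fits_timeout(tuned, 20))     # 15 < 20
print(prolog_fits_timeout(untuned, 20))   # default of 300 is not < 20
```

  Note that lowering prologalarm only bounds how long the prolog may
run; a prolog that legitimately needs more time than the server timeout
will be cut short instead.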

  We are currently making modifications to pbs_mom to change its
behavior to be more robust in this regard.  Our first change is to allow
the pbs_mom to respond to the commit request even before the prolog has
completed.  To do this, we have broken the job launch sequence into a
number of steps and are allowing polling of the status of the job child
while responding to continuing requests.  This design will allow the mom
to verify successful job initialization, respond to ongoing pbs_server
requests and queries (including the commit request), and perform local
tasks without blocking on the prolog/job launch.  This new capability
will initially be enabled only via a config setting.  If it proves
successful, we will make this behavior the default in subsequent
releases.

  This one change should address all 4 of the problems listed above.
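
  The general idea of polling the job child instead of blocking on it
can be sketched as follows.  This is a minimal illustration in Python,
not the actual pbs_mom implementation (which is C); the function names,
the request strings, and the sleeping child that stands in for a slow
prolog are all assumptions made for the example.

```python
import subprocess
import sys
import time


def launch_job(prolog_cmd):
    """Start the prolog/job child and return immediately."""
    return subprocess.Popen(prolog_cmd)


def serve_while_launching(child, pending_requests, poll_interval=0.05):
    """Answer queued requests while polling the child; never block on it."""
    handled = []
    while True:
        status = child.poll()      # None while the child is still running
        while pending_requests:
            # A real daemon would reply here, e.g. to the commit request.
            handled.append(pending_requests.pop(0))
        if status is not None:
            return handled, status
        time.sleep(poll_interval)


# Stand-in for a slow prolog: a child that sleeps briefly, then exits 0.
child = launch_job([sys.executable, "-c", "import time; time.sleep(0.3)"])
handled, exit_status = serve_while_launching(child, ["commit", "status"])
print(handled, exit_status)
```

  The point of the loop is that the commit request is answered on the
first pass, well before the child finishes, while the child's exit
status is still collected once it becomes available.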

  As this is a major change, we are very interested in community
feedback and in beta evaluation.  This capability will first show up in
the pre patch 6 snapshots and will be documented in section 1.1 of the
online TORQUE docs.

  Thanks again for all of the contributions to TORQUE.  It has come a
long, long way!  If there are additional issues not being addressed at
the moment, please let us know.

