[torqueusers] extra 1.2.0p6 notes

Garrick Staples garrick at usc.edu
Thu Sep 15 13:48:19 MDT 2005


I want to mention some of the new features in 1.2.0p6 real quick.

First, make sure 'poll_jobs' is enabled in qmgr.  It's not a new feature
in p6, but it is now enabled by default.  The server attribute
job_stat_rate now defaults to 45 seconds, but upgraded servers will
still have the previous default of 120 (so take a look, you might want
to change this).
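
A minimal sketch of checking and updating these two settings with qmgr.
The function names are made up for illustration, and an attribute that
was never set explicitly may not show up in the 'list server' output.

```shell
# Show the current values; an upgraded server may still say 120.
show_poll_settings() {
    qmgr -c 'list server' | grep -E 'poll_jobs|job_stat_rate'
}

# Turn on job polling and move to the new 45-second default.
apply_poll_defaults() {
    qmgr -c 'set server poll_jobs = True'
    qmgr -c 'set server job_stat_rate = 45'
}
```

Run show_poll_settings first, then apply_poll_defaults as a qmgr
manager if the old values are still in place.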


pbs_mom has a new static parameter called "opsys" (goes along nicely
with "arch") that is included in the status attribute.  The latest
maui snapshots will schedule jobs based on "opsys", so you can do things
like 'qsub -l arch=x86_64,opsys=linux'.



The big thing is that communication between the server and MOM was
revamped to tighten things up.  The idea is that restarting pbs_server
should be much faster and any state mismatches get resolved quickly.
The definitions of the various node states were clarified and changed
subtly from 1.2.0p5.

  down - a node that hasn't reported its status in node_check_rate
seconds or, optionally, is returning an error message to the server.
pbs_server should never attempt any additional communication to a "down"
node until the node reports itself and/or clears its error message.

  offline - pbs_server will continue basic communications to "offline"
nodes but no new jobs will be scheduled.  'pbsnodes -o/-c' now simply
adds/removes the "offline" flag without overriding existing state flags.
This means states like "down,offline" will be common and distinct from
a node in the "offline" state.

  state-unknown - mostly unused now.  Nodes initially have the
"state-unknown" flag on server startup, and the flag should be removed
immediately.
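
As a rough sketch of working with the "offline" flag described above:
the wrappers below only call 'pbsnodes -o', 'pbsnodes -c' and
'pbsnodes -l'.  The function names are hypothetical, not part of
TORQUE.

```shell
# Add the "offline" flag; any existing flag such as "down" stays put.
mark_offline() {
    pbsnodes -o "$@"
}

# Remove the "offline" flag again; other flags are left untouched.
clear_offline() {
    pbsnodes -c "$@"
}

# List nodes that are currently down, offline, or state-unknown.
list_unavailable() {
    pbsnodes -l
}
```

For example, 'mark_offline node01 node02' followed by
'list_unavailable' should show both nodes with "offline" among their
flags (node01 and node02 are placeholder names).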



We have some new performance tuning configs in pbs_server and pbs_mom.
Starting with MOM, check pbs_mom(8B) for $check_poll_time and
$status_update_time.  These can also be set on-the-fly with momctl:
'momctl -q status_update_time=60 -h host,host,...'
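
A small sketch that wraps the momctl call for a list of hosts.  The
function name and the host names in the usage line are placeholders;
only the momctl invocation itself comes from the example above.

```shell
# Usage: set_status_update_time SECONDS HOST[,HOST...]
set_status_update_time() {
    secs=$1
    hosts=$2
    # Takes effect on the running MOMs without restarting them.
    momctl -q "status_update_time=$secs" -h "$hosts"
}
```

For example: 'set_status_update_time 60 node01,node02'.  A value set
this way presumably lasts only until the MOM is restarted, so put it in
the MOM config file as well if you want it to stick.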

While these aren't new, see pbs_server_attributes(7B) for job_stat_rate
and node_check_rate.

Several MOM configs can now be adjusted on-the-fly with momctl, as
mentioned above.  They are listed in pbs_mom(8B).



There are also a few experimental features that need some testing and
feedback.

pbs_server has a server attribute called "job_nanny" that is not set by
default.  It sets up a persistent task within the server when a request
comes in to delete a job.  If for some reason the Mother Superior (MS)
isn't responding (causing the kill request to be lost), pbs_server will
retry it every minute.  It always disallows additional job delete
requests, returning the error message "job cancel in progress"; a nice
side effect is users don't get bombarded with emails.  I've been using
this in production for about 6 weeks and my users are happy.  I don't
know how long this feature will last; the overall goal is async job
deletes, which would make job_nanny useless.
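
If you want to try it, something like the following should do.  This is
a sketch assuming the usual qmgr set/unset syntax for boolean server
attributes; the function names are invented for illustration.

```shell
# Enable the experimental job_nanny task on the server.
enable_job_nanny() {
    qmgr -c 'set server job_nanny = True'
}

# Go back to the default behavior by unsetting the attribute.
disable_job_nanny() {
    qmgr -c 'unset server job_nanny'
}
```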

pbs_server has a server attribute called "down_on_error" that will mark
nodes down if they issue an "ERROR" message (see HEALTH CHECK in
pbs_mom(8B)).  This overlaps with a maui/moab feature.

pbs_mom has a new config called "$down_on_error" that causes MOM to
report itself as "down" if it finds an "ERROR" message.  I just noticed
that I totally goofed on setting $down_on_error with momctl.  'momctl -q
down_on_error' reports the wrong value, so ignore it.  And it can only
be enabled, not disabled.  Fortunately it is an experimental feature
that is disabled by default.

Enabling "down_on_error" on both pbs_server and pbs_mom is redundant.
If you like the idea, I'd recommend trying the pbs_server config before
the pbs_mom config.
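
Either way, the node needs a health check that actually emits an
"ERROR" message.  Here is a hypothetical example of such a check, a
sketch only: the function name, the directory and the threshold are all
assumptions, and how to hook a script into MOM is described under
HEALTH CHECK in pbs_mom(8B).

```shell
# Print an "ERROR" line when a filesystem is low on free space.
# Usage: check_scratch DIR MIN_FREE_KB
check_scratch() {
    dir=$1
    min_kb=$2
    # POSIX df output: line 2, column 4 is the available space in KB.
    free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    if [ "$free_kb" -lt "$min_kb" ]; then
        echo "ERROR $dir has only $free_kb KB free"
    fi
}
```

A health check script could call 'check_scratch /tmp 1048576' to flag
nodes with less than 1 GB free in /tmp.  It prints nothing when the
node is healthy.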

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California