[torqueusers] ANNOUNCE: pestat v.2.11: Print a 1-line summary of jobs on each node

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue Sep 27 06:33:26 MDT 2011


Dear Torque users,

An updated pestat, version 2.11, is available from
ftp://ftp.fysik.dtu.dk/pub/Torque/pestat.

New features are:

1. The Torque pbs_mom daemon reports a "netload" value for each node.  It is
defined in the source file ./src/resmom/linux/mom_mach.c as the sum of bytes
transmitted and received on all network interfaces since boot time, read from
/proc/net/dev.
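
As an illustration of the definition above, here is a minimal sketch that
computes the same kind of sum directly from /proc/net/dev.  It is not taken
from mom_mach.c or from pestat; the function name sum_netload is hypothetical,
and the field positions assume the usual Linux layout of two header lines
followed by one line per interface.

```shell
# Hypothetical helper, not part of pestat or Torque.
# Sums receive + transmit bytes over all interfaces listed in a file
# that has the /proc/net/dev layout (default: /proc/net/dev itself).
sum_netload() {
    awk 'NR > 2 {
        sub(/:/, " ")        # "eth0:123" and "eth0: 123" then split the same way
        total += $2 + $10    # field 2 = receive bytes, field 10 = transmit bytes
    }
    END { printf "%.0f\n", total }' "${1:-/proc/net/dev}"
}
```

Called without an argument on a Linux host, sum_netload prints the byte total
since boot; a file name can be passed to test it against saved data.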

Since version 2.9, the pestat command prints delta-netload information, that
is, the change in netload between two runs separated by some time interval.
The recorded netload values are stored in the file $NETLOADFILE.
A baseline may be recorded from cron, say every 10 minutes, with this crontab
entry:
*/10 * * * * /usr/local/bin/pestat -C > /dev/null

A node is flagged if its netload exceeds NETLOADTHRES (2000 Mbit/sec
full-duplex).  Lower NETLOADTHRES if you want to flag smaller netloads.
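
To make the delta-netload idea concrete, here is a rough sketch of how two
byte counters and the time between them can be turned into a rate in Mbit/sec
and compared with a threshold.  This is an assumption about the arithmetic,
not the code pestat uses: the function names netload_mbit and netload_flag are
hypothetical, 1 Mbit is taken as 10^6 bits, and how pestat accounts for
"full-duplex" or NETLOADSCALE is not modelled.

```shell
# Hypothetical helpers, not part of pestat.
NETLOADTHRES=2000   # Mbit/sec, the value quoted in the announcement

# $1 = earlier byte counter, $2 = later byte counter, $3 = seconds between them
netload_mbit() {
    awk -v old="$1" -v new="$2" -v secs="$3" 'BEGIN {
        # A reboot resets the counters, so a negative delta is not usable
        if (secs <= 0 || new < old) { print "n/a"; exit }
        printf "%.1f\n", (new - old) * 8 / secs / 1000000
    }'
}

# Prints "*" when the rate in $1 is strictly above NETLOADTHRES, else nothing
netload_flag() {
    awk -v rate="$1" -v thres="$NETLOADTHRES" \
        'BEGIN { print ((rate + 0 > thres + 0) ? "*" : "") }'
}
```

For example, 150000000000 bytes in 600 seconds (the 10-minute cron interval)
gives 2000.0 Mbit/sec, which sits exactly at the threshold and is therefore
not flagged by this sketch.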

If your nodes use Ethernet port bonding, please configure the NETLOADSCALE
variable in the script.

2. The "-j jobs" flag lists only those nodes that run at least "jobs" user
jobs.  If your site policy permits multiple jobs per node, you can use this
flag to single out the multi-job nodes, for example with "-j 2".
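
The kind of filtering that "-j" performs can be sketched as follows.  This is
not pestat's own code: the function name nodes_with_min_jobs is hypothetical,
and the sketch assumes that "pbsnodes -a" prints each node name on an
unindented line followed by indented attribute lines, including one of the
form "jobs = 0/381711.server, 1/381711.server".  Output formats differ between
Torque versions, so treat the parsing as an assumption to verify locally.

```shell
# Hypothetical helper, not part of pestat.
# Reads pbsnodes-style text on stdin and prints "node count" for every
# node that runs at least $1 distinct jobs.
nodes_with_min_jobs() {
    awk -v min="$1" '
    /^[^ \t]/ { node = $1 }                    # unindented line starts a node record
    $1 == "jobs" && $2 == "=" {
        n = 0
        split("", seen)                        # clear the per-node set of job IDs
        for (i = 3; i <= NF; i++) {
            id = $i
            id = substr(id, index(id, "/") + 1)   # drop the "slot/" prefix
            sub(/,$/, "", id)                     # drop the trailing comma
            if (!(id in seen)) { seen[id] = 1; n++ }
        }
        if (n >= min) print node, n
    }'
}
```

A typical call would be "pbsnodes -a | nodes_with_min_jobs 2".  A job that
occupies several slots on one node is counted once, because the job IDs are
deduplicated per node.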

General info:
The pestat utility prints a 1-line summary of jobs on each node.
It parses the output of "pbsnodes -a" and presents it in a compact,
useful format.  In particular, we use pestat all the time to display only those
nodes whose jobs behave in an unexpected way, for example:

# pestat -f
Listing only nodes that are flagged by *
      node state  load    pmem ncpu   mem   resi usrs tasks  jobids/users
      n031  excl     0*   7990   4  23992   1380  1/1    4    381711 user1
      n040  excl     0*   7990   4  23992   1061  1/1    4    381620 user1
      n045  free  0.68*   7990   4  23992    139  0/0    0
      n046  free  0.69*   7990   4  23992    140  0/0    0
      p013  excl     1*  24110   4  56110    296  1/1    4    400491 user2
      p014  excl     1*  24110   4  56110  16036  1/1    4    400491 user2
      a063  excl   9.5*  24098   8  72097   1370  1/1    8    400325 user3
      a126  excl   8.7*  24098   8  72097   7110  1/1    8    400260 user5
      b003  excl   8.5*  24098   8  72097    985  1/1    8    400333 user3
      b074  excl   8.6*  24098   8  72097  17123  1/1    8    399435 user4
      b109  excl   8.6*  24098   8  72097   1062  1/1    8    400334 user3
      c103  excl   8.6*  24098   8  72097  17080  1/1    8    399437 user4
      c140  busy*    8   24098   8  72097  20130  1/1    8    393075 user7
      d015  excl     5*  24098   8  72097   7235  1/1    8    400453 user6
      d034  excl     5*  24098   8  72097   7213  1/1    8    400453 user6
      d040  excl   8.5*  24098   8  72097   1177  1/1    8    400350 user3
      d050  excl   8.7*  24098   8  72097  17197  1/1    8    399438 user4

Usage: /usr/local/bin/pestat [-f] [-c|-n] [-d] [-V] [-u username|-g groupname] [-j jobs] [-C] [-h]
where:
         -f: Listing only nodes that are flagged by \*
         -d: Listing also nodes that are down
         -c/-n: Color/no color output
         -u username: Print only user <username> (do not use with the -g flag)
         -g groupname: Print only users in group <groupname>
         -j jobs: List only nodes with at least <jobs> running jobs
         -C: Use with cron: Netload file will be saved as /tmp/netload.cron
         -h: Print this help information
         -V: Version information


-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark

