[torqueusers] ANNOUNCE: pestat v.2.11: Print a 1-line summary of jobs on each node
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Sep 27 06:33:26 MDT 2011
Dear Torque users,
There is an updated pestat version 2.11 available from
ftp://ftp.fysik.dtu.dk/pub/Torque/pestat.
New features are:
1. The Torque pbs_mom records network load ("netload") as the total number of
bytes transmitted plus received on all network interfaces since boot time,
read from /proc/net/dev. The calculation is defined in the source file
./src/resmom/linux/mom_mach.c.
Since version 2.9, the pestat command prints delta-netload information, i.e.
the change in netload between two runs separated by some time interval.
The recorded netload information is stored in the file $NETLOADFILE.
The baseline netload information may be generated from cron, say, every 10
minutes by this crontab entry:
*/10 * * * * /usr/local/bin/pestat -C > /dev/null
If the netload of a node exceeds NETLOADTHRES (2000 Mbit/sec full-duplex), the
node will be flagged. Please lower NETLOADTHRES if you want to flag lower
netloads.
If your nodes use Ethernet port bonding, please configure the NETLOADSCALE
variable in the script.
2. The "-j jobs" flag lists only those nodes that run at least "jobs" user
jobs. If your site policy permits multiple jobs per node, you can use this
flag to check specifically for multi-job nodes.
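The netload bookkeeping described in item 1 can be illustrated with a small
Python sketch. This is not the pestat or pbs_mom code, only an illustration
under stated assumptions: the function names are hypothetical, the counters are
taken from the standard /proc/net/dev layout (received bytes in the first
counter column, transmitted bytes in the ninth), and the NETLOADSCALE
adjustment for bonded ports is left out because its exact use is not described
here.

```python
NETLOADTHRES_MBIT = 2000  # threshold value quoted in this announcement


def total_netload_bytes(proc_net_dev_text):
    """Sum received + transmitted bytes over all interfaces (hypothetical helper)."""
    total = 0
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no "iface:" prefix
        _iface, counters = line.split(":", 1)
        fields = counters.split()
        # Column 0 is received bytes, column 8 is transmitted bytes.
        total += int(fields[0]) + int(fields[8])
    return total


def delta_netload_mbit_per_sec(old_bytes, new_bytes, seconds):
    """Average rate in Mbit/sec between two netload samples."""
    if seconds <= 0:
        raise ValueError("the two samples must be separated in time")
    return (new_bytes - old_bytes) * 8 / seconds / 1e6


def is_flagged(rate_mbit, threshold=NETLOADTHRES_MBIT):
    """A node is flagged only when the rate exceeds the threshold."""
    return rate_mbit > threshold


SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:     500       5    0    0    0     0          0         0      500       5    0    0    0     0       0          0
  eth0:    1000      10    0    0    0     0          0         0     2000      20    0    0    0     0       0          0
"""

if __name__ == "__main__":
    print(total_netload_bytes(SAMPLE))
    # 150 GB transferred during a 10-minute cron interval
    print(delta_netload_mbit_per_sec(0, 150_000_000_000, 600))
```

Because the counters are totals since boot, a reboot between the two samples
would make the difference negative; a real script would have to treat that
case separately.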
General info:
The pestat utility is used for printing a 1-line summary of jobs on each node.
It parses the output of "pbsnodes -a" and presents the output in a compact,
useful format. In particular we use pestat all the time to display only those
nodes which have jobs that behave in an unexpected way, for example:
# pestat -f
Listing only nodes that are flagged by *
node  state   load   pmem  ncpu    mem   resi  usrs  tasks  jobids/users
n031  excl      0*   7990     4  23992   1380   1/1      4  381711 user1
n040  excl      0*   7990     4  23992   1061   1/1      4  381620 user1
n045  free   0.68*   7990     4  23992    139   0/0      0
n046  free   0.69*   7990     4  23992    140   0/0      0
p013  excl      1*  24110     4  56110    296   1/1      4  400491 user2
p014  excl      1*  24110     4  56110  16036   1/1      4  400491 user2
a063  excl    9.5*  24098     8  72097   1370   1/1      8  400325 user3
a126  excl    8.7*  24098     8  72097   7110   1/1      8  400260 user5
b003  excl    8.5*  24098     8  72097    985   1/1      8  400333 user3
b074  excl    8.6*  24098     8  72097  17123   1/1      8  399435 user4
b109  excl    8.6*  24098     8  72097   1062   1/1      8  400334 user3
c103  excl    8.6*  24098     8  72097  17080   1/1      8  399437 user4
c140  busy*      8  24098     8  72097  20130   1/1      8  393075 user7
d015  excl      5*  24098     8  72097   7235   1/1      8  400453 user6
d034  excl      5*  24098     8  72097   7213   1/1      8  400453 user6
d040  excl    8.5*  24098     8  72097   1177   1/1      8  400350 user3
d050  excl    8.7*  24098     8  72097  17197   1/1      8  399438 user4
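If you want to post-process a listing like the one above, the flagged fields
can be picked out by looking for the trailing "*". The following Python sketch
is only an illustration: the function name and the column list are assumptions
taken from the header line shown above, and it expects the plain (non-color)
output.

```python
# Column names as they appear in the header line of the listing above.
COLUMNS = ["node", "state", "load", "pmem", "ncpu",
           "mem", "resi", "usrs", "tasks"]


def flagged_columns(line):
    """Return (node, [names of columns whose value ends with '*'])."""
    fields = line.split()
    flagged = [name for name, value in zip(COLUMNS, fields)
               if value.endswith("*")]
    return fields[0], flagged


if __name__ == "__main__":
    for row in ("n031 excl 0* 7990 4 23992 1380 1/1 4 381711 user1",
                "c140 busy* 8 24098 8 72097 20130 1/1 8 393075 user7"):
        print(flagged_columns(row))
```

Splitting on whitespace means the sketch works regardless of how the columns
are padded.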
Usage: /usr/local/bin/pestat [-f] [-c|-n] [-d] [-V] [-u username|-g groupname]
[-j jobs] [-C] [-h]
where:
-f: Listing only nodes that are flagged by *
-d: Listing also nodes that are down
-c/-n: Color/no color output
-u username: Print only user <username> (do not use with the -g flag)
-g groupname: Print only users in group <groupname>
-j jobs: List only nodes with at least <jobs> running jobs
-C: Use with cron: Netload file will be saved as /tmp/netload.cron
-h: Print this help information
-V: Version information
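As an illustration of what the "-j jobs" selection amounts to, here is a
Python sketch that counts the distinct job IDs per node in "pbsnodes -a" style
text. It is not the pestat implementation. The function names are hypothetical,
and the sketch assumes that each node block starts with an unindented node
name and that the indented "jobs = ..." attribute is a comma-separated list of
"slot/jobid" entries; check this against the output of your own Torque
version.

```python
def jobs_per_node(pbsnodes_text):
    """Map each node name to the set of distinct job IDs running on it."""
    nodes = {}
    current = None
    for line in pbsnodes_text.splitlines():
        if not line.strip():
            continue  # blank line between node blocks
        if not line[0].isspace():
            # Unindented line: start of a new node block (assumed layout).
            current = line.strip()
            nodes[current] = set()
        elif current is not None:
            key, sep, value = line.strip().partition(" = ")
            if sep and key == "jobs":
                for entry in value.split(","):
                    entry = entry.strip()
                    if entry:
                        # Drop the "slot/" prefix, keep the job ID.
                        nodes[current].add(entry.split("/", 1)[-1])
    return nodes


def nodes_with_at_least(pbsnodes_text, min_jobs):
    """Names of nodes running at least min_jobs distinct jobs, sorted."""
    return sorted(name for name, jobs in jobs_per_node(pbsnodes_text).items()
                  if len(jobs) >= min_jobs)


SAMPLE = """\
n031
     state = job-exclusive
     np = 4
     jobs = 0/381711.srv, 1/381711.srv

n045
     state = free
     np = 4

n050
     state = job-exclusive
     np = 4
     jobs = 0/1.srv, 1/2.srv
"""

if __name__ == "__main__":
    print(nodes_with_at_least(SAMPLE, 2))
```

The sketch counts a job that occupies several slots on a node only once,
which is an assumption about how "at least <jobs> running jobs" should be read.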
--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark