[torqueusers] Nodes that pbs reports are busy which are actually running a job

Rahul Nabar rpnabar at gmail.com
Wed Aug 11 15:43:39 MDT 2010


I have a node where pbsnodes reports the following:

eu044
     state = busy
     np = 8
     properties = INTEL,10GigE
     ntype = cluster
     status = opsys=linux,uname=Linux eu044 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64,sessions=25252,nsessions=1,nusers=1,idletime=4160964,totmem=24815792kb,availmem=103236kb,physmem=16429872kb,ncpus=8,loadave=9.00,netload=174910266926482,state=busy,jobs=,varattr=,rectime=1281562538

Since the state doesn't show "job-exclusive", I assumed there was no
user job on the node. But if I log in to eu044 and run top, I see:

######################
top - 16:38:27 up 48 days,  3:53,  1 user,  load average: 9.00, 9.00, 9.00
Tasks: 155 total,   7 running, 148 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.0%us, 93.9%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16429872k total, 16350560k used,    79312k free,     7336k buffers
Swap:  8385920k total,  8385920k used,        0k free,    14416k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
25254 gwpeng    25   0 2224m 817m  176 S 100.2  5.1   8879:07 vasp_gamma
25253 gwpeng    25   0 2307m 861m  176 R 99.9  5.4   8879:10 vasp_gamma
25255 gwpeng    25   0 2334m 1.4g  180 S 99.9  8.9   8879:20 vasp_gamma
25256 gwpeng    25   0 2334m 1.4g  176 S 99.9  8.7   8879:19 vasp_gamma
25257 gwpeng    25   0 2292m 919m  176 R 99.9  5.7   8879:15 vasp_gamma
25258 gwpeng    25   0 2333m 730m  176 R 99.9  4.6   8879:40 vasp_gamma
25259 gwpeng    25   0 2326m 942m  176 R 99.9  5.9   8879:13 vasp_gamma
25260 gwpeng    25   0 2204m 843m  176 R 99.9  5.3   8879:18 vasp_gamma
#############################

These are 8-core machines, so I can understand PBS reporting busy
because the load average is 9 (>8).

But why does pbsnodes not list the node as job-exclusive as well? It
doesn't even seem to report a job number for that node.
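
The mismatch can be checked mechanically. Below is a minimal Python sketch
that parses the status string from the pbsnodes output above (shortened to
the fields it uses) and flags a node whose load exceeds its core count while
its jobs field is empty. The helper names are made up for illustration, the
plain comma split is a simplification, and the "loadave > ncpus" test only
mirrors my assumption above; the real busy threshold is whatever the mom is
configured with.

```python
# Status string copied from the pbsnodes output (only the fields used here).
status = ("opsys=linux,sessions=25252,nsessions=1,nusers=1,idletime=4160964,"
          "ncpus=8,loadave=9.00,state=busy,jobs=,varattr=,rectime=1281562538")


def parse_status(text):
    """Split a comma-separated key=value status string into a dict."""
    fields = {}
    for item in text.split(","):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def looks_unaccounted(fields):
    """True if load exceeds the core count but no job is listed.

    Assumption: 'loadave > ncpus' stands in for the real busy threshold.
    """
    overloaded = float(fields["loadave"]) > int(fields["ncpus"])
    return overloaded and fields["jobs"] == ""


fields = parse_status(status)
print("state:", fields["state"])
print("loadave:", fields["loadave"], "ncpus:", fields["ncpus"])
print("load without a listed job:", looks_unaccounted(fields))
```

For eu044 the check comes out true: the node is loaded past its core count,
yet the jobs field is empty.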

The mom seems to be running on the node:

[root@eu044 ~]# service pbs status
pbs_mom is pid 3810

But a momctl reveals that the mom doesn't think there is a local job:

##############################
[root@eu044 ~]# /opt/torque/sbin/momctl -d 3

Host: eu044/eu044   Version: 2.4.5   PID: 3810
Server[0]: euadmin (10.0.3.2:1023)
  Init Msgs Received:     5 hellos/2 cluster-addrs
  Init Msgs Sent:         11 hellos
  Last Msg From Server:   529523 seconds (DeleteJob)
  Last Msg To Server:     8 seconds
HomeDirectory:          /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (1834324 blocks available)
NOTE:  syslog enabled
MOM active:             4161213 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               4 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    TCP
MemLocked:              TRUE  (mlock)
Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:
10.0.0.43,10.0.0.42,10.0.0.41,10.0.0.40,10.0.0.39,10.0.0.38,10.0.0.37,10.0.0.36,10.0.0.35,10.0.0.34,10.0.0.33,10.0.0.32,10.0.0.31,10.0.0.30,10.0.0.29,10.0.0.28,10.0.0.27,10.0.0.26,10.0.0.25,10.0.0.24,10.0.0.23,10.0.0.22,10.0.0.21,10.0.0.20,10.0.0.19,10.0.0.18,10.0.0.17,10.0.0.16,10.0.0.15,10.0.0.14,10.0.0.13,10.0.0.12,10.0.0.11,10.0.0.10,10.0.0.9,10.0.0.8,10.0.0.7,10.0.0.6,10.0.0.5,10.0.0.4,10.0.0.3,10.0.0.2,10.0.0.1,10.0.2.61,10.0.2.60,10.0.2.59,10.0.2.58,10.0.2.57,10.0.2.56,10.0.2.55,10.0.2.54,10.0.2.53,10.0.2.52,10.0.2.51,10.0.2.50,10.0.2.49,10.0.2.48,10.0.2.47,10.0.2.46,10.0.2.45,127.0.0.1
Copy Command:           /usr/bin/scp -rpB
NOTE:  no local jobs detected

diagnostics complete
#############################
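
To compare the second counts in the momctl output with the top output, here
is a rough Python sketch that converts them to days. The function names are
hypothetical helpers, not part of TORQUE, and it assumes the top TIME+ column
is in minutes:seconds form.

```python
SECONDS_PER_DAY = 86400


def seconds_to_days(seconds):
    """Convert a count of seconds to days."""
    return seconds / SECONDS_PER_DAY


def top_time_to_days(time_plus):
    """Convert a top TIME+ value of the form 'minutes:seconds' to days."""
    minutes, seconds = time_plus.split(":")
    return (int(minutes) * 60 + float(seconds)) / SECONDS_PER_DAY


last_server_msg = seconds_to_days(529523)  # "Last Msg From Server" (DeleteJob)
mom_active = seconds_to_days(4161213)      # "MOM active"
vasp_cpu = top_time_to_days("8879:07")     # TIME+ of PID 25254

print(f"last server message: {last_server_msg:.2f} days ago")
print(f"MOM active:          {mom_active:.2f} days")
print(f"vasp_gamma CPU time: {vasp_cpu:.2f} days")
```

The MOM active time works out to about 48 days, which matches the uptime top
reports. The last message from the server, a DeleteJob, arrived about 6.1
days ago, and each vasp_gamma process has accumulated about 6.2 days of CPU
time. The two figures are close, which may be worth checking against the
server logs, though the numbers alone do not show how the processes relate
to that message.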


I tried restarting the mom, but it still doesn't detect a job!


-- 
Rahul

