[torqueusers] Nodes that pbs reports are busy which are actually running a job

David Beer dbeer at adaptivecomputing.com
Wed Aug 11 15:55:15 MDT 2010



----- Original Message -----
> I have a node where pbsnodes reports the following:
> 
> eu044
> state = busy
> np = 8
> properties = INTEL,10GigE
> ntype = cluster
> status = opsys=linux,uname=Linux eu044 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64,sessions=25252,nsessions=1,nusers=1,idletime=4160964,totmem=24815792kb,availmem=103236kb,physmem=16429872kb,ncpus=8,loadave=9.00,netload=174910266926482,state=busy,jobs=,varattr=,rectime=1281562538
> 
> Since it doesn't show "job-exclusive" I assumed it means it doesn't
> have a user job on it. But if I login to eu044 and do a top I see:
> 
> ######################
> top - 16:38:27 up 48 days, 3:53, 1 user, load average: 9.00, 9.00, 9.00
> Tasks: 155 total, 7 running, 148 sleeping, 0 stopped, 0 zombie
> Cpu(s): 6.0%us, 93.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 16429872k total, 16350560k used, 79312k free, 7336k buffers
> Swap: 8385920k total, 8385920k used, 0k free, 14416k cached
> 
>   PID USER    PR NI  VIRT  RES SHR S  %CPU %MEM    TIME+ COMMAND
> 25254 gwpeng  25  0 2224m 817m 176 S 100.2  5.1  8879:07 vasp_gamma
> 25253 gwpeng  25  0 2307m 861m 176 R  99.9  5.4  8879:10 vasp_gamma
> 25255 gwpeng  25  0 2334m 1.4g 180 S  99.9  8.9  8879:20 vasp_gamma
> 25256 gwpeng  25  0 2334m 1.4g 176 S  99.9  8.7  8879:19 vasp_gamma
> 25257 gwpeng  25  0 2292m 919m 176 R  99.9  5.7  8879:15 vasp_gamma
> 25258 gwpeng  25  0 2333m 730m 176 R  99.9  4.6  8879:40 vasp_gamma
> 25259 gwpeng  25  0 2326m 942m 176 R  99.9  5.9  8879:13 vasp_gamma
> 25260 gwpeng  25  0 2204m 843m 176 R  99.9  5.3  8879:18 vasp_gamma
> #############################
> 
> These are 8 core machines so I can understand that PBS reports busy
> because the load average is 9 (>8).
> 
> But why does pbsnodes not list the node as job-exclusive as well? It
> doesn't even seem to report a job number for that node.
> 
> The mom seems to be running on the node:
> 
> [root@eu044 ~]# service pbs status
> pbs_mom is pid 3810
> 
> But a momctl reveals that the mom doesn't think there is a local job:
> 
> ##############################
> [root@eu044 ~]# /opt/torque/sbin/momctl -d 3
> 
> Host: eu044/eu044 Version: 2.4.5 PID: 3810
> Server[0]: euadmin (10.0.3.2:1023)
> Init Msgs Received: 5 hellos/2 cluster-addrs
> Init Msgs Sent: 11 hellos
> Last Msg From Server: 529523 seconds (DeleteJob)
> Last Msg To Server: 8 seconds
> HomeDirectory: /var/spool/torque/mom_priv
> stdout/stderr spool directory: '/var/spool/torque/spool/' (1834324 blocks available)
> NOTE: syslog enabled
> MOM active: 4161213 seconds
> Check Poll Time: 45 seconds
> Server Update Interval: 45 seconds
> LogLevel: 4 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model: TCP
> MemLocked: TRUE (mlock)
> Prolog: /var/spool/torque/mom_priv/prologue (disabled)
> Alarm Time: 0 of 10 seconds
> Trusted Client List:
> 10.0.0.43,10.0.0.42,10.0.0.41,10.0.0.40,10.0.0.39,10.0.0.38,10.0.0.37,10.0.0.36,10.0.0.35,10.0.0.34,10.0.0.33,10.0.0.32,10.0.0.31,10.0.0.30,10.0.0.29,10.0.0.28,10.0.0.27,10.0.0.26,10.0.0.25,10.0.0.24,10.0.0.23,10.0.0.22,10.0.0.21,10.0.0.20,10.0.0.19,10.0.0.18,10.0.0.17,10.0.0.16,10.0.0.15,10.0.0.14,10.0.0.13,10.0.0.12,10.0.0.11,10.0.0.10,10.0.0.9,10.0.0.8,10.0.0.7,10.0.0.6,10.0.0.5,10.0.0.4,10.0.0.3,10.0.0.2,10.0.0.1,10.0.2.61,10.0.2.60,10.0.2.59,10.0.2.58,10.0.2.57,10.0.2.56,10.0.2.55,10.0.2.54,10.0.2.53,10.0.2.52,10.0.2.51,10.0.2.50,10.0.2.49,10.0.2.48,10.0.2.47,10.0.2.46,10.0.2.45,127.0.0.1
> Copy Command: /usr/bin/scp -rpB
> NOTE: no local jobs detected
> 
> diagnostics complete
> #############################
> 
> 
> I tried restarting the mom but it still doesn't detect a job!
> 
> 
> --
> Rahul

Rahul,

pbsnodes doesn't report a job because TORQUE doesn't know of any job on that node. pbsnodes lists a job only if the server believes one is present, and momctl can tell you for sure whether the MOM has a local job. At this point you should probably run top and see what is using all of the resources.
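
As a rough illustration of the check described above, here is a minimal Python sketch that parses the status string from the pbsnodes output quoted earlier and flags a node that is busy and loaded past its core count while listing no jobs. The helper names parse_status and looks_orphaned are hypothetical, not part of TORQUE. The sketch assumes the status string is a comma-separated list of key=value pairs with no commas inside values (true for the sample here, not guaranteed in general), and it borrows the original poster's assumption that "busy" corresponds to a load average above ncpus.

```python
def parse_status(status: str) -> dict:
    """Split a pbsnodes 'status' string into a dict of key -> value.

    Assumption: fields are comma-separated key=value pairs and no value
    contains a comma. Items without '=' are ignored.
    """
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def looks_orphaned(fields: dict) -> bool:
    """True when the node is busy and overloaded but lists no jobs.

    Hypothetical heuristic: 'overloaded' means loadave > ncpus, which is
    the original poster's reading of the busy state, not a TORQUE rule.
    """
    no_jobs = fields.get("jobs", "") == ""
    overloaded = float(fields.get("loadave", "0")) > int(fields.get("ncpus", "1"))
    return fields.get("state") == "busy" and no_jobs and overloaded


# Status string copied from the pbsnodes output for eu044 quoted above.
status = (
    "opsys=linux,uname=Linux eu044 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64,"
    "sessions=25252,nsessions=1,nusers=1,idletime=4160964,totmem=24815792kb,"
    "availmem=103236kb,physmem=16429872kb,ncpus=8,loadave=9.00,"
    "netload=174910266926482,state=busy,jobs=,varattr=,rectime=1281562538"
)

fields = parse_status(status)
print(fields["sessions"], looks_orphaned(fields))
```

For eu044 the check comes out true: the state is busy, loadave is 9.00 against ncpus=8, and the jobs field is empty. The sessions value (25252) sits immediately below the PIDs of the vasp_gamma processes in the top output, so it is a reasonable starting point when looking for what is running outside TORQUE's knowledge.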

-- 
David Beer | Senior Software Engineer
Adaptive Computing

