[torqueusers] jobs stuck in queue until I force execution with qrun

Christina Salls christina.salls at noaa.gov
Thu Feb 16 15:19:20 MST 2012


On Thu, Feb 16, 2012 at 4:05 PM, Gustavo Correa <gus at ldeo.columbia.edu> wrote:

> PS - For some diagnostics, you could also try '$TORQUE/bin/pbsnodes' on the server,
>
[root@wings ~]# pbsnodes
n001.default.domain
     state = free
     np = 1
     ntype = cluster
     status = rectime=1329430696,varattr=,jobs=,state=free,netload=42970654,gres=,loadave=0.03,ncpus=24,physmem=20463136kb,availmem=27788364kb,totmem=28655128kb,idletime=177266,nusers=1,nsessions=1,sessions=17382,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
     gpus = 0

n002.default.domain
     state = free
     np = 1
     ntype = cluster
     status = rectime=1329430653,varattr=,jobs=,state=free,netload=41152440,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31877036kb,totmem=32792076kb,idletime=177252,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
     gpus = 0

These look good, right?
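The `status` attribute that pbsnodes prints above is a comma-separated list of key=value pairs, which makes fields such as `state`, `jobs`, and `ncpus` easy to check programmatically. A minimal sketch, assuming the whole attribute is available as one unwrapped line (the archive wrapped it); the helper name `parse_pbsnodes_status` is hypothetical and not part of TORQUE:

```python
def parse_pbsnodes_status(status: str) -> dict:
    """Split a pbsnodes 'status' attribute into a key -> value dict (hypothetical helper).

    Assumes one line of comma-separated key=value pairs, as in the output above.
    Empty values such as 'jobs=' map to ''; values like 'nsessions=? 0' are kept verbatim.
    """
    fields = {}
    for item in status.strip().split(","):
        key, sep, value = item.partition("=")
        if sep:  # ignore any fragment that has no '='
            fields[key.strip()] = value.strip()
    return fields


# Status line for n001, copied from the pbsnodes output above.
status = (
    "rectime=1329430696,varattr=,jobs=,state=free,netload=42970654,gres=,"
    "loadave=0.03,ncpus=24,physmem=20463136kb,availmem=27788364kb,"
    "totmem=28655128kb,idletime=177266,nusers=1,nsessions=1,sessions=17382,"
    "uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,"
    "opsys=linux"
)
fields = parse_pbsnodes_status(status)
print(fields["state"], fields["ncpus"], repr(fields["jobs"]))  # prints: free 24 ''
```

Splitting on commas works here only because none of the values in this output contain a comma; the `uname` value contains spaces and `#` but no commas.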



> and '$TORQUE/sbin/momctl -d 3' on the compute nodes.
>

[root@n001 sbin]# momctl -d 3

Host: n001/n001.default.domain   Version: 2.5.9   PID: 3598
Server[0]: admin.default.domain (10.0.10.1:1023)
  Init Msgs Received:     2 hellos/2 cluster-addrs
  Init Msgs Sent:         6 hellos
  Last Msg From Server:   8595 seconds (DeleteJob)
  Last Msg To Server:     32 seconds
HomeDirectory:          /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (23252610 blocks available)
NOTE:  syslog enabled
MOM active:             176853 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
MemLocked:              TRUE  (mlock)
TCP Timeout:            20 seconds
Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:
 10.0.1.20,10.0.1.19,10.0.1.18,10.0.1.17,10.0.1.16,10.0.1.15,10.0.1.14,10.0.1.13,10.0.1.12,10.0.1.11,10.0.1.10,10.0.1.9,10.0.1.8,10.0.1.7,10.0.1.6,10.0.1.5,10.0.1.4,10.0.1.3,10.0.1.2,10.0.10.1,10.0.1.1,127.0.0.1
Copy Command:           /usr/bin/scp -rpB
NOTE:  no local jobs detected

diagnostics complete



> Gus Correa
>