[torqueusers] jobs stuck in queue until I force execution with qrun
Christina Salls
christina.salls at noaa.gov
Thu Feb 16 15:19:20 MST 2012
On Thu, Feb 16, 2012 at 4:05 PM, Gustavo Correa <gus at ldeo.columbia.edu> wrote:
> PS - For some diagnostic, you could also try '$TORQUE/bin/pbsnodes' on the
> server,
>
[root@wings ~]# pbsnodes
n001.default.domain
state = free
np = 1
ntype = cluster
status = rectime=1329430696,varattr=,jobs=,state=free,netload=42970654,gres=,loadave=0.03,ncpus=24,physmem=20463136kb,availmem=27788364kb,totmem=28655128kb,idletime=177266,nusers=1,nsessions=1,sessions=17382,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
gpus = 0
n002.default.domain
state = free
np = 1
ntype = cluster
status = rectime=1329430653,varattr=,jobs=,state=free,netload=41152440,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31877036kb,totmem=32792076kb,idletime=177252,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
gpus = 0
These look good, right?
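
For more than a couple of nodes, the same check could be scripted. What follows is only a minimal sketch, assuming the plain-text pbsnodes layout shown above with each attribute on a single line (node name on its own line, then "key = value" lines, with "status" holding comma-separated key=value pairs). The function names parse_pbsnodes, status_fields and np_vs_ncpus are hypothetical helpers, not part of TORQUE, and the sample text is an abbreviated copy of the n001 entry. It simply lists the configured np next to the ncpus that each MOM reports; whether a difference between the two matters depends on how the site wants jobs scheduled.

```python
def parse_pbsnodes(text):
    """Parse plain-text pbsnodes output into {node_name: {attribute: value}}.

    Assumption: every attribute sits on one line as 'key = value', and any
    non-empty line without ' = ' is a node name.
    """
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " in line:
            if current is None:
                continue  # attribute seen before any node name; ignore it
            key, value = line.split(" = ", 1)
            current[key.strip()] = value.strip()
        else:
            current = {}
            nodes[line] = current
    return nodes


def status_fields(status):
    """Split the comma-separated 'status' attribute into a dict."""
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:
            fields[key] = value
    return fields


def np_vs_ncpus(nodes):
    """Return (name, state, np, ncpus) per node; missing values become None."""
    rows = []
    for name, attrs in nodes.items():
        status = status_fields(attrs.get("status", ""))
        np_value = attrs.get("np")
        ncpus = status.get("ncpus")
        rows.append((
            name,
            attrs.get("state"),
            int(np_value) if np_value is not None else None,
            int(ncpus) if ncpus is not None else None,
        ))
    return rows


# Abbreviated copy of the n001 entry above (status shortened for readability).
sample = """n001.default.domain
     state = free
     np = 1
     ntype = cluster
     status = rectime=1329430696,state=free,loadave=0.03,ncpus=24,opsys=linux
     gpus = 0
"""

rows = np_vs_ncpus(parse_pbsnodes(sample))
for name, state, np_value, ncpus in rows:
    print(f"{name}: state={state} np={np_value} ncpus={ncpus}")
```

On a live server the text could be captured from the pbsnodes command itself (for example with Python's subprocess module) instead of the pasted sample.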
> and '$TORQUE/sbin/momctl -d 3' on the compute nodes.
>
[root@n001 sbin]# momctl -d 3
Host: n001/n001.default.domain Version: 2.5.9 PID: 3598
Server[0]: admin.default.domain (10.0.10.1:1023)
Init Msgs Received: 2 hellos/2 cluster-addrs
Init Msgs Sent: 6 hellos
Last Msg From Server: 8595 seconds (DeleteJob)
Last Msg To Server: 32 seconds
HomeDirectory: /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (23252610 blocks available)
NOTE: syslog enabled
MOM active: 176853 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
MemLocked: TRUE (mlock)
TCP Timeout: 20 seconds
Prolog: /var/spool/torque/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: 10.0.1.20,10.0.1.19,10.0.1.18,10.0.1.17,10.0.1.16,10.0.1.15,10.0.1.14,10.0.1.13,10.0.1.12,10.0.1.11,10.0.1.10,10.0.1.9,10.0.1.8,10.0.1.7,10.0.1.6,10.0.1.5,10.0.1.4,10.0.1.3,10.0.1.2,10.0.10.1,10.0.1.1,127.0.0.1
Copy Command: /usr/bin/scp -rpB
NOTE: no local jobs detected
diagnostics complete
> Gus Correa
>