[torqueusers] pbs_mom and server issues
scoggins
jscoggins at lbl.gov
Thu Aug 28 18:36:57 MDT 2008
Torque 2.1.3 problem:
I am getting the following error message when I qsub a job:
Message[0] job cannot be started on RM sched-00 - cannot set hostlist: cannot set job '98.sched-00 ' attr 'Resource_List:neednodes' to 'n0000.ikea:ppn=4+n0001.ikea:ppn=4' (rc: 15070 'Server could not connect to MOM')
I cannot figure out why.
I ran pbs_iff -t n0000.ikea 15002 and got the following error:
...
poll([{fd=3, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(3, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
read(3, "+2+15+15005+0+72+41Unknown reque"..., 262144) = 60
write(2, "pbs_iff: Unknown request MSG=can"..., 51pbs_iff: Unknown request MSG=cannot decode message
) = 51
exit_group(1) = ?
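The trace above shows that pbs_iff did read a reply from the node, so a TCP connection was made and the failure happened when decoding the message. A plain reachability check can help separate those two cases on other nodes. The following is a minimal sketch, not part of Torque: `can_connect` is a hypothetical helper, and it tests only whether a TCP connection opens, not whether the MOM speaks a protocol the server understands.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration only. It says nothing about
    whether the peer can decode PBS requests.
    """
    try:
        # create_connection resolves the name and connects; the socket
        # is closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False
```

Run from the server host, `can_connect("n0000.ikea", 15002)` uses the same node name and port as the pbs_iff command above. A result of True with the decode error still present would point away from basic networking.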
PBS commands output:
pbsnodes -a n0000.ikea
n0000.ikea
state = free
np = 8
properties = ikea,quadcore
ntype = cluster
status = opsys=linux,uname=Linux n0000.ikea 2.6.18-92.1.10.el5 #1 SMP Tue Aug 5 07:42:41 EDT 2008 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=784246,totmem=48453372kb,availmem=48342900kb,physmem=16443868kb,ncpus=8,loadave=0.00,netload=98910831,state=free,jobs=,varattr=,rectime=1219968506
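The status value in the pbsnodes output is a comma-separated list of key=value pairs, which is hard to read once it wraps. A small sketch for splitting it into a dictionary follows. `parse_status` is a hypothetical helper, and it assumes no value contains a comma, which holds for the output shown here but may not hold in general.

```python
def parse_status(status):
    """Split a pbsnodes 'status' value into a dict of key -> value.

    Assumes values contain no commas (true for the sample in this post).
    """
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:  # ignore fragments that have no '=' at all
            fields[key.strip()] = value.strip()
    return fields


# Part of the status value reported for n0000.ikea in this post.
sample = (
    "opsys=linux,sessions=? 0,nsessions=? 0,nusers=0,idletime=784246,"
    "physmem=16443868kb,ncpus=8,loadave=0.00,state=free,jobs=,"
    "rectime=1219968506"
)

info = parse_status(sample)
print(info["state"], info["ncpus"], info["rectime"])
```

Values are kept as strings, so fields such as physmem keep their kb suffix and empty fields such as jobs come back as empty strings.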
momctl -h n0000.ikea -d 9
Host: n0000.ikea/n0000.ikea Version: 2.3.1 PID: 27784
Server[0]: sched-00 (10.0.0.30:15001)
Init Msgs Received: 2 hellos/2 cluster-addrs
Init Msgs Sent: 3 hellos
Last Msg From Server: 2620 seconds (CLUSTER_ADDRS)
Last Msg To Server: 10 seconds
HomeDirectory: /var/spool/torque/ikea/n0000/mom_priv
stdout/stderr spool directory: '/var/spool/torque/ikea/n0000/spool/' (3472979 blocks available)
NOTE: syslog enabled
HomeDirectory: /var/spool/torque/ikea/n0000/mom_priv
MOM active: 6142 seconds
Server Update Interval: 45 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
MemLocked: TRUE (mlock)
TCP Timeout: 20 seconds
Prolog: /var/spool/torque/ikea/n0000/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List:
10.0.2.9,10.0.2.7,10.0.2.6,10.0.2.5,10.0.2.4,10.0.2.3,10.0.2.2,10.0.2.1,10.0.2.0,10.0.0.30,10.0.7.18,10.0.7.17,10.0.7.16,10.0.7.15,10.0.7.14,10.0.7.13,10.0.7.12,10.0.7.11,10.0.7.10,10.0.7.9,10.0.7.8,10.0.7.7,10.0.7.6,10.0.7.5,10.0.7.4,10.0.7.3,10.0.7.2,10.0.7.1,10.0.7.0,127.0.0.1
Copy Command: /usr/bin/scp -rpB
NOTE: no local jobs detected
diagnostics complete
Here is what the server_logs are saying:
08/28/2008 17:33:09;0001;PBS_Server;Req;;Server could not connect to MOM
08/28/2008 17:33:09;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from root@sched-00
08/28/2008 17:33:09;0008;PBS_Server;Job;101.sched-00;Job Modified at request of root@sched-00
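The server log entries above are semicolon-delimited, so the reject code can be pulled out mechanically when scanning a longer log. The sketch below assumes six fields per entry, as in the three entries shown. The field names in `LogRecord` are labels chosen for this example and are not taken from the Torque documentation, and `parse_log_line` and `reject_code` are hypothetical helpers.

```python
import re
from typing import NamedTuple


class LogRecord(NamedTuple):
    # Field names are illustrative labels, not official Torque terms.
    timestamp: str
    event: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


def parse_log_line(line):
    """Split one log entry into its six semicolon-separated fields."""
    # maxsplit=5 keeps any ';' inside the message text intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        raise ValueError("expected 6 fields, got %d" % len(parts))
    return LogRecord(*parts)


_REJECT = re.compile(r"Reject reply code=(\d+)\(([^)]*)\)")


def reject_code(record):
    """Return (code, text) for a 'Reject reply' entry, else None."""
    match = _REJECT.search(record.message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

Each log entry has to be passed in as a single unwrapped line. Applied to the req_reject entry above, `reject_code` yields the code 15070 together with its text, and entries without a "Reject reply" message yield None.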
Jobs stay queued and checkjob shows:
BLOCK MSG: job hold active - Batch (recorded at last scheduling iteration)
Message[0] job cannot be started on RM sched-00.scs.lbl.gov - cannot set hostlist: cannot set job '101.sched-00' attr 'Resource_List:neednodes' to 'n0000.ikea:ppn=4+n0001.ikea:ppn=4' (rc: 15070 'Server could not connect to MOM')
Message[1] cannot start job on reserved resources - job cannot be started on RM sched-00 - cannot set hostlist: cannot set job '101.sched-00' attr 'Resource_List:neednodes' to 'n0000.ikea:ppn=4+n0001.ikea:ppn=4' (rc: 15070 'Server could not connect to MOM')
Any help would be much appreciated.
Thanks
Jackie