[torqueusers] jobs get deferred

Naveed Near-Ansari naveed at caltech.edu
Mon Jun 1 14:26:58 MDT 2009


Hi, we keep having jobs get deferred on our cluster.  It seems that a job
is dispatched to a node, but the node rejects the connection, deferring
the job.  Typically, restarting the moms on the nodes, the pbs_server,
and maui gets things moving again.  This happens quite frequently,
though, and we would like to get to the bottom of it.

Does anyone know what may be happening here?

We get messages like these in the mom_logs on the compute node:


05/28/2009 19:35:37;0008;   pbs_mom;Job;process_request;request type
QueueJob from host hostname.local rejected (host not authorized)
05/28/2009 19:35:37;0080;   pbs_mom;Req;req_reject;Reject reply
code=15008(Access from host not allowed, or unknown host MSG=request not
authorized), aux=0, type=QueueJob, from PBS_Server at neuroecon.local
05/28/2009 19:36:47;0008;   pbs_mom;Job;process_request;request type
QueueJob from host hostname.local rejected (host not authorized)
05/28/2009 19:36:47;0080;   pbs_mom;Req;req_reject;Reject reply
code=15008(Access from host not allowed, or unknown host MSG=request not
authorized), aux=0, type=QueueJob, from PBS_Server at neuroecon.local


Here is the tracejob output:

...
05/28/2009 19:36:47  S    Job Modified at request of
maui at hostname.caltech.edu
05/28/2009 19:36:47  S    Job Run at request of
maui at hostname.caltech.edu
05/28/2009 19:36:47  S    Job Modified at request of
maui at hostname.caltech.edu
05/28/2009 19:36:47  A    queue=default
05/28/2009 19:36:47  S    send of job to compute-1-35 failed error =
15008
05/28/2009 19:36:47  S    unable to run job, MOM rejected/rc=1
....


And this is the typical momctl output for the affected node:

[root at compute-1-35 mom_logs]# momctl -h c1-35 -d 3

Host: compute-1-35.local/compute-1-35.local   Version: 2.3.6   PID: 7255
Server[0]: hostname.local (10.10.0.1:15001)
  Init Msgs Received:     0 hellos/1 cluster-addrs
  Init Msgs Sent:         3 hellos
  WARNING:  invalid attempt to connect from server 10.10.0.1:1023
(server not authorized)
  Last Msg From Server:   5646 seconds (CLUSTER_ADDRS)
  Last Msg To Server:     20 seconds
HomeDirectory:          /opt/torque/mom_priv
stdout/stderr spool directory: '/opt/torque/spool/' (10341234 blocks
available)
NOTE:  syslog enabled
MOM active:             5662 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
MemLocked:              TRUE  (mlock)
TCP Timeout:            20 seconds
Prolog:                 /opt/torque/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:
10.10.255.243,10.10.255.244,10.10.255.245,10.10.255.246,10.10.255.247,10.10.0.1,10.10.255.249,10.10.255.250,10.10.255.251,10.10.255.252,10.10.255.253,10.10.255.254,10.10.255.248,127.0.0.1
Copy Command:           /usr/bin/scp -rpB
NOTE:  no local jobs detected

diagnostics complete
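
One quick check on output like this is whether the server address reported
by the mom actually appears in its Trusted Client List. The following is a
minimal Python sketch, assuming the momctl output has been saved as text.
The function name check_trusted and the SAMPLE excerpt are illustrative
only, and the patterns match just the two fields as they are laid out in
the output above.

```python
import re

# Excerpt of the `momctl -d 3` output shown above (illustrative sample only).
SAMPLE = """\
Server[0]: hostname.local (10.10.0.1:15001)
Trusted Client List:
10.10.255.243,10.10.255.244,10.10.0.1,10.10.255.248,127.0.0.1
Copy Command:           /usr/bin/scp -rpB
"""


def check_trusted(momctl_text):
    """Return (server_ip, trusted_ips, is_trusted) parsed from momctl text.

    Hypothetical helper: it assumes the 'Server[n]:' and
    'Trusted Client List:' fields look like they do in the sample.
    """
    # Server line looks like: Server[0]: hostname.local (10.10.0.1:15001)
    server = re.search(
        r"Server\[\d+\]:\s+\S+\s+\((\d+\.\d+\.\d+\.\d+):\d+\)", momctl_text
    )
    # The address list may follow the label on the same line or the next one.
    trusted = re.search(r"Trusted Client List:\s*([\d.,\s]+)", momctl_text)
    if not server or not trusted:
        raise ValueError("expected fields not found in momctl output")

    server_ip = server.group(1)
    trusted_ips = {ip.strip() for ip in trusted.group(1).split(",") if ip.strip()}
    return server_ip, trusted_ips, server_ip in trusted_ips


if __name__ == "__main__":
    ip, trusted, ok = check_trusted(SAMPLE)
    print(ip, "trusted" if ok else "NOT trusted")
```

For the output above, the server address 10.10.0.1 is present in the
trusted list, so a missing list entry does not by itself explain the
"server not authorized" warning.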



