[torqueusers] pbs_mom obituary not delivered
Luc Vereecken
Luc.Vereecken at chem.kuleuven.be
Thu Dec 3 15:30:03 MST 2009
Hi All,
I'm struggling to get my torque installation running after an upgrade
of my server. I'm currently trying the 2.4.2b1 version, and as
far as I can see, it is mostly working fine: I can start jobs, and I can
run pbsnodes and pbs_iff on the headnode and compute nodes. The one
problem is that jobs, once they finish, don't leave the queue. These are the
relevant lines of the pbs_mom logs (real server name replaced by "server"):
---------------
12/03/2009 22:57:19;0008; pbs_mom;Job;scan_for_terminated;entered
12/03/2009 22:57:19;0080; pbs_mom;Svr;mom_get_sample;proc_array load started
12/03/2009 22:57:19;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=76
12/03/2009 22:57:19;0080; pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 43168.server
12/03/2009 22:57:19;0080; pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 43168.server
12/03/2009 22:57:19;0080; pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 43168.server
12/03/2009 22:57:19;0080; pbs_mom;Job;43168.server;checking job w/subtask pid=23716 (child pid=23716)
12/03/2009 22:57:19;0080; pbs_mom;Job;43168.server;found match with job subtask for pid=23716
12/03/2009 22:57:19;0080; pbs_mom;Req;post_epilogue;preparing obit message for job 43168.server
12/03/2009 22:57:19;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 1023 in client_to_svr - connection refused
----------------------------
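For anyone who wants to pull these failures out of a longer mom log, here is a minimal Python sketch. It assumes each record is a single line of six ';'-separated fields, as in the excerpt above once the mail-client line wraps are joined. The field names and the helper names (parse_mom_log_line, connect_failures) are my own labels, not anything defined by torque. The "(115)" in the message is the errno; on Linux, 115 is EINPROGRESS, whose text is "Operation now in progress".

```python
import re

# Sample record copied from the log excerpt, with the line wraps joined.
SAMPLE = ("12/03/2009 22:57:19;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation "
          "now in progress (115) in post_epilogue, cannot connect to port 1023 "
          "in client_to_svr - connection refused")

# Field names are illustrative labels, not official torque terminology.
FIELDS = ("timestamp", "event", "daemon", "obj_class", "obj_name", "message")


def parse_mom_log_line(line):
    """Split one pbs_mom log record into its six ';'-separated fields."""
    parts = line.split(";", 5)
    if len(parts) != 6:
        return None  # wrapped continuation line or unrelated text
    return dict(zip(FIELDS, (p.strip() for p in parts)))


def connect_failures(lines):
    """Yield (timestamp, errno, port) for each 'cannot connect' error record."""
    pattern = re.compile(r"\((\d+)\).*cannot connect to port (\d+)")
    for line in lines:
        rec = parse_mom_log_line(line)
        if rec is None or not rec["message"].startswith("LOG_ERROR"):
            continue
        match = pattern.search(rec["message"])
        if match:
            yield rec["timestamp"], int(match.group(1)), int(match.group(2))
```

Calling list(connect_failures(open(path))) on a mom log file (path being wherever your mom logs live) would list every failed obit delivery with its errno and port.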
The server logs don't show anything at the corresponding times, so
apparently the job obituary never reaches the server. Running
momctl -d 3 on the compute node reveals:
---------------
Host: node0505/node0505 Version: 2.4.2b1 PID: 9913
Server[0]: 172.16.1.1 (172.16.1.1:1023)
Init Msgs Received: 2 hellos/2 cluster-addrs
Init Msgs Sent: 6 hellos
Last Msg From Server: 9 seconds (StatusJob)
Last Msg To Server: 10 seconds
HomeDirectory: /var/torque/mom_priv
stdout/stderr spool directory: '/var/torque/spool/' (35407738 blocks available)
ConfigVersion: 0
NOTE: syslog enabled
MOM active: 471487 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 7 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
MemLocked: TRUE (mlock)
TCP Timeout: 20 seconds
Prolog: /var/torque/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: 172.16.2.5,172.16.2.4,172.16.2.3,172.16.2.2,172.16.2.1,172.16.2.6,172.16.3.13,172.16.3.12,172.16.3.11,172.16.3.10,172.16.3.9,172.16.3.8,172.16.3.7,172.16.3.6,172.16.3.5,172.16.3.4,172.16.5.4,172.16.5.3,172.16.5.2,172.16.5.1,172.16.4.11,172.16.4.10,172.16.4.9,172.16.4.8,172.16.4.7,172.16.4.6,172.16.4.5,172.16.4.4,172.16.4.3,172.16.4.2,172.16.4.1,172.16.3.3,172.16.3.2,172.16.3.1,172.16.1.1,172.16.5.12,172.16.5.14,172.16.5.11,172.16.5.10,172.16.5.9,172.16.5.8,172.16.5.7,172.16.5.6,172.16.5.5,127.0.0.1
Copy Command: /usr/bin/scp -rpB
job[43168.gweyring] state=OBIT sidlist=
Assigned CPU Count: 2
diagnostics complete
-----------------
i.e. the job is in the OBIT (obituary) state. The server IP
(172.16.1.1) is in the mom's Trusted Client List.
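The same check can be scripted when there are many nodes to compare. This is only a sketch: it assumes the diagnostics text has the "Server[n]:" and "Trusted Client List:" layout shown above, the helper name trusted_servers is made up for illustration, and the sample is abbreviated to a few addresses from the real list.

```python
import re

# Abbreviated sample in the layout of the diagnostics output shown above.
SAMPLE = """Server[0]: 172.16.1.1 (172.16.1.1:1023)
Trusted Client
List:
172.16.2.5,172.16.1.1,127.0.0.1
"""


def trusted_servers(diag_text):
    """Map each Server[n] address to True if it is in the trusted client list."""
    servers = re.findall(r"^Server\[\d+\]:\s*(\S+)", diag_text, flags=re.M)
    # \s+ also matches a newline, so a wrapped "Trusted Client / List:" works.
    match = re.search(r"Trusted Client\s+List:\s*(\S+)", diag_text)
    trusted = set(match.group(1).split(",")) if match else set()
    return {ip: ip in trusted for ip in servers}
```

An address that maps to False would point at a trust problem; here the server is trusted, so the cause has to be elsewhere.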
All this seems to indicate that pbs_mom can't connect to port 1023 of
the pbs_server to deliver the job obituary. However, tracing the
network traffic between the relevant node and the server, I do observe
traffic between port 1023 on the server and the pbs_mom, so I'm not sure
why a connection from the mom to port 1023 would be refused for
the obituary. More specifically, I see the pbs_mom sending UDP status
messages containing opsys, uname, etc. to port 1023 on the server, and
since pbsnodes reflects those status messages, I am assuming that
they actually make it into the server through this
port. The status is being updated, as I see the idletime and the size of
the filesystem change. Example of "tcpdump -A" output for such a status message:
----------
23:15:46.440034 IP node0505.pbs_resmom > server.1023: UDP, length 386
E..... at .@..(........:.......+4+1+42+11opsys=linux2+80uname=Linux
node0505 2.6.22.1
-----------
I don't know if pbs_server is "expecting" these messages to come in
and opens the port for it, but as far as I can tell from the traffic,
there is no communication initiation from the server, only the UDP
message from the mom followed by a 26 byte acknowledgement from the pbs_server.
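One thing worth separating here: the status updates are UDP, whereas "Operation now in progress" (EINPROGRESS) is what a non-blocking TCP connect reports, so UDP traffic arriving at a port says nothing about whether a TCP connection to it is accepted. Below is a small Python probe to run from the compute node. It is a sketch under assumptions: the function name probe_tcp is hypothetical, it assumes the obituary really does travel over TCP (which the error suggests but I have not confirmed in the source), and the port to test is an open question. The log mentions 1023, while pbs_server conventionally listens on TCP 15001, so trying both may be informative.

```python
import errno
import socket


def probe_tcp(host, port, timeout=2.0):
    """Try a plain TCP connect and classify the outcome.

    Returns 'open', 'refused', 'timeout', or 'error:<errno name>'.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "refused"  # nothing is listening on that TCP port
    except socket.timeout:
        return "timeout"  # packets silently dropped, e.g. by a firewall
    except OSError as exc:
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


# Example (host name and ports are placeholders for your own setup):
#   for port in (1023, 15001):
#       print(port, probe_tcp("server", port))
```

A result of "refused" with the firewall off would mean no process is accepting TCP connections on that port at all, which is a different problem from a filtered port ("timeout").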
I tried turning on mom_job_sync, as suggested on the mailing list for an
earlier version, but that didn't help. acl_host_enable is set to
false. The problem persists if I turn off the firewall on the server node.
The server node is dual-homed (two network interfaces), and I tried
specifying "serverhost server" in the torque.cfg file in the torque
directory to force it to use the proper interface, but as expected that
made no difference, given that it was already communicating on that
interface for all the other traffic.
Any idea what might cause this and, more importantly, how to solve
it? I can manually delete the jobs with momctl -c "jobID" on
the compute node, as well as the hard way on the server, but that
is obviously not something I want to do for every job I submit :-)
Thanks for any help,
Luc