[torqueusers] pbs_mom obituary not delivered

Luc Vereecken Luc.Vereecken at chem.kuleuven.be
Thu Dec 3 15:30:03 MST 2009


Hi All,

I'm struggling to get my torque installation running after an upgrade
of my server. I'm currently trying the 2.4.2b1 version, and as far as
I can see it is working fine: I can start jobs, and I can run pbsnodes
and pbs_iff on the headnode and the compute nodes. Except that the
jobs, once they finish, never leave the queue. These are the relevant
lines of the pbs_mom log (real server name replaced by "server"):
---------------
12/03/2009 22:57:19;0008;   pbs_mom;Job;scan_for_terminated;entered
12/03/2009 22:57:19;0080;   pbs_mom;Svr;mom_get_sample;proc_array load started
12/03/2009 22:57:19;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=76
12/03/2009 22:57:19;0080;   pbs_mom;n/a;cput_sum;proc_array loop start - jobid = 43168.server
12/03/2009 22:57:19;0080;   pbs_mom;n/a;mem_sum;proc_array loop start - jobid = 43168.server
12/03/2009 22:57:19;0080;   pbs_mom;n/a;resi_sum;proc_array loop start - jobid = 43168.server
12/03/2009 22:57:19;0080;   pbs_mom;Job;43168.server;checking job w/subtask pid=23716 (child pid=23716)
12/03/2009 22:57:19;0080;   pbs_mom;Job;43168.server;found match with job subtask for pid=23716
12/03/2009 22:57:19;0080;   pbs_mom;Req;post_epilogue;preparing obit message for job 43168.server
12/03/2009 22:57:19;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 1023 in client_to_svr - connection refused
----------------------------
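
In case it is useful, this is the kind of quick check I can run from
the compute node (just a netcat sketch, not what client_to_svr does
internally; 15001 is, as far as I know, the default pbs_server TCP
port, so adjust if yours differs):
---------------
# from node0505: does a plain TCP connect toward the server get refused?
nc -vz 172.16.1.1 1023     # the port named in the mom error
nc -vz 172.16.1.1 15001    # default pbs_server TCP port (assumption)
---------------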

The server logs don't show anything at the corresponding times, so 
apparently the job obituary never reaches the server. A mom_ctl -d 3 
on the compute node reveals:
---------------

Host: node0505/node0505   Version: 2.4.2b1   PID: 9913
Server[0]: 172.16.1.1 (172.16.1.1:1023)
   Init Msgs Received:     2 hellos/2 cluster-addrs
   Init Msgs Sent:         6 hellos
   Last Msg From Server:   9 seconds (StatusJob)
   Last Msg To Server:     10 seconds
HomeDirectory:          /var/torque/mom_priv
stdout/stderr spool directory: '/var/torque/spool/' (35407738 blocks available)
ConfigVersion:          0
NOTE:  syslog enabled
MOM active:             471487 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               7 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
MemLocked:              TRUE  (mlock)
TCP Timeout:            20 seconds
Prolog:                 /var/torque/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:    172.16.2.5,172.16.2.4,172.16.2.3,172.16.2.2,172.16.2.1,172.16.2.6,172.16.3.13,172.16.3.12,172.16.3.11,172.16.3.10,172.16.3.9,172.16.3.8,172.16.3.7,172.16.3.6,172.16.3.5,172.16.3.4,172.16.5.4,172.16.5.3,172.16.5.2,172.16.5.1,172.16.4.11,172.16.4.10,172.16.4.9,172.16.4.8,172.16.4.7,172.16.4.6,172.16.4.5,172.16.4.4,172.16.4.3,172.16.4.2,172.16.4.1,172.16.3.3,172.16.3.2,172.16.3.1,172.16.1.1,172.16.5.12,172.16.5.14,172.16.5.11,172.16.5.10,172.16.5.9,172.16.5.8,172.16.5.7,172.16.5.6,172.16.5.5,127.0.0.1
Copy Command:           /usr/bin/scp -rpB
job[43168.server]  state=OBIT  sidlist=
Assigned CPU Count:     2

diagnostics complete
-----------------
i.e. the job is stuck in the OBIT (obituary) state. The server IP
(172.16.1.1) is in the mom's Trusted Client List.
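
On the server side I can also check what is actually bound on those
ports (a sketch; -p needs root, and the grep patterns are only my
guess at how the processes show up):
---------------
# on the server: which TCP/UDP ports pbs_server holds open
netstat -lnp | grep pbs_server
# and whether anything at all is bound on port 1023
netstat -anp | grep ':1023 '
---------------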

All this seems to indicate that pbs_mom can't connect to port 1023 of
the pbs_server to deliver the job obituary. However, when I trace the
network traffic between the relevant node and the server, I do observe
traffic between the server's port 1023 and the pbs_mom, so I'm not
sure why a connection from the mom to that port would be refused for
the obituary. More specifically, I see the pbs_mom sending UDP status
messages containing opsys, uname, etc. to port 1023 on the server, and
since pbsnodes reflects those status messages, I assume they actually
make it into the server through that port. The status is indeed
updated, as I see the idletime and the size of the filesystem change.
An example of "tcpdump -A" output for such a status message:
----------
23:15:46.440034 IP node0505.pbs_resmom > server.1023: UDP, length 386
E..... at .@..(........:.......+4+1+42+11opsys=linux2+80uname=Linux node0505 2.6.22.1
-----------
I don't know if pbs_server is "expecting" these messages to come in
and opens the port for them, but as far as I can tell from the
traffic there is no communication initiated from the server side, only
the UDP message from the mom followed by a 26-byte acknowledgement
from the pbs_server. Presumably these status updates travel over the
RPP/UDP path while the obit in post_epilogue opens a separate TCP
connection via client_to_svr, which might explain why one gets through
and the other doesn't, but I'm not sure about that.
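
To separate any TCP obit attempt from that routine UDP status traffic,
I can also narrow the capture like this (a sketch; "eth0" is just a
placeholder for whichever interface carries the 172.16 network):
---------------
# on the compute node: only TCP SYN/RST packets to/from the server,
# which should show the obit connection attempts and any refusals
tcpdump -i eth0 -nn 'host 172.16.1.1 and tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'
---------------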

I tried turning on mom_job_sync, as suggested on the mailing list for
an earlier version, but that didn't help. acl_host_enable is set to
false. The problem persists if I turn off the firewall on the server
node. The server node is dual-homed, and I tried specifying
"serverhost server" in the torque.cfg file in the torque directory to
force it to use the proper interface, but as expected to no avail,
given that it was already communicating on that interface for all the
other traffic.
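
For reference, these are the settings as I tried them (the qmgr syntax
is from memory, and I'm not certain about the case of the torque.cfg
keyword):
---------------
# on the server
qmgr -c "set server mom_job_sync = True"
qmgr -c "list server"        # to verify acl_host_enable = False etc.

# torque.cfg (in the torque directory)
SERVERHOST server
---------------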

Any idea what might cause this, and more importantly, how to solve it?
I can manually delete the jobs with momctl -c "jobID" on the compute
node, as well as the hard way on the server, but that is obviously not
something I want to do for every job I submit :-)
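
(For completeness, the per-job cleanup I mentioned looks like this on
the compute node, after which I check with qstat on the server whether
the job has finally left the queue:)
---------------
# on the compute node, for each stuck job
momctl -c 43168.server
# on the server
qstat 43168
---------------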

Thanks for any help,
Luc


