[torqueusers] dead mom daemon

Nicolas Ferré nicolas.ferre at univ-provence.fr
Fri Jun 6 05:12:25 MDT 2008


Hi,

Sometimes the mom daemon (2.3.1-snap.200804241117) dies without leaving any
message in its own log file, as far as I can see. In the server log, I can see:

06/06/2008 10:24:09;0004;PBS_Server;Svr;svr_connect;attempting connect to
server 2472458686 port 15002
06/06/2008 10:24:09;0001;PBS_Server;Svr;PBS_Server;Operation now in progress
(115) in send_job, send_job commit failed, rc=15031 (End of File)
06/06/2008 10:24:09;0001;PBS_Server;Svr;PBS_Server;Operation now in progress
(115) in send_job, child commit request timed-out for job
9849.epstein.up.univ-mrs.fr, increase tcp_timeout?
06/06/2008 10:24:09;0004;PBS_Server;Svr;svr_connect;attempting connect to
server 2472458686 port 15002
06/06/2008 10:24:09;0004;PBS_Server;Svr;svr_connect;cannot connect to server
port 15002 - cannot establish connection (cannot bind to port 1023 in
client_to_svr - connection refused) - time=0 seconds
06/06/2008 10:24:09;0004;PBS_Server;Svr;WARNING;ALERT: unable to contact
node epstein
06/06/2008 10:24:09;0100;PBS_Server;Req;;Type ModifyJob request received
from root at epstein.up.univ-mrs.fr, sock=10
06/06/2008 10:24:09;0008;PBS_Server;Job;9849.epstein.up.univ-mrs.fr;Job
Modified at request of root at epstein.up.univ-mrs.fr
06/06/2008 10:24:09;0004;PBS_Server;Svr;svr_connect;attempting connect to
server 2472458686 port 15002
06/06/2008 10:24:09;0004;PBS_Server;Svr;svr_connect;cannot connect to server
port 15002 - cannot establish connection (cannot bind to port 1023 in
client_to_svr - connection refused) - time=0 seconds
06/06/2008 10:24:09;0001;PBS_Server;Req;;Server could not connect to MOM
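
The "server 2472458686" in the svr_connect lines looks like an IPv4 address
printed as a plain 32-bit integer. Assuming that reading is right, a small
sketch like the following turns it back into dotted-quad form and checks
whether anything is listening on the MOM port. The helper names decode_addr
and port_open are my own, not part of TORQUE.

```python
import ipaddress
import socket


def decode_addr(n: int) -> str:
    """Render a 32-bit integer as a dotted-quad IPv4 address.

    Assumption: the server logs the address as an unsigned integer
    with the most significant byte first.
    """
    return str(ipaddress.IPv4Address(n))


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    addr = decode_addr(2472458686)
    print(addr)
    # 15002 is the port shown in the svr_connect log lines above.
    print(port_open(addr, 15002))
```

If port_open returns False while the node itself is reachable, that would be
consistent with the mom daemon no longer running on that host.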

What should I do to diagnose the problem?

Another thing: when the mom daemon is dead, jobs that are already running
continue to run. However, if I restart the mom daemon, they are killed
immediately and placed back in the queue as if they had never run. How can I
ensure that running jobs continue to run when the mom daemon is restarted?

Nicolas Ferré,
CRCMM (Marseille, France)