[torqueusers] strange problem after unscheduled node reboot

Arnau Bria arnaubria at pic.es
Thu Mar 17 04:20:10 MDT 2011


Hi all,

Last night part of our farm was rebooted because of a power cut.

Since then, some of the rebooted nodes have been accepting jobs but not
executing them, so all their jobs ended up in W state and the nodes
became "black holes".


If I run momctl against one of those nodes:

# momctl -h tditaller028.pic.es -d 3
ERROR:    query[0] 'diag3' failed on tditaller028.pic.es (errno=0-Success: 5-Input/output error)

but it is not marked as down...
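
A node in this state can be spotted from the momctl output alone. What
follows is only a minimal sketch, assuming that a failed query always
prints a line starting with "ERROR:" like the one above. The function
name diag_failed is hypothetical, and the exit status of momctl is
deliberately not relied upon.

```shell
#!/bin/sh
# Hypothetical helper: reads momctl output on stdin and returns 0 if it
# contains an ERROR line of the form shown above, non-zero otherwise.
diag_failed() {
    grep -q '^ERROR:.*failed on'
}

# Usage sketch (replace the node list with your own):
# for node in tditaller028.pic.es; do
#     if momctl -h "$node" -d 3 2>&1 | diag_failed; then
#         echo "$node: mom diagnostics failed"
#     fi
# done
```

Matching on the output text is a choice made here because only the
output, not the return code, is visible in the session above.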

and its logs, starting from the power cut, show:

#  /var/spool/pbs/mom_logs/20110317

[power cut & reboot] some network problems:
03/17/2011 03:56:43;0002;   pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1291378489
03/17/2011 03:56:43;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.3, loglevel = 0
03/17/2011 03:57:23;0002;   pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: cannot open rpp connection to pbs03.pic.es, errno=2, hostname resolution for 'pbs03.pic.es' failed, errno=2 (check /etc/hosts file?)
03/17/2011 03:57:23;0002;   pbs_mom;n/a;mom_server_check_connection;unable to establish/restore connection to server pbs03.pic.es (failcount=1, retry in 2 seconds)
03/17/2011 03:57:41;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, 
03/17/2011 03:58:22;0002;   pbs_mom;n/a;mom_server_check_connection;unable to establish/restore connection to server pbs03.pic.es (failcount=2, retry in 4 seconds)
03/17/2011 03:58:23;0008;   pbs_mom;Job;15791881.pbs03.pic.es;job was terminated
03/17/2011 03:58:24;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::No route to host (113) in scan_for_exiting, cannot connect to port 679 in client_to_svr - errno:113 No route to host
03/17/2011 03:59:05;0002;   pbs_mom;n/a;mom_server_check_connection;unable to establish/restore connection to server pbs03.pic.es (failcount=3, retry in 8 seconds)
03/17/2011 03:59:24;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, 
03/17/2011 04:00:05;0002;   pbs_mom;n/a;mom_server_check_connection;unable to establish/restore connection to server pbs03.pic.es (failcount=4, retry in 16 seconds)
03/17/2011 04:00:24;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, 
03/17/2011 04:01:05;0002;   pbs_mom;n/a;mom_server_check_connection;unable to establish/restore connection to server pbs03.pic.es (failcount=5, retry in 32 seconds)
03/17/2011 04:01:08;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::No route to host (113) in scan_for_exiting, cannot connect to port 747 in client_to_svr - errno:113 No route to host
03/17/2011 04:01:49;0002;   pbs_mom;n/a;mom_server_check_connection;unable to establish/restore connection to server pbs03.pic.es (failcount=6, retry in 64 seconds)
03/17/2011 04:02:07;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting, 
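
The retry intervals in the six mom_server_check_connection entries above
double with every failure (2, 4, 8, 16, 32, 64 seconds), which matches
2^failcount. A small sketch of that relation, assuming it holds only for
the failcounts seen here (whether pbs_mom caps the interval is not
visible in this excerpt); retry_delay is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: prints the retry interval in seconds observed in
# the log for a given failcount (2^failcount for failcount 1..6).
retry_delay() {
    echo $((1 << $1))
}

retry_delay 1
retry_delay 6
```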

[at some point it can contact the server again, but it gives a strange error]:

03/17/2011 05:35:14;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.3, loglevel = 0
03/17/2011 05:36:52;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server pbs03.pic.es
m PBS_Server at pbs03.pic.es
03/17/2011 05:39:49;0008;   pbs_mom;Job;process_request;request type CopyFiles from host pbs03.pic.es rejected (host not authorized)
m PBS_Server at pbs03.pic.es
03/17/2011 05:37:10;0080;   pbs_mom;Req;req_reject;Reject reply code=15010(Access from host not allowed, or unknown host MSG=request not authorized), aux=0, type=CopyFiles, from PBS_Server at pbs03.pic.es

[from then until now, the log is full of the above messages]
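
To see how many of these rejections a mom log contains, the matching
lines can be counted. This is a rough sketch, assuming the message text
is exactly as in the excerpt above; count_rejects is a hypothetical
name, and the log path in the usage line is the one quoted earlier.

```shell
#!/bin/sh
# Hypothetical helper: prints the number of "host not authorized"
# rejections in the given mom log file (0 if there are none).
count_rejects() {
    grep -c 'rejected (host not authorized)' "$1"
}

# Usage sketch:
# count_rejects /var/spool/pbs/mom_logs/20110317
```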

From the client I can resolve and ping the server:

# host pbs03.pic.es
pbs03.pic.es has address 193.109.174.13

# ping pbs03.pic.es
PING pbs03.pic.es (193.109.174.13) 56(84) bytes of data.
64 bytes from pbs03.pic.es (193.109.174.13): icmp_seq=1 ttl=62 time=0.258 ms

--- pbs03.pic.es ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.258/0.258/0.258/0.000 ms

Restarting the mom seems to solve the issue (not 100% sure yet).


The master is running torque-2.5.5-1.cri.x86_64 and the client 2.5.3.

Has anyone seen this error before? What does it mean?
Has anyone experienced this kind of problem before (network problems
while mom is starting make mom misbehave)?

Why, if momctl is not able to contact the client's mom, does torque not
mark the node down?


TIA,
Arnau

