[torqueusers] strange problem after unscheduled node reboot

Coyle, James J [ITACD] jjc at iastate.edu
Mon Mar 21 16:46:19 MDT 2011


Arnau,

  If this is not already solved, try two things:

1) Check that /etc/resolv.conf was not changed by the reboot. I assume
     you have already checked /etc/hosts.
2) If the nodes sit on an internal, non-routed subnet (172.16.x.y, for
    example), check whether you need to execute any route commands so
    that the nodes can gateway out through the head node.
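
A minimal sketch of both checks in Python, assuming a Linux node. It lists the nameservers in resolv.conf-style text, extracts the default gateway from /proc/net/route-style text, and tests whether a hostname resolves. The function names are illustrative and not part of Torque, and the gateway decoding assumes a little-endian host such as x86.

```python
import socket
import struct


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses found in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # resolv.conf treats lines starting with '#' or ';' as comments.
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def default_gateways(route_table_text):
    """Return (interface, gateway) pairs for default routes.

    Expects text in the format of Linux's /proc/net/route, where addresses
    are hex in host byte order (assumed little-endian here, e.g. x86).
    """
    gateways = []
    for line in route_table_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "00000000":  # destination 0.0.0.0
            packed = struct.pack("<L", int(fields[2], 16))
            gateways.append((fields[0], socket.inet_ntoa(packed)))
    return gateways


def resolves(hostname):
    """Return True if the system resolver can look up hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def report(server="pbs03.pic.es"):
    """Print the three checks for this node (hypothetical helper)."""
    with open("/etc/resolv.conf") as f:
        print("nameservers:", parse_nameservers(f.read()))
    with open("/proc/net/route") as f:
        print("default routes:", default_gateways(f.read()))
    print(server, "resolves:", resolves(server))
```

Calling report() on a healthy node and on one of the affected nodes, then comparing the output, should show whether the resolver configuration or the default route differs after the reboot.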


James Coyle, PhD
 High Performance Computing Group        
 Iowa State Univ.          
web: http://www.public.iastate.edu/~jjc


>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>bounces at supercluster.org] On Behalf Of Arnau Bria
>Sent: Thursday, March 17, 2011 5:20 AM
>To: torqueusers at supercluster.org
>Subject: [torqueusers] strange problem after unscheduled node reboot
>
>Hi all,
>
>last night part of our farm was rebooted because of a power
>cut.
>
>Since then, some of the nodes have been accepting jobs but not
>executing them, so all their jobs end up in W state and the nodes
>have become "black holes".
>
>
>If I run momctl against one of those nodes:
>
># momctl -h tditaller028.pic.es -d 3
>ERROR:    query[0] 'diag3' failed on tditaller028.pic.es (errno=0-
>Success: 5-Input/output error)
>
>but it is not marked as down...
>
>and its log, from the power cut onwards, shows:
>
>#  /var/spool/pbs/mom_logs/20110317
>
>[power cut & reboot] some network problems:
>03/17/2011 03:56:43;0002;
>pbs_mom;Svr;setup_program_environment;MOM executable path and mtime
>at launch: /usr/sbin/pbs_mom 1291378489
>03/17/2011 03:56:43;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
>2.5.3, loglevel = 0
>03/17/2011 03:57:23;0002;
>pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: cannot
>open rpp connection to pbs03.pic.es, errno=2, hostname resolution
>for 'pbs03.pic.es' failed, errno=2 (check /etc/hosts file?)
>03/17/2011 03:57:23;0002;
>pbs_mom;n/a;mom_server_check_connection;unable to establish/restore
>connection to server pbs03.pic.es (failcount=1, retry in 2 seconds)
>03/17/2011 03:57:41;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation
>now in progress (115) in scan_for_exiting,
>03/17/2011 03:58:22;0002;
>pbs_mom;n/a;mom_server_check_connection;unable to establish/restore
>connection to server pbs03.pic.es (failcount=2, retry in 4 seconds)
>03/17/2011 03:58:23;0008;   pbs_mom;Job;15791881.pbs03.pic.es;job
>was terminated
>03/17/2011 03:58:24;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::No route
>to host (113) in scan_for_exiting, cannot connect to port 679 in
>client_to_svr - errno:113 No route to host
>03/17/2011 03:59:05;0002;
>pbs_mom;n/a;mom_server_check_connection;unable to establish/restore
>connection to server pbs03.pic.es (failcount=3, retry in 8 seconds)
>03/17/2011 03:59:24;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation
>now in progress (115) in scan_for_exiting,
>03/17/2011 04:00:05;0002;
>pbs_mom;n/a;mom_server_check_connection;unable to establish/restore
>connection to server pbs03.pic.es (failcount=4, retry in 16 seconds)
>03/17/2011 04:00:24;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation
>now in progress (115) in scan_for_exiting,
>03/17/2011 04:01:05;0002;
>pbs_mom;n/a;mom_server_check_connection;unable to establish/restore
>connection to server pbs03.pic.es (failcount=5, retry in 32 seconds)
>03/17/2011 04:01:08;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::No route
>to host (113) in scan_for_exiting, cannot connect to port 747 in
>client_to_svr - errno:113 No route to host
>03/17/2011 04:01:49;0002;
>pbs_mom;n/a;mom_server_check_connection;unable to establish/restore
>connection to server pbs03.pic.es (failcount=6, retry in 64 seconds)
>03/17/2011 04:02:07;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation
>now in progress (115) in scan_for_exiting,
>
>[at some point it can contact the server again, but gives a strange
>error]:
>
>03/17/2011 05:35:14;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
>2.5.3, loglevel = 0
>03/17/2011 05:36:52;0002;
>pbs_mom;n/a;mom_server_check_connection;sending hello to server
>pbs03.pic.es
>03/17/2011 05:39:49;0008;   pbs_mom;Job;process_request;request type
>CopyFiles from host pbs03.pic.es rejected (host not authorized)
>03/17/2011 05:37:10;0080;   pbs_mom;Req;req_reject;Reject reply
>code=15010(Access from host not allowed, or unknown host MSG=request
>not authorized), aux=0, type=CopyFiles, from
>PBS_Server at pbs03.pic.es
>
>[since then, the log has been full of the above messages]
>
>From the client I can resolve and ping the server:
>
># host pbs03.pic.es
>pbs03.pic.es has address 193.109.174.13
>
># ping pbs03.pic.es
>PING pbs03.pic.es (193.109.174.13) 56(84) bytes of data.
>64 bytes from pbs03.pic.es (193.109.174.13): icmp_seq=1 ttl=62
>time=0.258 ms
>
>--- pbs03.pic.es ping statistics ---
>1 packets transmitted, 1 received, 0% packet loss, time 0ms
>rtt min/avg/max/mdev = 0.258/0.258/0.258/0.000 ms
>
>And a mom restart seems to solve the issue (not 100% sure yet).
>
>
>The master is running torque-2.5.5-1.cri.x86_64 and the client 2.5.3.
>
>Has anyone seen this error before? What does it mean?
>Has anyone experienced this kind of problem before (network problems
>while mom is starting make mom misbehave)?
>
>Why, if momctl is not able to contact the client's mom, does torque
>not mark the node as down?
>
>
>TIA,
>Arnau
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers

