[Mauiusers] cannot accept client on PBS sched socket

Garrick Staples garrick at usc.edu
Fri Nov 11 13:13:12 MST 2005


On Fri, Nov 11, 2005 at 12:10:54PM -0500, Aquarijen alleged:
> Hi all,
> 
> I'm a newbie maui admin.  Maui seemed to be working and scheduling
> jobs according to my maui.cfg until this morning when I tried to
> restart it when I made a change to the config.  Obviously I have
> reverted to the known working config, but I still see these errors. 
> The funny thing is that jobs are still being scheduled.  Any pointers
> as to where to look for more info on this would be greatly appreciated
> or if I have missed something in the manual, please let me know. 
> Also, I am unsure of what other info might be helpful, so if I haven't
> included enough info, please let me know that too. :)
> 
> I have restarted the compute nodes and the problem continues, btw.
> 
> Maui Scheduler version 3.2.5p2 (it came with oscar).  Both pfilter and
> iptables are off on all nodes.

What version of TORQUE?

I think 3.2.5p2 is older than you really want to use with TORQUE.  I think it
was around 3.2.5p10 that TORQUE and Maui picked up matching changes
that remove the need for maui to poll the MOMs.

 
> pbsnodes -a shows all nodes are free.
> 
> pbs_server logs just show:
> 11/11/2005 11:58:19;0040;PBS_Server;Svr;b01l02;Scheduler sent command time
> 11/11/2005 11:59:19;0040;PBS_Server;Svr;b01l02;Scheduler sent command time
> 11/11/2005 12:00:19;0040;PBS_Server;Svr;b01l02;Scheduler sent command time
> 
> pbs_mom log on the first node shows:
> 11/11/2005 11:11:20;0002;   pbs_mom;Svr;Log;Log opened
> 11/11/2005 11:11:20;0002;   pbs_mom;Svr;usecp;b01l02:/home /home
> 11/11/2005 11:11:20;0002;   pbs_mom;Svr;restricted;b01l02
> 11/11/2005 11:11:20;0002;   pbs_mom;Svr;timeout;60
> 11/11/2005 11:11:20;0080;   pbs_mom;n/a;add_static;config[0] add name
> timeout value 60
> 11/11/2005 11:11:20;0002;   pbs_mom;n/a;initialize;independent
> 11/11/2005 11:11:20;0002;   pbs_mom;Svr;pbs_mom;Is up
> 11/11/2005 11:11:20;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
> 11/11/2005 11:11:20;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File
> from addr 192.168.79.251:15001
> 11/11/2005 11:11:23;0001;   pbs_mom;Svr;pbs_mom;Broken pipe (32) in
> rm_request, write request response failed: Protocol failure in commit
>         message refused from port 1022 addr 192.168.79.251
> 11/11/2005 11:11:43;0002;   pbs_mom;n/a;is_update_stat;hello sent to server

It's normal to see EOF errors immediately after restarting PBS
daemons.  In this case, pbs_server still had valid connection info for
the MOM at the time the MOM came up.  pbs_server has to close its end of
the stale connection and refuse the new connection from the MOM.  With
both ends now in sync, the next connection initiated by either side will
succeed.

Newer TORQUE releases handle this better.


> In the maui logs, I see many of these:
> 11/11 11:52:54 ALERT:    cannot accept client on PBS sched socket
> 11/11 11:52:54 MSUAcceptClient(6,ClientSD,HostName,TCP)
> 11/11 11:52:54 INFO:     accept call failed, errno: 11 (Resource
> temporarily unavailable)
> 11/11 11:52:54 INFO:     all clients connected.  servicing requests
> 11/11 11:52:54 MRMCheckEvents()
> 11/11 11:52:54 ALERT:    no PBS RPP sched socket connections ready
> 11/11 11:52:54 MSUAcceptClient(8,ClientSD,HostName,TCP)
> 11/11 11:52:54 INFO:     accept call failed, errno: 11 (Resource
> temporarily unavailable)
> 11/11 11:52:54 ALERT:    cannot accept client on PBS sched socket
> 11/11 11:52:54 MSUAcceptClient(6,ClientSD,HostName,TCP)
> 11/11 11:52:54 INFO:     accept call failed, errno: 11 (Resource
> temporarily unavailable)
> 11/11 11:52:54 INFO:     all clients connected.  servicing requests
> 
> Thanks for any information at all.  I am really at a loss here.

I'm not sure either.
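
One thing that may help narrow it down: on Linux, errno 11 is EAGAIN
("Resource temporarily unavailable"), which is what accept() returns on a
non-blocking listening socket when no client is waiting to connect.  Here
is a minimal Python sketch of that behaviour.  It assumes, without my
having checked the Maui source, that the sched socket is non-blocking; the
function name is made up for illustration and has nothing to do with Maui.

```python
import errno
import os
import socket


def try_accept_nonblocking():
    """Call accept() on a non-blocking listener that has no pending client.

    Returns the errno reported by the failed accept(), or 0 if a client
    happened to be waiting (not expected in this sketch).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        listener.setblocking(False)      # assumption: mirrors a non-blocking sched socket
        try:
            conn, _addr = listener.accept()
        except BlockingIOError as exc:
            # Nobody is connecting, so accept() fails instead of waiting.
            return exc.errno
        conn.close()
        return 0
    finally:
        listener.close()


if __name__ == "__main__":
    code = try_accept_nonblocking()
    print(code, os.strerror(code))
```

If that is what is going on, the message by itself only says that nobody
was connecting at the moment maui checked, which would fit with jobs still
being scheduled.  It doesn't explain why it is logged as an ALERT, though.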

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California