[Mauiusers] maui start problem

Stijn De Weirdt stijn.deweirdt at ugent.be
Sat Jul 12 11:11:28 MDT 2008


hi all,

i had some time to look a bit further into it.

the good news is that the scheduling works (and that i now know i can
ignore the 'Resource temporarily unavailable' messages).

the bad news is that showq (or any other maui command) still fails.

strace of showq gives
...
connect(3, {sa_family=AF_INET, sin_port=htons(40559), sin_addr=inet_addr("192.168.10.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
...
sendto(3, "00000057\nCK=4fa43eb400e5e9d7  DT=CMD=showq AUTH=root ARG=0 ALL 0 \n", 66, 0, NULL, 0) = 66
select(4, [3], NULL, NULL, {30, 0})     = 1 (in [3], left {29, 893000})
recvfrom(3, "", 9, 0, NULL, NULL)       = 0
write(2, "ERROR:    lost connection to server\n", 36ERROR:    lost connection to server
) = 36

strace of maui during that try gives:
...
select(10, [9], NULL, NULL, {5, 0})     = 1 (in [9], left {5, 0})
recvfrom(9, "00000057\n", 9, 0, NULL, NULL) = 9
select(10, [9], NULL, NULL, {5, 0})     = 1 (in [9], left {5, 0})
recvfrom(9, "CK=4fa43eb400e5e9d7  DT=CMD=showq AUTH=root ARG=0 ALL 0 \n", 57, 0, NULL, NULL) = 57
close(9)                                = 0
...
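
for anyone reading the two traces side by side: the request looks like an 8-digit decimal length, a newline, and then exactly that many bytes of payload (9 + 57 = 66, which matches the sendto() and the two recvfrom() calls). this is only inferred from the strace output above, not from the maui source, and the helper name split_frame is made up for illustration. a minimal python sketch of that check:

```python
# Sketch only: the framing is inferred from the strace output above,
# not taken from the Maui source. split_frame is a hypothetical helper.

HEADER_LEN = 9  # 8 digits + "\n", matching the 9-byte recvfrom() calls

# The exact bytes showq passed to sendto() in the trace.
REQUEST = b"00000057\nCK=4fa43eb400e5e9d7  DT=CMD=showq AUTH=root ARG=0 ALL 0 \n"


def split_frame(data: bytes) -> bytes:
    """Return the payload of one frame after checking the announced length."""
    header, payload = data[:HEADER_LEN], data[HEADER_LEN:]
    if len(header) != HEADER_LEN or not header.endswith(b"\n"):
        raise ValueError("short or malformed length header")
    announced = int(header[:-1].decode("ascii"))  # "00000057" -> 57
    if len(payload) != announced:
        raise ValueError(f"expected {announced} bytes, got {len(payload)}")
    return payload


payload = split_frame(REQUEST)
print(len(REQUEST), len(payload))
```

so the server did receive the complete request before it called close(9); the request was not truncated on the wire.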


thanks,


stijn

> symptom:
> submitted jobs stay queued, showq/checkjob commands fail.
the symptoms are not correlated.

the fact that the scheduling didn't work seems due to the following  
line in my maui.cfg (that i copied from a working setup that was using  
a previous snapshot):

SYSCFG[base] PLIST=

setting LOGLEVEL to 9 and carefully reading the important messages
gave some hints that all the connections to torque were working fine,
but that the jobs were held by something else.


>
> (using LOGLEVEL 9):
> in /var/log/maui.log:
>
> 07/10 16:35:05 INFO:     no PBS sched socket connections ready
> 07/10 16:35:05 MSUAcceptClient(5,ClientSD,HostName,TCP)
> 07/10 16:35:05 INFO:     accept call failed, errno: 11 (Resource
> temporarily unavailable)
> 07/10 16:35:05 INFO:     all clients connected.  servicing requests

reading the log files more carefully, fd 5 is the listening socket on
port 40559, and this message only means that nothing was waiting to
connect to it at that moment. (e.g. telnet localhost 40559 shows
something)
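
the same errno can be reproduced outside maui: calling accept() on a non-blocking listening socket with no pending client fails with EAGAIN, which is errno 11 ("Resource temporarily unavailable") on linux. the sketch below is plain python and has nothing to do with maui's own code; the function name is made up and it binds to an arbitrary free port rather than 40559.

```python
import errno
import os
import socket


def accept_without_client() -> int:
    """Call accept() on an idle non-blocking listener and return the errno."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        listener.listen(5)
        listener.setblocking(False)
        try:
            conn, _addr = listener.accept()
        except BlockingIOError as exc:
            # No client is waiting, so accept() cannot complete right now.
            return exc.errno
        conn.close()  # not expected here: somebody actually connected
        return 0
    finally:
        listener.close()


err = accept_without_client()
print(err, errno.errorcode.get(err), os.strerror(err))
```

so the log line is what an idle, polled listener looks like, not a sign that the port is broken.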




