[torqueusers] torque/maui hanging bug(?)

Craig Macdonald craigm at dcs.gla.ac.uk
Mon Aug 9 20:27:00 MDT 2010


Please try enabling nscd on the machine running torque pbs_server and/or 
maui.

Craig
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 9 Aug 2010 14:14:09 -0500
> From: Will Nolan<will at headlandstech.com>
> Subject: [torqueusers] torque/maui hanging bug(?)
> To: "torqueusers at supercluster.org"<torqueusers at supercluster.org>
> Message-ID:
> 	<C90590B7DD87F44EB5EC479B822C996205A7EFAEF7 at 34093-MBX-C05.mex07a.mlsrvr.com>
> 	
> Content-Type: text/plain; charset="us-ascii"
>
> Using torque-2.6.0-snap.201008061539 and maui-3.3, I encountered some strange behavior when scheduling jobs where the maui scheduler would get "hung up" on communication with the server.  I finally tracked it down to this message in the maui log file:
>
> INFO:     starting iteration 50
> MRMGetInfo()
> MClusterClearUsage()
> MRMClusterQuery()
> MPBSClusterQuery(abc.xyz.com,RCount,SC)
> ERROR:    cannot get node info: NULL
>
> The behavior observed is that after several minutes maui is finally able to continue and begins scheduling jobs again.
> I should mention that nscd is running on both machines, which had solved an earlier problem.  From previous Google searches I noticed that a few folks had encountered this problem, but my guess is that it usually goes unnoticed, since anyone with relatively long-running jobs would have no idea that the scheduler had gotten hung up.  The only way we noticed it was because we were testing a fairly intensive set of short-running jobs that we expected to finish soon.
>
> I was able to reproduce this problem fairly regularly, so I attached to maui with gdb and found some code that I believe is responsible.  It turns out this code is in torque's src/lib/Libifl/pbsD_connect.c, around line 900:
>
>    if ((encode_DIS_ReqHdr(sock, PBS_BATCH_Disconnect, pbs_current_user) == 0) &&
>        (DIS_tcp_wflush(sock) == 0))
>      {
>      int atime;
>
>      struct sigaction act;
>
>      struct sigaction oldact;
>
>      /* set alarm to break out of potentially infinite read */
>
>      act.sa_handler = SIG_IGN;
>      sigemptyset(&act.sa_mask);
>      act.sa_flags = 0;
>      sigaction(SIGALRM,&act,&oldact);
>
>      atime = alarm(pbs_tcp_timeout);
>
>      /* NOTE:  alarm will break out of blocking read even with sigaction ignored */
>
>      while (1)
>        {
>        /* wait for server to close connection */
>
>        /* NOTE:  if read of 'sock' is blocking, request below may hang forever */
>
>        if (read(sock, &x, sizeof(x)) < 1)
>          break;
>        }
>
>      alarm(atime);
>
>      sigaction(SIGALRM,&oldact, NULL);
>      }
>
> close(sock);
>
> My understanding of this is that, for some reason, the client is trying to disconnect from the server.  To do so, it waits for a read from the (blocking) socket to the server to return less than 1 (0 at end-of-file, or -1 on error), i.e. it expects the server to close the connection from its end.  It sets a signal disposition intended to effect a timeout on the read.  pbs_tcp_timeout was set to 9 (seconds) when I was attached.
>
> The comments suggesting that setting SIG_IGN for the alarm handler will still result in the blocking read being interrupted are incorrect, however.  I believe this may be implementation-specific, but it definitely is not the case on our version of Linux (fc12).  I also don't see why it would ever be reasonable to expect this to behave like this.  A simple test program proves the point:
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <stdint.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <signal.h>
>
> /* Any installed handler makes SIGALRM interrupt the blocking read. */
> void handler(int signo)
> {
>    fprintf(stderr, "Caught signal #%d\n", signo);
> }
>
>
> int main(void)
> {
>    struct sigaction act;
>    char buf[10];
>    ssize_t br;
>
> //  act.sa_handler = SIG_IGN;
>    act.sa_handler = handler;
>    sigemptyset(&act.sa_mask);
>    act.sa_flags = 0;   /* no SA_RESTART, so read() is not restarted */
>    sigaction(SIGALRM, &act, NULL);
>    alarm(10);
>
>    br = read(0, buf, sizeof(buf));
>
>    fprintf(stderr, "Broke out of read with br = %ld\n", (long)br);
>    return 0;
> }
>
> Run this program as-is and the read from stdin will get interrupted after 10 seconds, and the read will return -1.  However, switch the comment line to use SIG_IGN and the read will block indefinitely.
>
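The comparison described above can also be run without anyone watching stdin. What follows is only a minimal sketch, not code from torque: the helper name alarm_interrupts_read() and the 1-second and 3-second timings are assumptions made for illustration. A child process blocks in read() on an empty pipe with a 1-second alarm armed, and the parent reports whether the child came back by itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* No-op handler: its only job is to make SIGALRM caught rather than ignored. */
static void on_alarm(int signo)
{
    (void)signo;
}

/*
 * Hypothetical helper (not part of torque).
 * Returns 1 if the alarm interrupted the blocking read (child exited by
 * itself), 0 if the child was still blocked after about 3 seconds and had
 * to be killed, and -1 if the pipe or fork could not be set up.
 */
int alarm_interrupts_read(int use_sig_ign)
{
    int fds[2];
    int status = 0;
    int exited = 0;
    pid_t pid;

    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: the write end stays open, so read() never sees EOF. */
        struct sigaction act;
        char c;
        ssize_t n;

        act.sa_handler = use_sig_ign ? SIG_IGN : on_alarm;
        sigemptyset(&act.sa_mask);
        act.sa_flags = 0;               /* no SA_RESTART */
        sigaction(SIGALRM, &act, NULL);

        alarm(1);
        n = read(fds[0], &c, 1);
        _exit((n < 0 && errno == EINTR) ? 0 : 1);
    }

    /* Parent: poll for up to about 3 seconds for the child to exit. */
    for (int i = 0; i < 30 && !exited; i++) {
        if (waitpid(pid, &status, WNOHANG) == pid) {
            exited = 1;
        } else {
            struct timespec ts = { 0, 100000000L };   /* 100 ms */
            nanosleep(&ts, NULL);
        }
    }
    if (!exited) {
        kill(pid, SIGKILL);             /* child is stuck in read() */
        waitpid(pid, &status, 0);
    }
    close(fds[0]);
    close(fds[1]);
    return exited && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Calling alarm_interrupts_read(0) installs the no-op handler and alarm_interrupts_read(1) uses SIG_IGN, so the two return values can be compared directly on whatever platform is in question.
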
> I don't understand pbs_server well enough to know why it takes so long to disconnect a client, but it is not unreasonable for there to be a very long delay there as it is not a high priority action.  However, I believe the code as written is incorrect, and causes schedulers like maui which use torque's client libraries to get hung up unreasonably.  Perhaps this is also the case for pbs_sched.
>
> I made a change to our local copy of the source where I installed an empty signal handler (i.e. "void foo(int signo) {}", and set act.sa_handler = foo), along with some debugging printouts.  I recompiled torque and maui, and I was able to verify from the maui logs that the timeout now gets properly handled, and maui was able to continue gracefully.
>
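For reference, the change described above might look roughly like the sketch below. It is not the actual patch: the function name wait_for_peer_close(), the handler name alarm_noop() and the return codes are hypothetical, and only the sigaction/alarm/read pattern is taken from the quoted pbsD_connect.c fragment.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Empty handler: SIGALRM is now caught, so a blocked read() fails with EINTR. */
static void alarm_noop(int signo)
{
    (void)signo;
}

/*
 * Hypothetical helper (not a torque function): wait for the peer to close
 * 'sock', giving up after 'timeout_secs' seconds.
 * Returns 0 if the peer closed the connection (EOF), 1 if the read was
 * interrupted (treated as a timeout in this sketch), -1 on any other error.
 * A timeout of 0 arms no alarm, so the call can then block indefinitely.
 */
int wait_for_peer_close(int sock, unsigned int timeout_secs)
{
    struct sigaction act;
    struct sigaction oldact;
    unsigned int atime;
    char x;
    int result;

    act.sa_handler = alarm_noop;        /* instead of SIG_IGN */
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;                   /* no SA_RESTART */
    sigaction(SIGALRM, &act, &oldact);

    atime = alarm(timeout_secs);

    for (;;) {
        ssize_t n = read(sock, &x, sizeof(x));

        if (n > 0)
            continue;                   /* discard data, keep waiting */
        if (n == 0)
            result = 0;                 /* peer closed the connection */
        else if (errno == EINTR)
            result = 1;                 /* interrupted by the alarm */
        else
            result = -1;                /* some other read error */
        break;
    }

    /* Restore the previous alarm and disposition, as the original code does. */
    alarm(atime);
    sigaction(SIGALRM, &oldact, NULL);
    return result;
}
```

One caveat with this pattern: if the alarm fires before read() is entered, the read can still block. Waiting with select() or poll() and an explicit timeout would avoid both that race and the dependence on signals.
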
> In any case, I'd like to solicit some feedback:
>
>
> -          Do the developers agree with my assessment of the problem?
>
> -          If so, are there other spots in the code that need to be fixed as well?
>
> Many thanks,
> William
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 9 Aug 2010 14:24:27 -0700
> From: Garrick Staples<garrick at usc.edu>
> Subject: Re: [torqueusers] torque/maui hanging bug(?)
> To: "torqueusers at supercluster.org"<torqueusers at supercluster.org>
> Message-ID:<20100809212427.GE15599 at polop.usc.edu>
> Content-Type: text/plain; charset="us-ascii"
>
> On Mon, Aug 09, 2010 at 02:14:09PM -0500, Will Nolan alleged:
>    
>> Using torque-2.6.0-snap.201008061539 and maui-3.3, I encountered some strange behavior when scheduling jobs where the maui scheduler would get "hung up" on communication with the server.  I finally tracked it down to this message in the maui log file:
>>
>>      
> I actually brought this up recently on the torquedev list.
>
> http://www.supercluster.org/pipermail/torquedev/2010-July/002558.html
>
> ------------------------------
>
> Message: 3
> Date: Mon, 9 Aug 2010 16:31:12 -0500
> From: Will Nolan<will at headlandstech.com>
> Subject: Re: [torqueusers] torque/maui hanging bug(?)
> To: Torque Users Mailing List<torqueusers at supercluster.org>
> Message-ID:
> 	<C90590B7DD87F44EB5EC479B822C996205A7EFAFA8 at 34093-MBX-C05.mex07a.mlsrvr.com>
> 	
> Content-Type: text/plain; charset="us-ascii"
>
>    
>> On Mon, Aug 09, 2010 at 02:14:09PM -0500, Will Nolan alleged:
>>      
>>> Using torque-2.6.0-snap.201008061539 and maui-3.3, I encountered some strange behavior when scheduling jobs where the maui scheduler would get "hung up" on communication with the server.  I finally tracked it down to this message in the maui log file:
>>>
>>>        
>> I actually brought this up recently on the torquedev list.
>>
>> http://www.supercluster.org/pipermail/torquedev/2010-July/002558.html
>>      
> Oops -- I probably posted to the wrong list.  Dare I crosspost and risk the wrath of the list admin?  :-)
>
> Will
>
>
>
> ------------------------------
>
> Message: 4
> Date: Mon, 09 Aug 2010 15:53:46 -0600
> From: Ken Nielson<knielson at adaptivecomputing.com>
> Subject: Re: [torqueusers] torque/maui hanging bug(?)
> To: torqueusers at supercluster.org
> Message-ID:<4C6078EA.3010407 at adaptivecomputing.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 08/09/2010 03:31 PM, Will Nolan wrote:
>    
>>> On Mon, Aug 09, 2010 at 02:14:09PM -0500, Will Nolan alleged:
>>>
>>>        
>>>> Using torque-2.6.0-snap.201008061539 and maui-3.3, I encountered some strange behavior when scheduling jobs where the maui scheduler would get "hung up" on communication with the server.  I finally tracked it down to this message in the maui log file:
>>>>
>>>>
>>>>          
>>> I actually brought this up recently on the torquedev list.
>>>
>>> http://www.supercluster.org/pipermail/torquedev/2010-July/002558.html
>>>
>>>        
>> Oops -- I probably posted to the wrong list.  Dare I crosspost and risk the wrath of the list admin?  :-)
>>
>> Will
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>      
> src/lib/Libifl/pbsD_connect.c, around line 900?
>
> I would say you got the right list.  How often do you get hung? Did you
> patch your copy of the code?
>
> If you have patched your code and it works for you I suggest submitting
> a bug to www.clusterresources.com/bugzilla and posting the patch there.
>
> Ken Nielson
> Adaptive Computing
>
>
> ------------------------------
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> End of torqueusers Digest, Vol 73, Issue 11
> *******************************************
>    


