[torquedev] [Bug 81] New: Timeouts caused by hanging Disconnect requests

bugzilla-daemon at supercluster.org
Wed Sep 15 09:47:29 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=81

           Summary: Timeouts caused by hanging Disconnect requests
           Product: TORQUE
           Version: 2.4.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: pbs_server
        AssignedTo: glen.beane at gmail.com
        ReportedBy: SimonT at mail.muni.cz
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0


When a disconnect request is preceded by a run job request, the disconnect will
hang until the send job fork finishes (because the forked child still holds an
open copy of the socket that the server has closed).

This affects qsub in particular and can lead to a state in which no interactive
jobs can be run.

* qsub tries to disconnect and hangs because the send job child still holds the socket
* mom receives the job and tries to contact qsub
* send job finishes on the server and the child exits
* qsub is finally unblocked because no one holds the socket anymore
* meanwhile, mom has timed out on the read request for the term type
* qsub is only now ready to talk to mom

Therefore each forked child on the server should close all inherited
connections (well, except those related to the request it is processing).
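
The effect and the proposed fix can be reproduced outside of TORQUE with a
small sketch. This is not pbs_server code: close_other_connections() and
client_sees_eof() are hypothetical names made up for the illustration, a
socketpair stands in for the qsub connection, and a pipe stands in for the
connection the child is actually working on. The parent closes its end of the
"client" socket while a forked child is still running, and then checks whether
the peer sees end-of-file within 200 ms.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: in a forked child, close every inherited
 * connection except the one the child is processing (keep_fd). */
void close_other_connections(const int *conns, size_t n, int keep_fd)
{
    assert(conns != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (conns[i] >= 0 && conns[i] != keep_fd)
            close(conns[i]);
}

/* Returns 1 if the client end sees EOF within 200 ms of the server
 * closing its end while a forked child is still alive, 0 if the
 * close is still pending at that point, -1 on error. */
int client_sees_eof(int child_closes_inherited)
{
    int sv[2];   /* sv[0]: server end, sv[1]: client (qsub) end */
    int ctl[2];  /* pipe that keeps the child alive until released */

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (pipe(ctl) < 0) {
        close(sv[0]); close(sv[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]); close(sv[1]); close(ctl[0]); close(ctl[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: stands in for the send job fork. */
        char c;
        close(ctl[1]);
        close(sv[1]);  /* the client end belongs to another process */
        if (child_closes_inherited)
            close_other_connections(sv, 1, ctl[0]);
        /* "Work" until the parent closes its end of the pipe. */
        if (read(ctl[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }

    /* Parent: stands in for the server handling the disconnect. */
    close(ctl[0]);
    close(sv[0]);

    struct pollfd pfd = { .fd = sv[1], .events = POLLIN };
    int ready = poll(&pfd, 1, 200);
    int saw_eof = 0;
    if (ready > 0) {
        char c;
        saw_eof = (read(sv[1], &c, 1) == 0);
    }

    close(ctl[1]);          /* release the child */
    waitpid(pid, NULL, 0);
    close(sv[1]);
    return ready < 0 ? -1 : saw_eof;
}
```

With child_closes_inherited set to 0, the client end does not see the close
until the child exits, which corresponds to the hang described above. With it
set to 1, the close is visible as soon as the child has dropped its copy. A
real fix would have to walk the server's own connection table rather than a
plain array of descriptors.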
