[torquedev] [Bug 81] New: Timeouts caused by hanging Disconnect requests

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Wed Sep 15 09:47:29 MDT 2010


           Summary: Timeouts caused by hanging Disconnect requests
           Product: TORQUE
           Version: 2.4.x
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: pbs_server
        AssignedTo: glen.beane at gmail.com
        ReportedBy: SimonT at mail.muni.cz
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

When a disconnect request is preceded by a run job request, the disconnect will
hang until the send job fork finishes (because the fork still holds the closed
socket).

This particularly affects qsub and can lead to a state in which no interactive
jobs can be run.

* qsub tries to disconnect and hangs because sendjob still holds the socket
* mom receives the job and tries to contact the qsub
* send job is done on the server and exits
* qsub is finally unblocked because no one holds the socket anymore
* mom times out on the read request for term type
* qsub is ready to talk to mom
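
The mechanism behind the first and fourth steps can be reproduced outside
TORQUE. The following is a minimal sketch, not TORQUE code: it assumes a
socketpair() as a stand-in for the qsub/pbs_server connection, and the helper
name peer_sees_eof is hypothetical. The parent closes its end, but the peer
only sees end-of-file once the forked child has dropped its inherited copy of
the descriptor as well.

```c
#include <poll.h>
#include <signal.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper. Returns 1 if the peer sees EOF within
 * timeout_ms after the parent closes its end, 0 if it does not,
 * and -1 on setup failure.
 */
int peer_sees_eof(int child_closes_copy, int timeout_ms)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: stands in for the long-running send job fork. */
        close(sv[1]);
        if (child_closes_copy)
            close(sv[0]);   /* the proposed fix: drop the inherited copy */
        pause();            /* stay alive until the parent kills us */
        _exit(0);
    }

    /* Parent: "disconnects" by closing its own copy of the socket. */
    close(sv[0]);

    /* Peer side: wait for EOF, which arrives only when every copy is closed. */
    struct pollfd pfd = { .fd = sv[1], .events = POLLIN };
    int eof = 0;
    if (poll(&pfd, 1, timeout_ms) > 0) {
        char c;
        eof = (read(sv[1], &c, 1) == 0);
    }

    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    close(sv[1]);
    return eof;
}
```

With child_closes_copy set to 0 the peer waits out the whole timeout, which
corresponds to the hanging disconnect described above.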

Therefore each forked child on the server should close all connections (well,
except those related to the request being processed).
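
One way such a cleanup could look is sketched below. This is an illustration
under assumptions, not the actual pbs_server change: the function name
close_range_except and its parameters are hypothetical, and it simply walks a
descriptor range instead of consulting the server's connection table.

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: close every descriptor in [first, last] that is
 * not listed in keep[]. Returns the number of descriptors actually closed.
 * Descriptors in the range that are not open are skipped silently.
 */
int close_range_except(int first, int last, const int *keep, size_t nkeep)
{
    int closed = 0;

    for (int fd = first; fd <= last; fd++) {
        int wanted = 0;

        for (size_t i = 0; i < nkeep; i++) {
            if (keep[i] == fd) {
                wanted = 1;
                break;
            }
        }
        if (!wanted && close(fd) == 0)
            closed++;
    }
    return closed;
}
```

A forked child would call this right after fork(), for example with first set
to 3 (to spare stdin, stdout and stderr), last set to
sysconf(_SC_OPEN_MAX) - 1, and keep[] holding only the socket of the request
it is processing.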

