[torquedev] [Bug 81] New: Timeouts caused by hanging Disconnect requests
bugzilla-daemon at supercluster.org
Wed Sep 15 09:47:29 MDT 2010
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=81
Summary: Timeouts caused by hanging Disconnect requests
Product: TORQUE
Version: 2.4.x
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P5
Component: pbs_server
AssignedTo: glen.beane at gmail.com
ReportedBy: SimonT at mail.muni.cz
CC: torquedev at supercluster.org
Estimated Hours: 0.0
When a disconnect request is preceded by a run job request, the disconnect will
hang until the send job fork finishes (because the forked child still holds a
copy of the socket that the server has already closed).
This affects qsub in particular and can lead to a state in which no interactive
jobs can be run.
* qsub tries to disconnect and hangs because sendjob still holds the socket
* mom receives the jobs and tries to contact the qsub
* send job is done on server and exits
* qsub is finally unblocked because no one holds the socket anymore
* mom has a timeout on the read request for term type
* qsub is ready to talk to mom
Therefore, each forked child on the server should close all connections (well,
except those related to the request being processed).
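
The effect described above can be reproduced outside of TORQUE. The following
is a minimal sketch, not pbs_server code: close_inherited_fds() and
peer_sees_eof() are hypothetical names, a socketpair stands in for the client
connection, and the upper bound of 256 descriptors is an arbitrary assumption.
The child plays the role of the send job fork; the parent plays the server
closing the connection while the child is still running.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: in a freshly forked child, close every descriptor
 * in [3, max_fd) except keep_fd. close() on an unopened descriptor just
 * fails with EBADF, which is harmless here. */
void close_inherited_fds(int keep_fd, int max_fd)
{
    assert(max_fd >= 3);
    for (int fd = 3; fd < max_fd; fd++) {
        if (fd != keep_fd)
            close(fd);
    }
}

/* Returns 1 if the peer sees EOF within timeout_ms after the parent closes
 * its end of the connection, 0 if the wait times out (the child still holds
 * the socket), and -1 on error. */
int peer_sees_eof(int child_closes_fds, int timeout_ms)
{
    int sv[2];    /* sv[0]: server side, sv[1]: client (qsub) side */
    int gate[2];  /* pipe that keeps the child alive until released */

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (pipe(gate) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]); close(sv[1]);
        close(gate[0]); close(gate[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: stands in for the send job fork. */
        char c;
        close(gate[1]);
        if (child_closes_fds)
            close_inherited_fds(gate[0], 256);  /* 256 is an assumption */
        if (read(gate[0], &c, 1) < 0)           /* wait for the parent */
            _exit(1);
        _exit(0);
    }

    /* Parent: the server closes the client connection. */
    close(gate[0]);
    close(sv[0]);

    struct pollfd p = { .fd = sv[1], .events = POLLIN };
    int result = -1;
    int n = poll(&p, 1, timeout_ms);
    if (n == 0) {
        result = 0;                             /* disconnect is hanging */
    } else if (n > 0) {
        char c;
        result = (read(sv[1], &c, 1) == 0) ? 1 : -1;
    }

    close(gate[1]);                             /* release the child */
    waitpid(pid, NULL, 0);
    close(sv[1]);
    return result;
}
```

With the child keeping its inherited copy, peer_sees_eof(0, 200) should time
out and return 0; with the child closing everything it does not need,
peer_sees_eof(1, 2000) should return 1, since the last reference to the
server side of the socket is gone as soon as the parent closes it.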
--
Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.