[torqueusers] Problem with torque's connection timeout handling.
Ake.Sandgren at hpc2n.umu.se
Thu Nov 4 10:14:45 MST 2004
On Thu, Nov 04, 2004 at 10:31:39AM -0700, Wightman wrote:
> We have been working with a few other sites regarding this problem. We
> are looking for the cleanest solution.
> Would it be useful to have a parameter called MINRESTARTTIME which would
> be the time Moab would wait before it tries to start the job again?
> This could give the resource manager time to start the job up
> successfully and eventually report it back to Maui.
> If this sounds helpful, please let us know and we will implement that
> parameter in Maui/Moab.
I don't think that will help, at least not while using pbs_runjob.
(I haven't checked what a change to pbs_asyrunjob would do.)
If PBSD_Commit doesn't wait for the mom's reply, pbs_server will think
that the job is queued, not running. When the Obit then comes in
before MINRESTARTTIME has elapsed, it will think that the Obit is in
error, and things will get weird after that.
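
The race described above can be modelled with a toy state machine. This is only a sketch under stated assumptions, not TORQUE or Maui code: the `JobTracker` class, its state names, and its methods are hypothetical, and it simply stands in for whichever component receives the Obit while still believing the job is queued.

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETE = "complete"


class JobTracker:
    """Hypothetical view of one job, as held by the Obit receiver."""

    def __init__(self):
        self.state = JobState.QUEUED
        self.log = []

    def commit(self, wait_for_mom):
        # If the commit returns without waiting for the mom's reply,
        # the start of the job is never recorded.
        if wait_for_mom:
            self.state = JobState.RUNNING
        self.log.append(f"commit: job is {self.state.value}")

    def obit(self):
        # An Obit is only accepted for a job believed to be running.
        if self.state is not JobState.RUNNING:
            self.log.append(f"obit rejected: job is {self.state.value}")
            return False
        self.state = JobState.COMPLETE
        self.log.append("obit accepted")
        return True


def run(wait_for_mom):
    tracker = JobTracker()
    tracker.commit(wait_for_mom)
    accepted = tracker.obit()
    return accepted, tracker.state
```

Calling `run(False)` leaves the job queued with the Obit rejected, while `run(True)` accepts the Obit; a delay such as MINRESTARTTIME does not change which branch is taken.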
If I force PBSD_Commit to wait until the mom replies, I get the following
chain of messages.
(This was done on a 2-node system: one pbs_server, one pbs_mom, one job.)
Maui pbs_server pbs_serv_fork pbs_mom
runjob -> prepare
fork -> prepare
PBSD_Commit -> start job
rdrpy <- send rply
(no receive) <- send_reply
Start new iteration
statnode -> prepare
rdrpy (which now gets the runjob reply from above)
(which is a CHOICE_NULL)
so ClusterQuery fails
Sees that it needs a PBSInitialize
gets (or not?) data from statnode reply
Here maui stalls forever
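
The desynchronisation in the trace above can be reproduced with a toy connection that hands back whichever reply is next, with no matching against the request that caused it. This is an illustrative sketch only: `ToyConnection` and the reply strings are hypothetical and do not model the real PBS wire protocol.

```python
from collections import deque


class ToyConnection:
    """Hypothetical single connection with no request/reply matching."""

    def __init__(self):
        self.replies = deque()  # replies queue up in arrival order

    def deliver(self, reply):
        self.replies.append(reply)

    def read_reply(self):
        # Return the next reply, or None to model a read that times out.
        return self.replies.popleft() if self.replies else None


conn = ToyConnection()

# Iteration 1: runjob is sent, but no reply has arrived when the client reads.
first = conn.read_reply()                   # None, treated as a failure
conn.deliver("runjob reply (CHOICE_NULL)")  # the reply shows up late

# Iteration 2: statnode is sent and answered, but the stale reply is read first.
conn.deliver("statnode reply")
second = conn.read_reply()
leftover = list(conn.replies)
```

Here `second` holds the late runjob reply rather than the statnode reply, and the statnode reply is left unread on the connection, which matches the point in the trace where ClusterQuery fails.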
To me this looks like, "scrap all comm code and start from scratch".
The communication state machine is simply wrong (or doesn't exist in the
The assumption that a timeout is a failure is not a good one.
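
One way to avoid treating a timeout as a failure is to keep the request pending and match any late reply back to it. The sketch below assumes each request carries an id that the reply echoes; that id scheme, the `PendingRequests` class, and its method names are all hypothetical and are not something the existing Maui or TORQUE code is known to provide.

```python
class PendingRequests:
    """Hypothetical tracker that matches replies to requests by id."""

    def __init__(self):
        self.next_id = 0
        self.pending = {}  # request id -> request name

    def send(self, name):
        self.next_id += 1
        self.pending[self.next_id] = name
        return self.next_id

    def on_timeout(self, req_id):
        # A timeout means "outcome unknown": the request stays pending.
        return "unknown" if req_id in self.pending else "done"

    def on_reply(self, req_id, payload):
        # A late reply is still matched to the request that caused it.
        name = self.pending.pop(req_id, None)
        if name is None:
            return None  # unexpected or duplicate reply
        return (name, payload)


tracker = PendingRequests()
run_id = tracker.send("runjob")
status_after_timeout = tracker.on_timeout(run_id)
stat_id = tracker.send("statnode")
late = tracker.on_reply(run_id, "job started")
stat = tracker.on_reply(stat_id, "node list")
```

With this bookkeeping the late runjob reply is credited to runjob, and the statnode request still receives its own reply.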
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se