[torqueusers] Problem with torque's connection timeout handling.

Wightman wightman at clusterresources.com
Thu Nov 4 10:31:39 MST 2004


We have been working with a few other sites regarding this problem.  We
are looking for the cleanest solution.  

Would it be useful to have a parameter called MINRESTARTTIME which would
be the time Maui/Moab would wait before it tries to start the job again?
This could give the resource manager time to start the job up
successfully and eventually report it back to Maui/Moab.
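
As a rough illustration of the idea, here is a minimal sketch in Python
(not Maui's C). Everything in it is an assumption made for the example:
RestartGate, record_start_failure, may_start and the 60 second value are
hypothetical names and numbers, not existing Maui, Moab or TORQUE code.
Time is passed in explicitly so the logic can be followed without a real
clock.

```python
# Hypothetical sketch: not Maui/Moab code. All names are illustrative.

MIN_RESTART_TIME = 60.0  # assumed value, in seconds


class RestartGate:
    """Remembers when each job was last reported as failing to start."""

    def __init__(self, min_restart_time=MIN_RESTART_TIME):
        self.min_restart_time = min_restart_time
        self._last_failure = {}  # job id -> time of the last reported start failure

    def record_start_failure(self, job_id, now):
        # Called when the resource manager reports that a start attempt failed.
        self._last_failure[job_id] = now

    def may_start(self, job_id, now):
        # A job with no recorded failure can be started right away.
        failed_at = self._last_failure.get(job_id)
        if failed_at is None:
            return True
        # Otherwise hold the job back until the minimum restart time has passed.
        return now - failed_at >= self.min_restart_time


gate = RestartGate(min_restart_time=60.0)
gate.record_start_failure("1234.server", now=0.0)
print(gate.may_start("1234.server", now=30.0))  # still inside the wait window
print(gate.may_start("1234.server", now=60.0))  # wait window has elapsed
```

During the wait window the scheduler would simply skip the job, which
gives a slow start attempt a chance to show up as running before a
second attempt is made.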

If this sounds helpful, please let us know and we will implement that
parameter in Maui/Moab.

Thanks,

Douglas


On Thu, 2004-11-04 at 03:52, Ake wrote:
> The following is a problem that has been hitting us quite a lot lately.
> It boils down to a problem with how request-reply timeouts are handled.
> 
> The problem can be triggered VERY easily by hand but was found by a real
> situation.
> 
> What happens is that jobs that take a bit longer than usual to start
> will sometimes end up as two simultaneous instances on our cluster.
> 
> This is caused by PBSD_commit timing out in PBSD_rdrpy while pbs_mom is
> trying to start an MPI job, whether because of slow prologue scripts,
> slow NFS, or something similar. The pbs_server then decides that the job
> start failed and reports that back to Maui, which tries to start it
> somewhere else and succeeds. Meanwhile the original start attempt has
> finished, and we now have two instances running in different parts of
> the cluster at the same time.
> 
> A very simple way to trigger this is to put a "sleep 30" in the prologue
> on all pbs_moms.
> 
> Unfortunately the "simple" solution of letting PBSD_commit loop until it
> gets the reply back from the mom in the first place will hang Maui at
> its next iteration in PBSInitialize. This is caused by pbs_server and
> Maui getting out of sync. Maui will wind up in pbs_disconnect, where it
> will receive a reply to the statusnode request that it sent in
> PBSClusterQuery.
> 
> The only working solution that I can see to this, correct me if I'm
> wrong, is a total rewrite of the request-reply handling in torque.
> It feels like it needs more asynchronous I/O handling (threads?) and a
> better state handler that doesn't time out when it really shouldn't.
> 
> This also affects Maui's PBS interface.
> 
> If you want logs of this I can generate them easily.
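
To make the state handling idea above concrete, here is a minimal sketch
in Python (not TORQUE's C). It is only an illustration under assumptions:
RequestTracker, StartState and every method name are hypothetical and do
not exist in TORQUE or Maui. The two points it tries to show are that a
timed-out start request is marked "unknown" rather than "failed", and
that replies are matched to requests by id so a late reply cannot be
mistaken for the answer to a different request.

```python
# Hypothetical sketch: not TORQUE code. All names are illustrative.
from enum import Enum


class StartState(Enum):
    PENDING = "pending"  # request sent, no reply yet
    UNKNOWN = "unknown"  # waited past the timeout; the mom may still be starting the job
    RUNNING = "running"
    FAILED = "failed"


class RequestTracker:
    def __init__(self, timeout):
        self.timeout = timeout
        self._next_id = 0
        self._pending = {}   # request id -> (job id, time sent)
        self.job_state = {}  # job id -> StartState

    def send_start(self, job_id, now):
        # Tag each request with its own id so the reply can be matched later.
        req_id = self._next_id
        self._next_id += 1
        self._pending[req_id] = (job_id, now)
        self.job_state[job_id] = StartState.PENDING
        return req_id

    def tick(self, now):
        # Overdue requests become UNKNOWN, not FAILED, and stay tracked.
        for job_id, sent in self._pending.values():
            overdue = now - sent >= self.timeout
            if overdue and self.job_state[job_id] is StartState.PENDING:
                self.job_state[job_id] = StartState.UNKNOWN

    def on_reply(self, req_id, ok):
        # Match the reply to its request by id, however late it arrives.
        entry = self._pending.pop(req_id, None)
        if entry is None:
            return  # reply to a request we are not tracking: ignore it
        job_id, _sent = entry
        self.job_state[job_id] = StartState.RUNNING if ok else StartState.FAILED

    def may_start_elsewhere(self, job_id):
        # Only a definite failure makes it safe to place the job somewhere else.
        return self.job_state.get(job_id) is StartState.FAILED


tracker = RequestTracker(timeout=10.0)
req = tracker.send_start("1234.server", now=0.0)
tracker.tick(now=30.0)  # slow prologue: the request is overdue
print(tracker.job_state["1234.server"].name)
print(tracker.may_start_elsewhere("1234.server"))
tracker.on_reply(req, ok=True)  # the late reply finally arrives
print(tracker.job_state["1234.server"].name)
```

With this kind of bookkeeping the scheduler would not be told that the
start failed just because the reply was slow, so it would have no reason
to start a second instance somewhere else.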


