[torqueusers] Problem with torque's connection timeout handling.

Ake Ake.Sandgren at hpc2n.umu.se
Thu Nov 4 03:52:44 MST 2004


The following is a problem that has been hitting us quite a lot lately.
It boils down to how request-reply timeouts are handled.

The problem can be triggered VERY easily by hand, but we first ran into
it in a real situation.

What happens is that jobs that take a bit longer than usual to start
will sometimes end up running as two simultaneous instances on our cluster.

This is caused by PBSD_commit timing out in PBSD_rdrpy while pbs_mom is
trying to start an MPI job, either because the prologue scripts are
slow, or NFS is slow, or whatever. The pbs_server then decides that the
job start failed and reports that back to Maui, which tries to start
the job somewhere else and succeeds. Meanwhile the original start
attempt has finished, and we now have two instances running in
different parts of the cluster at the same time.
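
To illustrate the shape of the problem, here is a rough sketch of the
client side (my own simplification, not the actual TORQUE source; the
names commit_job and REPLY_TIMEOUT are made up): the reply is waited
for with one fixed timeout, so a mom that is merely slow looks exactly
like a mom that failed.

#include <sys/select.h>
#include <sys/time.h>

#define REPLY_TIMEOUT 20        /* seconds the client is willing to wait */

/* Hypothetical commit path: returns 0 on success, -1 on (apparent) failure. */
int commit_job(int sock)
{
        fd_set rfds;
        struct timeval tv = { REPLY_TIMEOUT, 0 };

        /* ... the commit request has already been written to sock ... */

        FD_ZERO(&rfds);
        FD_SET(sock, &rfds);

        if (select(sock + 1, &rfds, NULL, NULL, &tv) <= 0) {
                /* Timeout (or error): the mom may still be busy in the
                 * prologue and will eventually start the job, but the
                 * caller is told the start failed, and the scheduler
                 * happily retries the job on another node. */
                return -1;
        }

        /* read and decode the actual reply here */
        return 0;
}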

A very simple way to trigger this is to put a sleep 30 in the prologue
on all pbs_moms.

Unfortunately the "simple" solution to let PBSD_commit loop until it
gets the reply back from the mom in the first place will hang maui at
it's next iteration in PBSInitialize. This is caused by pbs_server and
maui getting out of sync. Maui will wind up in pbs_disconnect where it
will receive a reply to the statusnode request that it sent in
PBSClusterQuery.
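
A toy model of that off-by-one effect (again simplified, not the real
protocol code; reply_stream and read_next_reply are made-up names):
requests and replies share one ordered connection, so a reply that is
skipped or consumed in the wrong place shifts every later reply onto
the wrong call.

#include <stdio.h>

/* One connection carries one ordered stream of replies; whoever reads
 * next gets whatever reply happens to sit at the head of the stream. */
static const char *reply_stream[] = { "statusnode reply", "commit reply" };
static int next_reply = 0;

static void read_next_reply(const char *caller)
{
        printf("%s read: %s\n", caller, reply_stream[next_reply++]);
}

int main(void)
{
        /* PBSClusterQuery has already sent a statusnode request, but the
         * commit path gave up (or looped) without its reply being consumed
         * where it was expected ... */

        /* ... so the next routine that touches the connection -- standing
         * in for pbs_disconnect here -- eats the statusnode reply that
         * PBSClusterQuery was waiting for, and from then on every reply
         * is matched to the wrong request. */
        read_next_reply("pbs_disconnect");
        read_next_reply("whoever reads next");

        return 0;
}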

The only working solution that I can see to this, correct me if I'm
wrong, is a total rewrite of the request-reply handling in torque.
It feels like it needs more asynchronous I/O handling (threads?) and a
better state handler that doesn't time out when it really shouldn't.
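
Something along these lines is what I mean by a state handler that
doesn't give up too early (just a sketch of the idea, not proposed
code; wait_for_reply and POLL_INTERVAL are made-up names): keep polling
the connection and only fail when the peer is really gone, so a slow
prologue counts as "still working" rather than "failed".

#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>

#define POLL_INTERVAL 5         /* seconds between liveness checks */

/* Wait for a reply without a hard deadline.  Returns 0 when data is
 * ready to be read, -1 only when the connection itself is broken. */
int wait_for_reply(int sock)
{
        for (;;) {
                fd_set rfds;
                struct timeval tv = { POLL_INTERVAL, 0 };

                FD_ZERO(&rfds);
                FD_SET(sock, &rfds);

                switch (select(sock + 1, &rfds, NULL, NULL, &tv)) {
                case -1:
                        if (errno == EINTR)
                                continue;       /* interrupted, retry */
                        return -1;              /* connection really is dead */
                case 0:
                        /* No data yet: the mom is presumably still running
                         * the prologue.  A real implementation would record
                         * "start still pending" in the job state here
                         * instead of declaring the start a failure. */
                        continue;
                default:
                        return 0;               /* reply is ready */
                }
        }
}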

This also affects Maui's PBS interface.

If you want logs of this, I can generate them easily.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se	Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se

