[torquedev] Race conditions in IM_ protocol.
Ken Nielson
knielson at adaptivecomputing.com
Thu Jun 10 07:04:11 MDT 2010
Simon,
This is what I would expect if a prologue fails. Why do you think it is a race condition?
Ken
----- Original Message -----
From: "Mgr. Šimon Tóth" <SimonT at mail.muni.cz>
To: "Torque Dev. Mailing List" <torquedev at supercluster.org>
Sent: Thursday, June 10, 2010 6:58:09 AM
Subject: [torquedev] Race conditions in IM_ protocol.
As I have diverged from upstream a lot, I'm not sure whether this has
already been fixed, but I have found race conditions in the IM_
protocol.
Specifically, when IM_JOIN fails due to one of the prologs returning
a non-zero value, this is what happens:
- sister: reports system error and purges the job
- master: exec_bail is run, sending IM_ABORT to all sisters
- master: exec_bail sets job into EXITING substate
- master: scan_for_exiting sends obit to server
- master: callback for the obit sets the job substate into OBIT
- sister: receives IM_ABORT, doesn't find the job (already purged)
- sister: reports error
- master: receives error for IM_ABORT and switches the job into EXITING substate
- everything: fails
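The ordering above can be replayed with a small toy model. This is only a sketch and not TORQUE code: the `MasterJob` class, its method names, the `STARTING` placeholder substate, and the `guard_late_errors` flag are all hypothetical and exist purely for illustration. It shows the master-side substate moving from EXITING to OBIT and then back to EXITING when the late IM_ABORT error arrives, and one possible (assumed) guard that ignores the error once the obit has been sent.

```python
from enum import Enum


class Substate(Enum):
    STARTING = "STARTING"  # placeholder for whatever substate precedes the failure
    EXITING = "EXITING"
    OBIT = "OBIT"


class MasterJob:
    """Toy model of the master-side job record (hypothetical, not TORQUE code)."""

    def __init__(self, guard_late_errors=False):
        self.substate = Substate.STARTING
        self.history = [self.substate]
        # Hypothetical switch: ignore IM_ABORT errors that arrive after the obit.
        self.guard_late_errors = guard_late_errors

    def _set(self, new_substate):
        self.substate = new_substate
        self.history.append(new_substate)

    def exec_bail(self):
        # IM_JOIN failed: IM_ABORT is sent to the sisters and the job starts exiting.
        self._set(Substate.EXITING)

    def obit_callback(self):
        # scan_for_exiting has sent the obit; its callback marks the job OBIT.
        self._set(Substate.OBIT)

    def im_abort_error(self):
        # A sister that already purged the job answers IM_ABORT with an error.
        if self.guard_late_errors and self.substate is Substate.OBIT:
            return  # assumed fix: the job is already past EXITING, so do nothing
        self._set(Substate.EXITING)


def replay(guard_late_errors):
    """Replay the message ordering from the list above and return the substates seen."""
    job = MasterJob(guard_late_errors=guard_late_errors)
    job.exec_bail()
    job.obit_callback()
    job.im_abort_error()
    return [s.value for s in job.history]


if __name__ == "__main__":
    print(replay(guard_late_errors=False))
    print(replay(guard_late_errors=True))
```

Without the guard the history ends with the job back in EXITING after it had already reached OBIT, which is the regression described in the last two steps. With the guard the late error is dropped and the job stays in OBIT. Whether ignoring the error is the right fix for the real code is an open question; the sketch only makes the ordering problem concrete.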
-- Mgr. Šimon Tóth
_______________________________________________
torquedev mailing list
torquedev at supercluster.org
http://www.supercluster.org/mailman/listinfo/torquedev