[torquedev] Race conditions in IM_ protocol.

Ken Nielson knielson at adaptivecomputing.com
Thu Jun 10 07:04:11 MDT 2010


This is what I would expect if a prologue fails. Why do you think it is a race condition?


----- Original Message -----
From: "Mgr. Šimon Tóth" <SimonT at mail.muni.cz>
To: "Torque Dev. Mailing List" <torquedev at supercluster.org>
Sent: Thursday, June 10, 2010 6:58:09 AM
Subject: [torquedev] Race conditions in IM_ protocol.

As I have diverged from upstream a lot, I'm not sure whether this has
already been fixed, but I have found race conditions in the IM_ protocol.

Specifically, when IM_JOIN fails due to one of the prologs returning a
non-zero value, this is what happens:

- sister: reports system error and purges the job
- master: exec_bail is run, sending IM_ABORT to all sisters
- master: exec_bail sets job into EXITING substate
- master: scan_for_exiting sends obit to server
- master: callback for the obit sets the job substate into OBIT
- sister: receives IM_ABORT, doesn't find the job (already purged)
- sister: reports error
- master: receives error for IM_ABORT and switches the job into EXITING
  substate
- everything: fails
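
The ordering above can be replayed with a small simulation. This is only a sketch of the sequence as described, not TORQUE code: the classes, the method names (other than exec_bail, which is named above), the job id "job1" and the RUNNING placeholder state are all made up for illustration. It assumes the obit reply from the server reaches the master before the sister's error reply to IM_ABORT.

```python
from enum import Enum


class Substate(Enum):
    RUNNING = "RUNNING"  # hypothetical placeholder for the state before the failure
    EXITING = "EXITING"
    OBIT = "OBIT"


class Master:
    """Hypothetical stand-in for the master (mother superior) side."""

    def __init__(self):
        self.substate = Substate.RUNNING
        self.history = []

    def _set(self, substate, reason):
        self.substate = substate
        self.history.append((substate.name, reason))

    def exec_bail(self, sisters):
        # Queue IM_ABORT for every sister, then mark the job EXITING.
        pending = [(sister, "IM_ABORT") for sister in sisters]
        self._set(Substate.EXITING, "exec_bail")
        return pending

    def obit_callback(self):
        # Reply to the obit that scan_for_exiting sent to the server.
        self._set(Substate.OBIT, "obit reply from server")

    def on_abort_error(self):
        # Late error reply to IM_ABORT pushes the job back to EXITING.
        self._set(Substate.EXITING, "error reply to IM_ABORT")


class Sister:
    """Hypothetical stand-in for a sister node."""

    def __init__(self):
        self.jobs = {"job1"}

    def prolog_failed(self, jobid):
        # Report a system error and purge the job.
        self.jobs.discard(jobid)

    def receive(self, jobid, msg):
        # Any message for a job that was already purged yields an error.
        return "OK" if jobid in self.jobs else "ERROR"


def replay():
    master, sister = Master(), Sister()
    sister.prolog_failed("job1")
    pending = master.exec_bail([sister])
    master.obit_callback()  # assumed to arrive before the sister's reply
    for target, msg in pending:
        if target.receive("job1", msg) == "ERROR":
            master.on_abort_error()
    return master.history


if __name__ == "__main__":
    for substate, reason in replay():
        print(substate, "<-", reason)
```

In this replay the job substate goes EXITING, then OBIT, then back to EXITING, which is the regression described in the last two steps of the list.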

-- Mgr. Šimon Tóth

_______________________________________________ torquedev mailing list
torquedev at supercluster.org
