[torquedev] Race conditions in IM_ protocol.

Ken Nielson knielson at adaptivecomputing.com
Thu Jun 10 08:34:53 MDT 2010


Simon,

Thanks for the extra information.

Ken

----- Original Message -----
From: "\"Mgr. Šimon Tóth\"" <SimonT at mail.muni.cz>
To: "Ken Nielson" <knielson at adaptivecomputing.com>
Cc: "Torque Developers mailing list" <torquedev at supercluster.org>
Sent: Thursday, June 10, 2010 7:09:23 AM
Subject: Re: [torquedev] Race conditions in IM_ protocol.

> This is what I would expect if a prologue fails. Why do you think it
> is a race condition?

Because the job ends up stuck in the wrong substate: it is back in
EXITING, but the obit has already been sent.

> ----- Original Message -----
> From: "\"Mgr. Šimon Tóth\"" <SimonT at mail.muni.cz>
> To: "Torque Dev. Mailing List" <torquedev at supercluster.org>
> Sent: Thursday, June 10, 2010 6:58:09 AM
> Subject: [torquedev] Race conditions in IM_ protocol.
>
> As I have diverged from upstream quite a lot, I'm not sure whether
> this has already been fixed, but I have found race conditions in the
> IM_ protocol.
>
> Specifically, when IM_JOIN fails because one of the prologues returns
> a non-zero value, this is what happens:
>
> - sister: reports a system error and purges the job
> - master: exec_bail is run, sending IM_ABORT to all sisters
> - master: exec_bail sets the job into the EXITING substate
> - master: scan_for_exiting sends the obit to the server
> - master: the obit callback sets the job substate to OBIT
> - sister: receives IM_ABORT, doesn't find the job (already purged)
> - sister: reports an error
> - master: receives the error for IM_ABORT and switches the job back
> into the EXITING substate
> - everything: fails
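
Below is a minimal, self-contained C sketch of that last step. It only
mimics the shape of TORQUE's job structure and JOB_SUBSTATE_* values,
and the handler name handle_im_abort_error() is hypothetical; it shows
how an error reply that arrives after the obit callback regresses the
substate, and how a guard on the OBIT substate would avoid that:

#include <stdio.h>

enum { JOB_SUBSTATE_RUNNING, JOB_SUBSTATE_EXITING, JOB_SUBSTATE_OBIT };

typedef struct { int ji_substate; } job;

/* Hypothetical handler for an error reply to IM_ABORT on the master.
 * Without the OBIT guard, a late "unknown job" error from a sister
 * that already purged the job pushes the substate back to EXITING. */
static void handle_im_abort_error(job *pjob)
{
    if (pjob->ji_substate == JOB_SUBSTATE_OBIT)
        return; /* obit already sent; the stale error is expected */

    pjob->ji_substate = JOB_SUBSTATE_EXITING;
}

int main(void)
{
    job j = { JOB_SUBSTATE_RUNNING };

    j.ji_substate = JOB_SUBSTATE_EXITING; /* exec_bail */
    j.ji_substate = JOB_SUBSTATE_OBIT;    /* obit callback */

    handle_im_abort_error(&j);            /* late IM_ABORT error reply */

    printf("final substate = %d (OBIT = %d)\n",
           j.ji_substate, JOB_SUBSTATE_OBIT);
    return 0;
}

With the guard removed, the handler sets the substate back to EXITING
after the obit has already gone out, which is exactly the stuck state
described in the sequence above.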

--
Mgr. Šimon Tóth

