[torquedev] Race conditions in IM_ protocol.

"Mgr. Šimon Tóth" SimonT at mail.muni.cz
Thu Jun 10 07:09:23 MDT 2010


> This is what I would expect if a prologue fails. Why do you think it is a race condition?

Because the job ends up stuck in the wrong substate. It is back in
EXITING, but the obit has already been sent.

> ----- Original Message -----
> From: "\"Mgr. Šimon Tóth\"" <SimonT at mail.muni.cz>
> To: "Torque Dev. Mailing List" <torquedev at supercluster.org>
> Sent: Thursday, June 10, 2010 6:58:09 AM
> Subject: [torquedev] Race conditions in IM_ protocol.
> 
> As I have diverged a lot from upstream, I'm not sure whether this has
> already been fixed, but I have found race conditions in the IM_
> protocol.
> 
> Specifically, when IM_JOIN fails because one of the prologs returns a
> non-zero value, this is what happens:
> 
> - sister: reports system error and purges the job
> - master: exec_bail is run, sending IM_ABORT to all sisters
> - master: exec_bail sets job into EXITING substate
> - master: scan_for_exiting sends obit to server
> - master: callback for the obit sets the job substate into OBIT
> - sister: receives IM_ABORT, doesn't find the job (already purged)
> - sister: reports error
> - master: receives error for IM_ABORT and switches the job into EXITING
>   substate
> - everything: fails
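
The sequence quoted above can be replayed as a toy state machine. This is only a rough sketch in Python, not Torque code: the class MasterJob, the method on_im_abort_error, the obit_sent flag and the guard_late_errors switch are hypothetical names chosen for illustration, and the assumption that scan_for_exiting will not send a second obit is mine. The guard is one possible way to ignore the late IM_ABORT error, not a confirmed fix.

```python
from enum import Enum, auto


class Substate(Enum):
    PRERUN = auto()
    EXITING = auto()
    OBIT = auto()


class MasterJob:
    """Toy model of the job record on the master node (hypothetical)."""

    def __init__(self, guard_late_errors=False):
        self.substate = Substate.PRERUN
        self.obit_sent = False
        self.guard_late_errors = guard_late_errors
        self.history = [self.substate]

    def _set(self, substate):
        self.substate = substate
        self.history.append(substate)

    def exec_bail(self):
        # IM_JOIN failed: IM_ABORT goes out to the sisters (not modelled)
        # and the job is put into the EXITING substate.
        self._set(Substate.EXITING)

    def scan_for_exiting(self):
        # Sends the obit for an EXITING job. Assumption: an obit that was
        # already sent is never sent again.
        if self.substate is Substate.EXITING and not self.obit_sent:
            self.obit_sent = True
            return True
        return False

    def obit_reply(self):
        # Callback for the obit moves the job into the OBIT substate.
        self._set(Substate.OBIT)

    def on_im_abort_error(self):
        # Late error from a sister that had already purged the job.
        if self.guard_late_errors and self.obit_sent:
            return  # hypothetical guard: do not move the job backwards
        self._set(Substate.EXITING)


def replay(job):
    """Run the master-side steps in the order listed in the message."""
    job.exec_bail()
    job.scan_for_exiting()
    job.obit_reply()
    job.on_im_abort_error()
    return job


if __name__ == "__main__":
    stuck = replay(MasterJob())
    print([s.name for s in stuck.history], "obit_sent =", stuck.obit_sent)
    guarded = replay(MasterJob(guard_late_errors=True))
    print([s.name for s in guarded.history], "obit_sent =", guarded.obit_sent)
```

Without the guard the replay ends in EXITING with the obit already sent, so under the stated assumption nothing moves the job forward again. With the guard it stays in OBIT.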

-- 
Mgr. Šimon Tóth
