[torqueusers] epilogue script runs twice

Jeremy Enos jenos at ncsa.uiuc.edu
Mon Mar 15 16:38:54 MDT 2010

This thread seems to have died, but my problem has not.

If I understand the described design purpose correctly (a previous 
epilogue attempt fails, so it tries again), then no two epilogues for 
the same job should ever run simultaneously.  Yet they do.  So perhaps 
I'm seeing a different issue from the intentional retry logic that was 
described.

I've also tried, unsuccessfully, to "lock" the first epilogue in place 
and abort if that lock is already held.  I'm doing this with the 
lockfile utility, and for whatever reason it's not effective in 
preventing multiple epilogues from launching simultaneously for the 
same job.
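
For illustration, here is a minimal sketch of the "lock and abort" idea using a kernel-level flock() instead of the lockfile utility. This is not something I have running in Torque. The helper names (try_job_lock, run_once) and the /tmp lock directory are hypothetical, and it assumes the lock file lives on a node-local filesystem. One property of flock() is that the kernel drops the lock when the holding process exits, so a crashed epilogue cannot leave a stale lock behind.

```python
import fcntl
import os


def try_job_lock(jobid, lock_dir="/tmp"):
    """Take an exclusive, non-blocking lock for this job on this node.

    Returns the open lock file on success (keep it open for as long as the
    lock is needed), or None if another process already holds the lock.
    lock_dir="/tmp" is an assumption; any node-local directory should do.
    """
    path = os.path.join(lock_dir, "epilogue.%s.lock" % jobid)
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock: this instance is a duplicate.
        fh.close()
        return None
    return fh


def run_once(jobid, health_check, lock_dir="/tmp"):
    """Run health_check() only if no other epilogue holds the job lock.

    Returns True if the check ran, False if this instance was a duplicate.
    """
    lock = try_job_lock(jobid, lock_dir)
    if lock is None:
        return False
    try:
        health_check()
    finally:
        lock.close()  # closing the file releases the flock
    return True
```

An epilogue wrapper would call run_once() with the job id it was started with and exit quietly when it returns False.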

Let me explain why it's important to me that this doesn't happen.  In 
the epilogue, I run a health check on a GPU resource, and the check 
fails if the device is inaccessible.  I'm getting loads of false 
positives simply because the device /is/ inaccessible while another 
epilogue is already running a health check.  I can't seem to get 
effective logic in place to prevent this (I already check ps output 
for epilogue processes launched against the given jobid, and it's only 
partially effective).  My only remaining option is to disable my health 
check altogether to avoid the false positives caused by conflicting 
epilogues.
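
An alternative to aborting would be to serialize the health check itself, so that a second epilogue waits for the first check to finish rather than probing a busy device. The sketch below is only an assumption about how that could look: run_serialized, the lock path, and the timeout values are all hypothetical, and the lock file is assumed to be on a node-local filesystem.

```python
import fcntl
import time


def run_serialized(lock_path, check, timeout=60.0, poll=0.5):
    """Run check() while holding an exclusive lock on lock_path.

    Waits up to `timeout` seconds for any other holder to finish, polling
    every `poll` seconds, and raises TimeoutError if the lock never frees.
    """
    deadline = time.monotonic() + timeout
    with open(lock_path, "w") as fh:
        while True:
            try:
                fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(
                        "lock %s still held after %.1fs" % (lock_path, timeout)
                    )
                time.sleep(poll)
        # The lock is released when the file is closed at the end of the
        # with-block, after check() has returned.
        return check()
```

With this shape, a timeout means "another check is still running", which can be reported separately from a genuine device failure.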

I want and expect a single epilogue (or epilogue.parallel) instance per 
job per node, as the documentation describes.  Why is this behavior not 
considered a bug??


On 2/3/2010 5:49 PM, Jeremy Enos wrote:
> Ok- so there is design behind it.  I have two epilogues trampling each 
> other.  What is giving Torque the indication that a job exit failed?  
> In other words, what constitutes a job exit failure?  Perhaps that's 
> where I should be looking to correct this.
> thx-
>     Jeremy
> On 2/3/2010 1:28 PM, Garrick Staples wrote:
>> On Wed, Feb 03, 2010 at 03:59:48AM -0600, Jeremy Enos alleged:
>>> that I shouldn't have to.  Unless of course this behavior is by design
>>> and not an oversight, and if that's the case- I'd be curious to know why.
>> Because the previous job exit failed and it needs to be done again.
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers