[torqueusers] epilogue script runs twice

Jeremy Enos jenos at ncsa.uiuc.edu
Thu Mar 18 16:16:26 MDT 2010


Update-
I now have my workaround (exiting the extra, conflicting epilogue 
scripts) working properly.  I still consider this a serious bug, since I 
wouldn't have had to go through this runaround otherwise.  I'm aware of 
a few other people who are negatively impacted by this as well.  I'll 
file a bug report when I can.
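
For anyone hitting the same issue, the guard I'm describing is roughly 
along these lines (just a sketch: it assumes flock(1) is available on 
the node, uses an arbitrary lock path, and relies on Torque passing the 
job id to the epilogue as $1):

    #!/bin/sh
    # Torque hands the job id to epilogue/epilogue.parallel as $1.
    jobid="$1"

    # Take a per-job lock before doing any real work.  A ps/pgrep check
    # is racy (two instances can each see the other), which is why the
    # lock comes first.
    lockfile="/tmp/epilogue.${jobid}.lock"       # site-specific path
    exec 9>"$lockfile"
    if ! flock -n 9; then
        # Another epilogue instance for this job already holds the lock;
        # exit quietly instead of running the health check a second time.
        exit 0
    fi

    # ... GPU health check and the rest of the epilogue go here ...

    rm -f "$lockfile"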

     Jeremy

On 3/15/2010 5:38 PM, Jeremy Enos wrote:
> This thread seems to have died here, but my problem has not.
>
> If I understand the described design correctly (a previous epilogue 
> attempt fails, so it is tried again), then no two epilogues for the 
> same job should ever run simultaneously.  Yet they do.  So perhaps I'm 
> seeing a different issue from the intentional logic that was described.
>
> I've also tried, unsuccessfully, to "lock" the first epilogue in place 
> and abort any later instance if that lock is already held.  I'm doing 
> this via the lockfile utility, and for whatever reason it's not 
> effective in preventing multiple epilogues from launching 
> simultaneously for the same job.
>
> Let me explain why it's important to me that this doesn't happen: in 
> the epilogue, I run a health check on a GPU resource that reports a 
> failure if the device is inaccessible.  I'm getting loads of false 
> positive detections simply because the device /is/ inaccessible while 
> another epilogue is already running a health check.  I can't seem to 
> get effective logic in place to prevent this from happening (I already 
> check ps info for epilogue processes launched against the given job 
> id, and that is only partially effective).  My only option is to 
> disable my health check altogether to prevent the false positive 
> detections caused by conflicting epilogues.
>
> I want and expect a single epilogue (or epilogue.parallel) instance 
> per job per node, as the documentation describes.  Why is this 
> behavior not considered a bug??
>
>     Jeremy
>
> On 2/3/2010 5:49 PM, Jeremy Enos wrote:
>> OK, so there is a design behind it.  I have two epilogues trampling 
>> each other.  What gives Torque the indication that a job exit 
>> failed?  In other words, what constitutes a job exit failure?  
>> Perhaps that's where I should be looking to correct this.
>> thx-
>>
>>     Jeremy
>>
>>
>> On 2/3/2010 1:28 PM, Garrick Staples wrote:
>>> On Wed, Feb 03, 2010 at 03:59:48AM -0600, Jeremy Enos alleged:
>>>    
>>>> that I shouldn't have to.  Unless of course this behavior is by design
>>>> and not an oversight, and if that's the case- I'd be curious to know why.
>>>>      
>>> Because the previous job exit failed and it needs to be done again.
>>>
>>
>