[torqueusers] epilogue script runs twice
jenos at ncsa.uiuc.edu
Thu Mar 25 12:09:41 MDT 2010
Now that my workaround to prevent multiple epilogues from running is
functioning properly, it has flushed out another major problem. Of the
multiple epilogues launched, which race to create a lockfile or exit (my
workaround), apparently not all are equal. I was having terrible
intermittent problems with my epilogue sequence, which I eventually
traced to the fact that I use the $PBS_NODEFILE environment variable in
the epilogue sequence. Some epilogue instances have it set, some don't!! ??
So depending on which of the multiple epilogues got canceled or won the
lockfile, I may or may not see a failure.
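For what it's worth, here is a minimal sketch of the kind of guard I'm describing, using an atomic mkdir-based lock instead of the lockfile utility, plus an explicit check for the missing $PBS_NODEFILE case. The function and variable names (epilogue_main, run_gpu_health_check, the lock path) are illustrative, not from any actual Torque code:

```shell
#!/bin/sh
# Sketch: serialize concurrent epilogue instances for one job.
# Assumes the first positional argument is the job id, as Torque
# passes to epilogue scripts.
epilogue_main() {
    jobid="${1:-unknown}"
    lockdir="${TMPDIR:-/tmp}/epilogue-${jobid}.lock"

    # mkdir is atomic, so exactly one concurrent instance wins the lock.
    if ! mkdir "$lockdir" 2>/dev/null; then
        return 0    # another epilogue already holds the lock: bail out quietly
    fi
    trap 'rmdir "$lockdir"' EXIT    # release the lock on any exit path

    # Guard against the missing-environment case described above.
    if [ -z "$PBS_NODEFILE" ] || [ ! -r "$PBS_NODEFILE" ]; then
        echo "epilogue: PBS_NODEFILE unset or unreadable; skipping node cleanup" >&2
        return 0
    fi

    # run_gpu_health_check    # hypothetical; safe now that we hold the lock
    return 0
}

epilogue_main "$@"
```

This only narrows the race rather than fixing the underlying duplicate-launch bug, of course, and a stale lock directory would need cleanup if an epilogue is killed uncleanly.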
On 3/18/2010 5:16 PM, Jeremy Enos wrote:
> I have my workaround working (exiting the extra conflicting epilogue
> scripts) properly now. I still consider this a serious bug, since I
> wouldn't have had to go through this runaround otherwise. I'm aware
> of a few other people that are negatively impacted by this as well.
> I'll post a bug when I can.
> On 3/15/2010 5:38 PM, Jeremy Enos wrote:
>> This thread seems to have died here, but my problem has not.
>> If I understand the described design purpose correctly (a previous
>> epilogue attempt fails, so it is tried again), then no two
>> epilogues for the same job should ever run simultaneously. Yet they
>> do. So perhaps I'm seeing a different issue from the intentional
>> retry logic that was described.
>> I've also tried unsuccessfully to "lock" the first epilogue in place
>> and abort if that lock is already held. I'm doing this via the
>> lockfile utility, and for whatever reason it's not effective in
>> preventing multiple epilogues from launching simultaneously for the
>> same job.
>> Let me explain why it's important for me that this doesn't happen: in
>> the epilogue, I run a health check on a GPU resource which reports a
>> failure condition if the device is inaccessible. I'm getting loads
>> of false positives simply because the device /is/ inaccessible
>> while another epilogue is already running a health check. I can't
>> seem to get effective logic in place to prevent this (I already
>> check ps output for epilogue processes launched against the given
>> job id, and that is only partially effective). My only option is to
>> disable my health check altogether to prevent the false positives
>> caused by conflicting epilogues.
>> I want and expect a single epilogue (or epilogue.parallel) instance
>> per job per node, as the documentation describes. Why is this
>> behavior not considered a bug??
>> On 2/3/2010 5:49 PM, Jeremy Enos wrote:
>>> OK, so there is design behind it. I have two epilogues trampling
>>> each other. What gives Torque the indication that a job exit
>>> failed? In other words, what constitutes a job exit failure?
>>> Perhaps that's where I should be looking to correct this.
>>> On 2/3/2010 1:28 PM, Garrick Staples wrote:
>>>> On Wed, Feb 03, 2010 at 03:59:48AM -0600, Jeremy Enos alleged:
>>>>> that I shouldn't have to. Unless of course this behavior is by design
>>>>> and not an oversight, and if that's the case- I'd be curious to know why.
>>>> Because the previous job exit failed and it needs to be done again.
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org