[torqueusers] epilogue script runs twice
Jeremy Enos
jenos at ncsa.uiuc.edu
Tue Jun 15 07:17:00 MDT 2010
More on this:
Epilogues are supposed to be called with certain arguments:
#argv[1] job id
#argv[2] job execution user name
#argv[3] job execution group name
#argv[4] job name
#argv[5] session id
#argv[6] list of requested resource limits
#argv[7] list of resources used by job
#argv[8] job execution queue
#argv[9] job account
When multiple epilogues run, some of them are launched without any of these args.
Is this expected behavior?
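A minimal sketch of how an epilogue could detect the missing-argument case, assuming the epilogue is written in Python. The field names and the parse_epilogue_args helper are hypothetical and only mirror the list above; nothing here is Torque API.

```python
import sys

# Hypothetical field names, in the order of argv[1]..argv[9] listed above.
EPILOGUE_ARGS = (
    "job_id",
    "user",
    "group",
    "job_name",
    "session_id",
    "resources_requested",
    "resources_used",
    "queue",
    "account",
)


def parse_epilogue_args(argv):
    """Map argv[1:] onto the expected names.

    Returns None when fewer arguments than expected were passed, which is
    the case described above for some of the duplicate epilogues.
    """
    values = argv[1:]
    if len(values) < len(EPILOGUE_ARGS):
        return None
    return dict(zip(EPILOGUE_ARGS, values))
```

An epilogue could call parse_epilogue_args(sys.argv) first and log and bail out when it returns None, rather than carry on without a job id.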
thx-
Jeremy
On 3/25/2010 1:09 PM, Jeremy Enos wrote:
> Update:
>
> Since my workaround to prevent multiple epilogues from running started
> functioning properly, it has flushed out another major problem. Of
> the multiple epilogues launched which race to create a lockfile or
> exit (my workaround), apparently not all are equal. I was having
> terrible intermittent problems with my epilogue sequence. I
> eventually traced them to the fact that I use the $PBS_NODEFILE
> environment variable in the epilogue sequence. Some epilogues have
> it, some don't!! ??
> So depending on which of the multiple epilogues got canceled or got
> the lockfile, I may or may not have a failure.
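A defensive check along these lines would at least make that failure explicit instead of intermittent. This is a sketch assuming a Python epilogue; read_nodefile is a hypothetical helper, and the only name taken from Torque is the $PBS_NODEFILE variable itself.

```python
import os


def read_nodefile(environ=None):
    """Return the node names listed in $PBS_NODEFILE, or None.

    None means the variable is unset or does not point at a readable
    file, so the caller can tell "no node list" apart from an empty one.
    """
    environ = os.environ if environ is None else environ
    path = environ.get("PBS_NODEFILE")
    if not path or not os.path.isfile(path):
        return None
    with open(path) as fh:
        # One node name per line; skip blank lines.
        return [line.strip() for line in fh if line.strip()]
```

An epilogue instance that gets None back knows it is one of the instances without the variable and can skip the node-dependent steps.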
>
> Jeremy
>
> On 3/18/2010 5:16 PM, Jeremy Enos wrote:
>> Update-
>> I have my workaround working (exiting the extra conflicting epilogue
>> scripts) properly now. I still consider this a serious bug, since I
>> wouldn't have had to go through this runaround otherwise. I'm aware
>> of a few other people that are negatively impacted by this as well.
>> I'll post a bug when I can.
>>
>> Jeremy
>>
>> On 3/15/2010 5:38 PM, Jeremy Enos wrote:
>>> This seemed to kind of die here, but my problem has not.
>>>
>>> If I understand the description of the design purpose correctly
>>> (previous epilogue attempt fails, so it tries again), then no two
>>> epilogues for the same job should ever run simultaneously. Yet they
>>> do. So perhaps I'm seeing a different issue from the intentional
>>> logic that was described.
>>>
>>> I've also tried unsuccessfully to "lock" the first epilogue in
>>> place, and abort if that lock is already in place. I'm doing this
>>> via the lockfile utility- and for whatever reason, it's not
>>> effective in preventing multiple epilogues from launching
>>> simultaneously for the same job.
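For comparison, here is a sketch of a different way to take such a lock: atomic file creation with O_CREAT | O_EXCL. This is not the lockfile utility mentioned above, the helper names are hypothetical, and it assumes the lock path is on a local filesystem (exclusive create has historically been unreliable on some NFS setups).

```python
import os


def try_acquire_lock(lock_path):
    """Atomically create lock_path. Return True only for the creator.

    O_CREAT | O_EXCL makes the open fail if the file already exists, so
    of several processes racing for the same path only one gets True.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fh:
        # Record the owner's pid to help with debugging stale locks.
        fh.write(str(os.getpid()))
    return True


def release_lock(lock_path):
    """Remove the lock file; a missing file is not an error."""
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        pass
```

With a per-job lock path, an epilogue that gets False from try_acquire_lock would exit immediately, and the one that gets True would run the health check and call release_lock when done.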
>>>
>>> Let me explain why it's important for me that this doesn't happen-
>>> in the epilogue, I run a health check on a GPU resource which has a
>>> failure condition if the device is inaccessible. I'm getting loads
>>> of false positive detections simply because the device /is/
>>> inaccessible while another epilogue is running a health check
>>> already. I can't seem to get effective logic in place to prevent
>>> this from happening (I already check ps info for epilogue processes
>>> launched against the given jobid, and it's only partially
>>> effective). My only option is to disable my health check altogether
>>> to prevent the false positive detection due to conflicting epilogues.
>>>
>>> I want and expect a single epilogue (or epilogue.parallel) instance
>>> per job per node, as the documentation describes. Why is this
>>> behavior not considered a bug??
>>>
>>> Jeremy
>>>
>>> On 2/3/2010 5:49 PM, Jeremy Enos wrote:
>>>> Ok- so there is design behind it. I have two epilogues trampling
>>>> each other. What is giving Torque the indication that a job exit
>>>> failed? In other words, what constitutes a job exit failure?
>>>> Perhaps that's where I should be looking to correct this.
>>>> thx-
>>>>
>>>> Jeremy
>>>>
>>>>
>>>> On 2/3/2010 1:28 PM, Garrick Staples wrote:
>>>>> On Wed, Feb 03, 2010 at 03:59:48AM -0600, Jeremy Enos alleged:
>>>>>
>>>>>> that I shouldn't have to. Unless of course this behavior is by design
>>>>>> and not an oversight, and if that's the case- I'd be curious to know why.
>>>>>>
>>>>> Because the previous job exit failed and it needs to be done again.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>