[torqueusers] epilogue script runs twice

Jeremy Enos jenos at ncsa.uiuc.edu
Tue Jun 15 07:21:24 MDT 2010


Disregard- I misread my error check (I check for several things while 
trying to figure out what's going on).  This is just the same old issue 
of the epilogue not having $PBS_NODEFILE set in its environment.
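
(For anyone hitting the same thing: a guard along these lines at the top 
of a bash epilogue keeps the missing variable from turning into confusing 
downstream failures.  Just a sketch; the logger tag is arbitrary.)

    #!/bin/bash
    # Bail out of the node-file-dependent steps early if PBS_NODEFILE is
    # missing, instead of failing later with misleading errors.  $1 is
    # the job id passed to the epilogue.
    if [ -z "$PBS_NODEFILE" ] || [ ! -r "$PBS_NODEFILE" ]; then
        logger -t epilogue "job $1: no readable PBS_NODEFILE, skipping node checks"
        exit 0
    fi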

     Jeremy

On 6/15/2010 8:17 AM, Jeremy Enos wrote:
> More on this:
> Epilogue scripts are supposed to be called with these arguments:
>
> #argv[1]        job id
> #argv[2]        job execution user name
> #argv[3]        job execution group name
> #argv[4]        job name
> #argv[5]        session id
> #argv[6]        list of requested resource limits
> #argv[7]        list of resources used by job
> #argv[8]        job execution queue
> #argv[9]        job account
>
> When multiple epilogues run for the same job, sometimes they are invoked 
> without any of these args.  Is this expected behavior?
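>
> (A quick way to confirm which invocations arrive without arguments is 
> to log the argument vector at the top of the epilogue; rough bash 
> sketch, log path arbitrary:
>
>     # Record PID and arguments of every epilogue invocation so that
>     # duplicate runs and missing arguments show up in one place.
>     echo "$(date) pid=$$ argc=$# jobid=${1:-none} user=${2:-none}" \
>         >> /var/log/torque-epilogue-args.log
>
> Invocations logged with argc=0 are the ones missing the documented 
> arguments.)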
> thx-
>
>     Jeremy
>
> On 3/25/2010 1:09 PM, Jeremy Enos wrote:
>> Update:
>>
>> Since my workaround to prevent multiple epilogues from running 
>> started functioning properly, it has flushed out another major 
>> problem.  Of the multiple epilogues launched, which race to create a 
>> lockfile or exit (my workaround), apparently not all are equal.  I 
>> was having terrible intermittent problems with my epilogue sequence, 
>> and it eventually traced back to the fact that I use the 
>> $PBS_NODEFILE environment variable in the epilogue.  Some epilogue 
>> invocations have it set and some don't!
>> So depending on which of the multiple epilogues exits early and which 
>> one gets the lockfile, I may or may not see a failure.
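>>
>> (For completeness, a guard that combines both conditions might look 
>> roughly like this in bash; untested sketch, lock path arbitrary, and 
>> mkdir here is just a stand-in for any atomic lock, including the 
>> lockfile utility:
>>
>>     jobid=$1
>>     lockdir=/tmp/epilogue-$jobid.lock
>>     # Skip invocations that never got the node file.
>>     [ -n "$PBS_NODEFILE" ] && [ -r "$PBS_NODEFILE" ] || exit 0
>>     # mkdir is atomic, so only one remaining epilogue wins the lock.
>>     mkdir "$lockdir" 2>/dev/null || exit 0
>>     trap 'rmdir "$lockdir"' EXIT
>>     # ... node-file-dependent cleanup and health checks go here ...
>>
>> The point being that the lock test and the node-file test have to be 
>> combined, or the "winner" can still be an invocation without 
>> $PBS_NODEFILE.)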
>>
>>     Jeremy
>>
>> On 3/18/2010 5:16 PM, Jeremy Enos wrote:
>>> Update-
>>> I now have my workaround (exiting the extra conflicting epilogue 
>>> scripts) working properly.  I still consider this a serious bug, 
>>> since I wouldn't have had to go through this runaround otherwise.  
>>> I'm aware of a few other people who are negatively impacted by this 
>>> as well.  I'll post a bug report when I can.
>>>
>>>     Jeremy
>>>
>>> On 3/15/2010 5:38 PM, Jeremy Enos wrote:
>>>> This seemed to kind of die here, but my problem has not.
>>>>
>>>> If I understand the described design correctly (a previous epilogue 
>>>> attempt failed, so it is tried again), then no two epilogues for 
>>>> the same job should ever run simultaneously.  Yet they do.  So 
>>>> perhaps I'm seeing a different issue than the intentional retry 
>>>> logic that was described.
>>>>
>>>> I've also tried, unsuccessfully, to "lock" the first epilogue in 
>>>> place and abort any later ones if that lock already exists.  I'm 
>>>> doing this via the lockfile utility, and for whatever reason it's 
>>>> not effective at preventing multiple epilogues from launching 
>>>> simultaneously for the same job.
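>>>>
>>>> (For anyone curious, a typical lockfile-based guard looks roughly 
>>>> like this; sketch only, lock path arbitrary:
>>>>
>>>>     jobid=$1
>>>>     # -r 0 = no retries: if another epilogue for this job already
>>>>     # holds the lock, give up immediately and exit quietly.
>>>>     lockfile -r 0 /tmp/epilogue-$jobid.lock || exit 0
>>>>     trap 'rm -f /tmp/epilogue-$jobid.lock' EXIT
>>>>
>>>> but in my case multiple epilogues for the same job still get past 
>>>> this kind of check.)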
>>>>
>>>> Let me explain why it's important for me that this doesn't happen: 
>>>> in the epilogue, I run a health check on a GPU resource which 
>>>> reports a failure if the device is inaccessible.  I'm getting loads 
>>>> of false positives simply because the device /is/ inaccessible 
>>>> while another epilogue is already running its own health check.  I 
>>>> can't seem to get effective logic in place to prevent this (I 
>>>> already check ps output for epilogue processes launched against the 
>>>> given job id, and it's only partially effective).  My only option 
>>>> is to disable the health check altogether to prevent the false 
>>>> positives caused by conflicting epilogues.
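>>>>
>>>> (An alternative I haven't tried would be to serialize just the 
>>>> health check rather than the whole epilogue, e.g. with flock from 
>>>> util-linux; rough sketch, the lock path and health-check script are 
>>>> placeholders:
>>>>
>>>>     # Run the GPU probe under an exclusive node-wide lock so two
>>>>     # concurrent epilogues can't hit the device at the same time.
>>>>     # -w 60 waits up to 60 seconds for the other check to finish
>>>>     # rather than failing immediately.
>>>>     flock -w 60 /var/lock/gpu-healthcheck.lock ./gpu_health_check.sh
>>>>
>>>> That way the second check waits until the device is accessible 
>>>> again instead of reporting a false failure.)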
>>>>
>>>> I want and expect a single epilogue (or epilogue.parallel) instance 
>>>> per job per node, as the documentation describes.  Why is this 
>>>> behavior not considered a bug??
>>>>
>>>>     Jeremy
>>>>
>>>> On 2/3/2010 5:49 PM, Jeremy Enos wrote:
>>>>> Ok- so there is design behind it.  I have two epilogues trampling 
>>>>> each other.  What is giving Torque the indication that a job exit 
>>>>> failed?  In other words, what constitutes a job exit failure?  
>>>>> Perhaps that's where I should be looking to correct this.
>>>>> thx-
>>>>>
>>>>>     Jeremy
>>>>>
>>>>>
>>>>> On 2/3/2010 1:28 PM, Garrick Staples wrote:
>>>>>> On Wed, Feb 03, 2010 at 03:59:48AM -0600, Jeremy Enos alleged:
>>>>>>    
>>>>>>> that I shouldn't have to.  Unless of course this behavior is by design
>>>>>>> and not an oversight, and if that's the case- I'd be curious to know why.
>>>>>>>      
>>>>>> Because the previous job exit failed and it needs to be done again.
>>>>>>