[torqueusers] epilogue script runs twice
jenos at ncsa.uiuc.edu
Mon Mar 15 16:38:54 MDT 2010
This thread seems to have died, but my problem has not.
If I understand the described design purpose correctly (the previous
epilogue attempt fails, so it tries again), then no two epilogues for
the same job should ever run simultaneously. Yet they do. So perhaps
I'm seeing a different issue from the intentional retry logic that was
described.
I've also tried, unsuccessfully, to "lock" the first epilogue in place
and abort if that lock is already held. I'm doing this via the
lockfile utility, and for whatever reason it's not effective in
preventing multiple epilogues from launching simultaneously for the same job.
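
For illustration, the "abort if the lock is already held" idea above can be sketched with a kernel-level advisory lock instead of the lockfile utility. This is a minimal sketch under stated assumptions, not the script actually used here: the function name try_job_lock, the lock file naming, and the /var/lock default are all hypothetical, the directory is assumed to be node-local (not NFS), and the job id is assumed to arrive as the first epilogue argument.

```python
import fcntl
import os
import sys


def try_job_lock(jobid, lock_dir="/var/lock"):
    """Try to take an exclusive, non-blocking lock for this job id.

    Returns the open file object holding the lock, or None if another
    epilogue instance already holds it. The lock is released when the
    file is closed or the process exits, so a crashed epilogue cannot
    leave a stale lock behind.
    """
    # Hypothetical naming scheme: one lock file per job id on this node.
    path = os.path.join(lock_dir, "epilogue.%s.lock" % jobid)
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


def main(argv):
    jobid = argv[1]  # assumption: job id is the first epilogue argument
    lock = try_job_lock(jobid)
    if lock is None:
        # Another epilogue for this job is already running on this node.
        return 0
    try:
        pass  # the GPU health check would run here (site-specific, omitted)
    finally:
        lock.close()  # closing the file releases the lock
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

The design choice in this sketch is that the kernel, not the script, owns the lock state: a second instance either gets the lock or fails at once, with no window between checking for a lock file and creating it.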
Let me explain why it's important to me that this doesn't happen: in
the epilogue, I run a health check on a GPU resource, which has a failure
condition if the device is inaccessible. I'm getting loads of false
positive detections simply because the device /is/ inaccessible while
another epilogue is already running a health check. I can't seem to get
effective logic in place to prevent this from happening (I already check
ps info for epilogue processes launched against the given jobid, and
it's only partially effective). My only option is to disable my health
check altogether to prevent the false positive detection due to
I want and expect a single epilogue (or epilogue.parallel) instance per
job per node, as the documentation describes. Why is this behavior not
considered a bug??
On 2/3/2010 5:49 PM, Jeremy Enos wrote:
> Ok- so there is design behind it. I have two epilogues trampling each
> other. What is giving Torque the indication that a job exit failed?
> In other words, what constitutes a job exit failure? Perhaps that's
> where I should be looking to correct this.
> On 2/3/2010 1:28 PM, Garrick Staples wrote:
>> On Wed, Feb 03, 2010 at 03:59:48AM -0600, Jeremy Enos alleged:
>>> that I shouldn't have to. Unless of course this behavior is by design
>>> and not an oversight, and if that's the case- I'd be curious to know why.
>> Because the previous job exit failed and it needs to be done again.
>> torqueusers mailing list
>> torqueusers at supercluster.org