[torqueusers] Semaphores limit per job/user in torque?
mmoore at ucar.edu
Mon Sep 23 10:27:08 MDT 2013
On 09/23/2013 09:56 AM, David Beer wrote:
> Can you be more specific about what you mean when you say semaphores?
> On Sat, Sep 21, 2013 at 3:09 PM, Andrew Savchenko <bircoph at gmail.com> wrote:
> is it possible to limit or isolate semaphores per job or user at
> worker node in torque?
> At our cluster we have a problem with buggy user jobs which leave
> semaphores behind, leading to semaphore limit exhaustion. While the limit
> may be raised, this is not a proper solution since it will be reached
> again later. ATM we are running a cron job using some heuristics to
> determine which semaphores are safe to clear. But this is still
> nothing but a workaround.
> The proper way is to isolate job or at least user IPC namespace on
> nodes. This can be done using IPC namespace kernel feature, though I
> don't know if torque is capable of this or any other ways to control
> job's IPC.
> ATM we're using torque-3.0.6, though if 4.x branch is capable of this
> feature, it will be a good reason to migrate.
> Best regards,
> Andrew Savchenko
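For readers unfamiliar with the IPC namespace feature mentioned in the quoted message, here is a minimal sketch of the idea, not something the thread confirms Torque can do. It assumes the util-linux `unshare` tool is installed on the node and that the caller has the privileges needed to create a namespace (normally root). The helper name `wrap_in_ipc_namespace` and the job command `./my_job` are hypothetical. A process started this way gets its own private set of System V semaphores, and the kernel destroys them when the last process in the namespace exits.

```python
import shlex


def wrap_in_ipc_namespace(command):
    """Return `command` (a list of argv strings) prefixed so that it
    runs inside a fresh IPC namespace via util-linux `unshare`."""
    # `--` stops unshare's own option parsing, so options belonging
    # to the job command are passed through untouched.
    return ["unshare", "--ipc", "--"] + list(command)


# Hypothetical job command, for illustration only.
wrapped = wrap_in_ipc_namespace(["./my_job", "--input", "data.txt"])
print(shlex.join(wrapped))
```

Where such a wrapper would hook into the job launch on a Torque node is left open here; the thread does not establish it.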
Hmmmm....the problem is that the semaphores continue to take up memory
long after the job is finished, dead, killed, etc. Then the next job
comes along, creates more semaphores, and continues to take up space.
Fast forward through a few more jobs and the IPC buffer space becomes
exhausted.
We hit this with the Intel compiler license checkout, of all things.
This really has to be addressed at the system level. Expecting user code
to solve this really isn't practical: users have no idea that semaphores
are being left hanging around, and if a job crashes it cannot be expected
to clean up after itself.
I finally wrote a short epilogue script (willing to share) to clean
things out after each job completes. We haven't had a problem since.
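The actual script is not included in this message. As a rough sketch of what such an epilogue could look like (an illustration, not the author's script), the Python below assumes that the epilogue receives the job's user name as its second argument, that the Linux `ipcs` and `ipcrm` tools are available, and that `ipcs -s` prints rows of the form `key semid owner perms nsems`. The function names are made up for this example.

```python
#!/usr/bin/env python3
"""Sketch: remove System V semaphores left behind by a job's owner."""
import subprocess
import sys


def semaphore_ids_for_user(ipcs_output, user):
    """Return the semaphore ids owned by `user`, parsed from the
    text printed by `ipcs -s`."""
    ids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key; banner and header rows do not.
        if len(fields) >= 5 and fields[0].startswith("0x") and fields[2] == user:
            ids.append(int(fields[1]))
    return ids


def remove_semaphores(ids):
    """Remove each semaphore set by id; failures are ignored."""
    for semid in ids:
        subprocess.run(["ipcrm", "-s", str(semid)], check=False)


def main(argv):
    # Assumed argument layout: argv[1] = job id, argv[2] = user name.
    if len(argv) < 3:
        return
    user = argv[2]
    if user == "root":
        return  # never touch system-owned semaphores
    result = subprocess.run(["ipcs", "-s"], capture_output=True,
                            text=True, check=True)
    remove_semaphores(semaphore_ids_for_user(result.stdout, user))


if __name__ == "__main__":
    main(sys.argv)
```

This removes every semaphore the user owns on the node, so it is only safe when that user has no other job running there (for example, with exclusive node allocation). Some `ipcs` versions also truncate long owner names, in which case the name match would need adjusting.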
UCAR/NCAR/CGD mmoore at ucar.edu
1850 Table Mesa Drive (W) 303 497-1338
Boulder, CO 80305 (F) 303 497-1324