[torqueusers] Semaphores limit per job/user in torque?
dbeer at adaptivecomputing.com
Mon Sep 23 14:50:47 MDT 2013
So are the processes for these completed jobs still present on the nodes,
or is the issue that exiting the process doesn't guarantee a release of the
semaphores?
If the problem is that the processes are still there, I would look into the
reaver script from pbs tools, or using cpusets for your jobs. Another
common way of attacking this is by attempting to clean up user processes in
an epilogue script. I would recommend the first two options over this one.
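
For reference, the epilogue approach could be sketched roughly as follows.
This is only an illustration, not something shipped with torque: the function
names are made up, it assumes the util-linux column layout of `ipcs -s`
(key, semid, owner, perms, nsems) with the full user name in the owner column,
and it is only safe if the user has no other job running on the same node.

```python
import subprocess


def semids_owned_by(ipcs_output, user):
    """Return the semaphore array IDs that `ipcs -s` output lists for `user`."""
    semids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows look like: key semid owner perms nsems (key starts with 0x).
        # Header and separator rows do not start with 0x and are skipped.
        if len(fields) >= 3 and fields[0].startswith("0x") and fields[2] == user:
            semids.append(int(fields[1]))
    return semids


def cleanup_user_semaphores(user, dry_run=True):
    """Find (and, unless dry_run, remove) semaphore arrays owned by `user`."""
    listing = subprocess.run(
        ["ipcs", "-s"], capture_output=True, text=True, check=True
    ).stdout
    semids = semids_owned_by(listing, user)
    for semid in semids:
        if not dry_run:
            # ipcrm -s removes one semaphore array by its ID.
            subprocess.run(["ipcrm", "-s", str(semid)], check=False)
    return semids
```

In an epilogue, torque passes the job's user name as the second argument, so
the script would call something like cleanup_user_semaphores(sys.argv[2],
dry_run=False) after checking that the user has no other jobs on the node.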
On Mon, Sep 23, 2013 at 11:19 AM, Andrew Savchenko <bircoph at gmail.com> wrote:
> Hello David,
> On Mon, 23 Sep 2013 09:56:13 -0600 David Beer wrote:
> > Andrew,
> > Can you be more specific about what you mean when you say semaphores?
> I mean System V IPC semaphores, they can be seen via ipcs -s and
> their system wide limit is controlled via /proc/sys/kernel/sem.
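
As a side note, /proc/sys/kernel/sem holds four numbers: SEMMSL (max
semaphores per array), SEMMNS (max semaphores system-wide), SEMOPM (max
operations per semop call) and SEMMNI (max number of arrays). A small sketch
for reading them, e.g. from a monitoring script, might look like this; the
function name is made up for illustration.

```python
def parse_sem_limits(text):
    """Parse the contents of /proc/sys/kernel/sem into a dict of limits."""
    names = ("semmsl", "semmns", "semopm", "semmni")
    values = [int(v) for v in text.split()]
    if len(values) != len(names):
        raise ValueError("expected 4 values, got %d" % len(values))
    return dict(zip(names, values))


def read_sem_limits(path="/proc/sys/kernel/sem"):
    """Read the current limits from the running kernel (Linux only)."""
    with open(path) as handle:
        return parse_sem_limits(handle.read())
```

Comparing semmni with the number of arrays listed by ipcs -s gives a rough
idea of how close a node is to exhaustion.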
> > On Sat, Sep 21, 2013 at 3:09 PM, Andrew Savchenko <bircoph at gmail.com>
> > > Hello,
> > >
> > > is it possible to limit or isolate semaphores per job or per user on a
> > > worker node in torque?
> > >
> > > At our cluster we have a problem with buggy user jobs that leave
> > > semaphores behind, leading to semaphore limit exhaustion. While the limit
> > > may be raised, this is not a proper solution since it will be reached
> > > again later. ATM we are running a cron job that uses some heuristics to
> > > determine which semaphores are safe to clear. But this is still
> > > nothing but a workaround.
> > >
> > > The proper way is to isolate the job's, or at least the user's, IPC
> > > namespace on the nodes. This can be done using the kernel's IPC
> > > namespace feature, though I don't know if torque is capable of this or
> > > of any other way to control a job's IPC.
> > >
> > > ATM we're using torque-3.0.6, though if 4.x branch is capable of this
> > > feature, it will be a good reason to migrate.
> Best regards,
> Andrew Savchenko
David Beer | Senior Software Engineer