[torqueusers] Torque/maui jobs terminated prematurely
Robin Humble
rjh at cita.utoronto.ca
Mon Sep 19 16:16:51 MDT 2005
On Wed, Sep 14, 2005 at 01:57:55PM +0100, Baker D.J. wrote:
>Has anyone in the torque/maui community ever seen this sort of behaviour,
>and does anyone have a feel for what might be happening, please? It's
>almost as if a Unix "limit" is coming into force, or perhaps the torque
>system is getting confused and setting a cpu limit.
we saw a similar problem (except with stacksize) on an AS4 cluster.
pbs_mom was getting the root user's default 8M stack size limit, and
any jobs started up (via batch script, lamboot (tm interface), mpirun)
inherited the pbs_mom's stack limits and ignored any limit/ulimit
commands that were set in the user's dotfiles or in the batch script.
I don't know whether this is a bug in torque or not.
the workaround was to put
    ulimit -n 32768      # an OSCAR default?
    ulimit -s unlimited  # crank up stacksize
in /etc/init.d/pbs_mom so that these limits are raised before the
pbs_mom daemon is started. (*)
you might like to run the attached trivial python script via your batch
system to check limits.
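the attachment isn't reproduced in this message, so here is a minimal
sketch of that kind of check rather than the original script. it assumes
a Python 3 interpreter on the compute nodes and uses only the standard
'resource' module; the choice of which limits to print is mine.

```python
#!/usr/bin/env python3
# Sketch of a limits check to submit through the batch system.
# Not the original attachment; the set of limits shown is an assumption.
import resource

# Label -> resource constant. Stack size is reported in bytes here,
# whereas the shell's "ulimit -s" prints kilobytes.
LIMITS = {
    "stack size in bytes (ulimit -s)": resource.RLIMIT_STACK,
    "open files (ulimit -n)": resource.RLIMIT_NOFILE,
    "cpu time in seconds (ulimit -t)": resource.RLIMIT_CPU,
}


def fmt(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report():
    """Return one line per limit with the soft and hard values."""
    lines = []
    for label, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        lines.append(f"{label}: soft={fmt(soft)} hard={fmt(hard)}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

run it once from an interactive login and once from inside a batch job;
if the two outputs differ, the job is inheriting its limits from
somewhere other than your shell.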
another workaround (that will probably only work with LAM) is to
    mpirun C script
instead of
    mpirun C yourExecutable
and then put limit/ulimit commands into 'script' before it runs 'yourExecutable'.
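'script' would normally be a couple of lines of shell. purely as an
illustration, the same idea can be sketched in Python: raise the soft
stack limit as far as the inherited hard limit allows, then exec the
real program. the file name 'wrapper.py' and the argument handling are
hypothetical, and this only helps when the hard limit is higher than
the soft one.

```python
#!/usr/bin/env python3
# Sketch of a wrapper that raises limits before running the real program.
# A Python stand-in for the shell 'script' described above; hypothetical.
import os
import resource
import sys


def raise_soft_to_hard(res):
    """Set the soft limit for `res` to its hard limit; return the new pair."""
    _soft, hard = resource.getrlimit(res)
    # An unprivileged process may raise its soft limit up to the hard
    # limit, but not beyond it.
    resource.setrlimit(res, (hard, hard))
    return resource.getrlimit(res)


def main(argv):
    if len(argv) < 2:
        sys.exit("usage: wrapper.py yourExecutable [args...]")
    raise_soft_to_hard(resource.RLIMIT_STACK)
    # Replace this process with the real program; it inherits the new limits.
    os.execvp(argv[1], argv[1:])


if __name__ == "__main__":
    main(sys.argv)
```

it would be launched as something like 'mpirun C ./wrapper.py yourExecutable',
assuming mpirun passes the trailing arguments through to the wrapper.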
cheers,
robin
(*) however, raising the stacksize seems to have broken i/o from the
Intel Fortran compiler, which for some reason buffers i/o onto the stack
until it runs out of space and only then flushes it to disk.