[torqueusers] Torque/maui jobs terminated prematurely

Robin Humble rjh at cita.utoronto.ca
Mon Sep 19 16:16:51 MDT 2005


On Wed, Sep 14, 2005 at 01:57:55PM +0100, Baker D.J. wrote:
>Has anyone in the torque/maui community ever seen this sort of behaviour,
>and does anyone have a feel for what might be happening, please? It's
>almost as if a Unix "limit" is coming into force, or perhaps the torque
>system is getting confused and setting a cpu limit.

we saw a similar problem (except with stacksize) on an AS4 cluster.

pbs_mom was getting the root user's default 8MB stack size limit, and any
jobs started up (via batch script, lamboot over the TM interface, or mpirun)
inherited pbs_mom's stack limits; any limit/ulimit commands set in the
users' dotfiles or in the batch script were ignored.

I don't know whether this is a bug in torque or not.

the workaround was to put
  ulimit -n 32768      # an OSCAR default?
  ulimit -s unlimited  # crank up stacksize
in /etc/init.d/pbs_mom so that these limits are raised before the
pbs_mom daemon is started. (*)

you might like to run the attached trivial python script via your batch
system to check limits.
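
the attachment isn't reproduced here, but a minimal stand-in that does the
same job might look like this (it just dumps the soft/hard limits the job
actually sees, using python's resource module; not the original script):

  #!/usr/bin/env python
  # print the soft and hard resource limits as seen from inside a batch job
  import resource

  def fmt(val):
      # RLIM_INFINITY is reported by the kernel as "unlimited"
      if val == resource.RLIM_INFINITY:
          return "unlimited"
      return str(val)

  for name in ("RLIMIT_CPU", "RLIMIT_FSIZE", "RLIMIT_DATA", "RLIMIT_STACK",
               "RLIMIT_RSS", "RLIMIT_NOFILE", "RLIMIT_AS"):
      soft, hard = resource.getrlimit(getattr(resource, name))
      print("%-14s soft=%-10s hard=%s" % (name, fmt(soft), fmt(hard)))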

another workaround (that will probably only work with LAM) is to
  mpirun C script
instead of
  mpirun C yourExecutable
and then put limit/ulimit commands into 'script' before it runs 'yourExecutable'.
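
a shell script containing the ulimit commands is the simplest form of
'script'; purely as an illustration (the file name and calling convention
are made up, and it assumes the real executable is passed as the first
argument), the same thing can be done in python by raising the stack soft
limit as far as the hard limit allows and then exec'ing the real program:

  #!/usr/bin/env python
  # hypothetical 'script' wrapper: raise the stack soft limit to the hard
  # limit, then replace this process with the real executable so every
  # LAM-spawned rank runs with the raised limit.
  import os
  import resource
  import sys

  soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
  resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))

  # e.g. invoked as:  mpirun C script yourExecutable [args...]
  os.execvp(sys.argv[1], sys.argv[1:])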

cheers,
robin

(*) however raising the stack size seems to have broken i/o from the
Intel Fortran compiler, which for some reason buffers i/o on the stack
until it runs out of space and only then flushes it to disk.

