[torqueusers] Walltime ellapsed

Gareth.Williams at csiro.au
Fri Dec 18 17:01:42 MST 2009

From: chris job.fr [chrisjob.fr at gmail.com]
Sent: Friday, 18 December 2009 7:49 PM
To: Joshua Bernstein
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] Walltime ellapsed

> These are MPI jobs.
>      The problem is that when the maxwalltime of the queue is
> reached, the job doesn't stop immediately. The job is considered
> finished by TORQUE but stays on the nodes. A new job is then sent to
> these nodes. In general, one node will go down after a while.

Sounds like TORQUE is partly doing the right thing, but the processes are not behaving nicely: they do not go away when TORQUE cancels the job.  This could be because the job or the MPI implementation is badly behaved.  Perhaps you could hunt down and kill the leftover processes in an epilogue/epilogue.parallel script.  This has been discussed here a number of times before.  It's relatively straightforward if users are allocated whole nodes (just kill all of that user's processes), but trickier if jobs are allowed to share nodes.
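
For the whole-node case, a cleanup script might look roughly like the sketch below.  It is only an illustration, not a tested site configuration.  It assumes TORQUE passes the job id and the job owner as the first two arguments to the epilogue, that pkill is available on the compute nodes, and that accounts with a UID below 1000 are system accounts that must never be touched.  The function names and the UID threshold are made up for this example.

```python
#!/usr/bin/env python3
"""Sketch of an epilogue that kills a job owner's leftover processes.

Assumes whole-node allocation: every process owned by the job's user on
this node belongs to the job that has just ended.
"""
import pwd
import subprocess
import sys

# Assumption: accounts below this UID are system accounts and are skipped.
MIN_USER_UID = 1000


def is_regular_user(username, min_uid=MIN_USER_UID):
    """Return True only for existing accounts with UID >= min_uid."""
    try:
        return pwd.getpwnam(username).pw_uid >= min_uid
    except KeyError:
        return False


def build_kill_command(username):
    """Command that sends SIGKILL to every process owned by username."""
    return ["pkill", "-KILL", "-u", username]


def cleanup(username, dry_run=False):
    """Kill the user's processes; return the command, or None if skipped."""
    if not is_regular_user(username):
        return None
    cmd = build_kill_command(username)
    if not dry_run:
        # pkill exits with status 1 when nothing matched, which is fine here.
        subprocess.run(cmd, check=False)
    return cmd


if __name__ == "__main__" and len(sys.argv) >= 3:
    # argv[1] is the job id, argv[2] the job owner (see assumptions above).
    cleanup(sys.argv[2])
```

The same file would presumably be installed as both epilogue and epilogue.parallel so that every node of the job gets cleaned.  Calling cleanup("someuser", dry_run=True) returns the command without running it, which is a safer way to try this out.  Do not use this approach as-is where jobs share nodes, since it would also kill the user's other running jobs on that node.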


