[torqueusers] Walltime ellapsed
chrisjob.fr at gmail.com
Mon Dec 21 01:38:47 MST 2009
2009/12/19 <Gareth.Williams at csiro.au>:
> From: chris job.fr [chrisjob.fr at gmail.com]
> Sent: Friday, 18 December 2009 7:49 PM
> To: Joshua Bernstein
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Walltime ellapsed
>> These are MPI jobs.
>> The problem is that when the maximum walltime of the queue is
>> reached, the job doesn't stop immediately. TORQUE considers the job
>> finished, but its processes stay on the nodes. A new job is then
>> sent to these nodes. In general, one node goes down after a while.
> It sounds like TORQUE is partly doing the right thing, but the processes are not behaving nicely and going away when TORQUE cancels the job. This could be because the job or the MPI implementation is badly behaved. Perhaps you could hunt down and kill the leftover processes in an epilogue/epilogue.parallel script. This has been discussed here a number of times before. It's relatively straightforward if users are allocated whole nodes (just kill all of the user's processes), but trickier if jobs are allowed to share nodes.
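
For the whole-node case described above, a cleanup script might look roughly like the sketch below. It is written in Python and is only an illustration, not a tested site configuration. It assumes TORQUE passes the job id as the first argument and the job owner's user name as the second, that the script runs as root on the node, and that every process owned by that user on the node belongs to the job that just ended. The names leftover_pids and MIN_UID are made up for this example, and the UID cutoff of 1000 is an assumption about how local accounts are numbered.

```python
#!/usr/bin/env python3
"""Sketch of an epilogue that kills a job owner's leftover processes.

Only safe when users are allocated whole nodes: it kills every process
the job owner has on this node, not just those of one job.
"""
import os
import pwd
import signal
import subprocess
import sys

# Assumption: accounts below this UID are system accounts and are never touched.
MIN_UID = 1000


def leftover_pids(ps_output, uid, protected=()):
    """Return PIDs owned by `uid`, given lines of "<pid> <uid>" text.

    PIDs listed in `protected` are skipped.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # ignore blank or malformed lines
        pid, owner = int(fields[0]), int(fields[1])
        if owner == uid and pid not in protected:
            pids.append(pid)
    return pids


def main(argv):
    # Assumed argument layout: argv[1] is the job id, argv[2] the job owner.
    if len(argv) < 3:
        return 2
    uid = pwd.getpwnam(argv[2]).pw_uid
    if uid < MIN_UID:
        return 0  # refuse to kill root or system-account processes
    # Separate -o options avoid ambiguity in how ps parses empty headers.
    ps = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "uid="],
        capture_output=True, text=True, check=True,
    )
    for pid in leftover_pids(ps.stdout, uid, protected={os.getpid()}):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between the listing and the kill
    return 0


# Only act when called with the expected arguments.
if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(main(sys.argv))
```

On nodes shared between jobs this approach would also kill the same user's other running jobs, which is why that case needs something more selective.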