[torqueusers] removal of "stray jobs"

Christopher Samuel samuel at unimelb.edu.au
Mon Dec 10 16:08:32 MST 2012


On 10/12/12 20:28, Lech Nieroda wrote:

> we are currently running Torque 4.1.3 with Maui 3.3.1. The option 
> "mom_job_sync" is on. However, we get "stray" jobs quite often - 
> these are jobs that remain in an "EXITING" state for whatever 
> reason and their <jobid>.JB files are often left lying around.

We see this also on our IBM iDataplex running RHEL 5.8 with Torque
2.4.x, though not on our SGI Altix XE running CentOS 5.8 with the
exact same build (install tree rsync'd from SGI -> IBM).

I suspect it's down to the different mix of users on the two systems,
but it has proved impossible to pin down. The one consistent pattern
is that it always seems to affect jobs where pbs_server has sent a
second request to start the job on a node: the mom logs a successful
start, followed by a message saying that it rejected another attempt.
For example:

11/09/2012 13:36:00;0008;
    pbs_mom;Job;449269-923.merri-m.pcf.vlsci.unimelb.edu.au;JOIN JOB as node 1
11/09/2012 13:36:00;0008;
    task started, tid 2, sid 27290, cmd orted
11/09/2012 13:36:01;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Success (0)
    in req_quejob, cannot queue new job, job exists and is running
11/09/2012 13:36:01;0080;   pbs_mom;Req;req_reject;Reject reply
    code=15009(Job with requested ID already exists MSG=job is running),
    aux=0, type=QueueJob, from PBS_Server at merri-m.pcf.vlsci.unimelb.edu.au
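
For anyone wanting to check their own mom logs for the same pattern,
here is a rough sketch in Python. It assumes each log record sits on a
single line in the file (the excerpt above is wrapped by the mail
client) and that the marker strings match what your Torque version
writes. The function name is made up for illustration, and pairing a
reject with the most recently joined job is only a heuristic, since
the reject line itself does not carry the job ID.

```python
import re

# Timestamp prefix as seen in the excerpt above: MM/DD/YYYY HH:MM:SS;
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});")

# Marker strings copied from the excerpt; other Torque versions may differ.
JOIN_MARKER = ";JOIN JOB as node"
REJECT_MARKER = "code=15009"


def find_rejected_requeues(lines):
    """Return (timestamp, last_joined_job) for each 15009 QueueJob reject.

    last_joined_job is the job ID from the most recent JOIN JOB record
    seen before the reject (a heuristic), or None if there was none.
    """
    hits = []
    last_joined = None
    for line in lines:
        line = line.rstrip("\n")
        m = STAMP.match(line)
        if not m:
            continue  # not the start of a log record
        if JOIN_MARKER in line:
            # The job ID is the field just before the message text.
            last_joined = line.split(";")[-2]
        elif REJECT_MARKER in line and "type=QueueJob" in line:
            hits.append((m.group(1), last_joined))
    return hits
```

Passing an open mom log file to find_rejected_requeues() gives a list
of candidate jobs to compare against those later stuck in EXITING.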

I logged it in Bugzilla here:


-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
