[torqueusers] Jobs in Q state forever (Torque 4.2.5, Maui 3.3.1)

Gus Correa gus at ldeo.columbia.edu
Fri Oct 11 10:22:27 MDT 2013


Dear Torque experts

I installed Torque 4.2.5 and Maui 3.3.1 on this cluster.
It worked for a few days, but now jobs stall in the Q state
and only run when forced with qrun.

I get these syslog error messages on the server,
repeated time and again:

**************************************************************************
Oct 11 04:19:24 master pbs_server: LOG_ERROR::Job not found (15086) in svr_dequejob, Job has no queue
Oct 11 04:34:20 master pbs_server: LOG_ERROR::Time out (15085) in send_job_work, child failed in previous commit request for job 219.master
Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out (15085) in send_job_work, child failed in previous commit request for job 228.master

...

Oct 11 05:31:07 master pbs_server: LOG_ERROR::Batch protocol error (15033) in send_job_work, child failed in previous commit request for job 219.master
Oct 11 05:53:07 master pbs_server: LOG_ERROR::Batch protocol error (15033) in send_job_work, child failed in previous commit request for job 228.master
...
**************************************************************************
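
To see at a glance which jobs and error codes keep recurring, messages like
the ones above can be tallied per job. The following is only a minimal
sketch: it assumes each syslog entry sits on a single line in the format
shown above, and the names tally_errors, LINE_RE, JOB_RE and SAMPLE are
made up for this illustration (they are not part of Torque or Maui).

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above:
#   <timestamp> <host> pbs_server: LOG_ERROR::<message> (<code>) in <function>, <detail>
LINE_RE = re.compile(
    r"pbs_server: LOG_ERROR::(?P<msg>.+?) \((?P<code>\d+)\) in (?P<func>\w+), (?P<detail>.*)"
)
# The job id, when present, follows "for job" in the detail text.
JOB_RE = re.compile(r"for job (?P<job>\S+)")


def tally_errors(lines):
    """Count occurrences of each (job, code, message); job is None if no id is logged."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip anything that is not a pbs_server LOG_ERROR line
        job = JOB_RE.search(m.group("detail"))
        key = (job.group("job") if job else None, int(m.group("code")), m.group("msg"))
        counts[key] += 1
    return counts


# The five messages quoted above, one per line (hypothetical in-memory sample;
# in practice the lines would be read from the server's syslog file).
SAMPLE = [
    "Oct 11 04:19:24 master pbs_server: LOG_ERROR::Job not found (15086) in svr_dequejob, Job has no queue",
    "Oct 11 04:34:20 master pbs_server: LOG_ERROR::Time out (15085) in send_job_work, child failed in previous commit request for job 219.master",
    "Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out (15085) in send_job_work, child failed in previous commit request for job 228.master",
    "Oct 11 05:31:07 master pbs_server: LOG_ERROR::Batch protocol error (15033) in send_job_work, child failed in previous commit request for job 219.master",
    "Oct 11 05:53:07 master pbs_server: LOG_ERROR::Batch protocol error (15033) in send_job_work, child failed in previous commit request for job 228.master",
]

if __name__ == "__main__":
    for (job, code, msg), n in tally_errors(SAMPLE).items():
        print(job, code, msg, n)
```

A tally like this only summarizes what the server logged; it does not by
itself explain why the commit requests fail.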

And here are the jobs forever in Q state:

qstat 219 228
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
219.master                 GC.Base.1981.01  ltmurray               0 Q production
228.master                 g1ms290_lg_1     sw2526                 0 Q production

************

I already restarted pbs_mom and trqauthd on the nodes,
restarted pbs_server, trqauthd and maui on the server,
and repeated the routine many times, but nothing seems to help.
I even rebooted the nodes, to no avail.

At this point the machine is already in production, so
playing hardball this way with the nodes is a real pain
for me and for the users and their jobs.

Questions:

1) What is wrong?

2) Should I downgrade to the old (hopefully reliable) Torque 2.5.X?

3) We know that Torque 4.X.Y currently doesn't work with pbs_sched.
Does it work with Maui at least?
Or only with Moab these days?

Thank you,
Gus Correa

