[torqueusers] Jobs in Q state forever (Torque 4.2.5, Maui 3.3.1)

David Beer dbeer at adaptivecomputing.com
Fri Oct 11 11:41:43 MDT 2013


Gus,

That is a really strange situation.

The error

Oct 11 04:19:24 master pbs_server: LOG_ERROR::Job not found (15086) in
svr_dequejob, Job has no queue

cannot happen while a job is being run. It is related to a job being routed
or moved to a remote server. Are you doing this? Can you provide the sequence
of events that leads to this error?

The other errors:
Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out (15085) in
send_job_work, child failed in previous commit request for job 228.master

can happen during any type of job move: running a job, routing it, or
moving it to a remote server. However, in most cases there should be an
error message before this that provides more information about what the
failure was. Have you looked through the entire log file around these
messages to try to find the root cause of the problem?
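
One way to do that search without scrolling by hand is sketched below in Python. It is only an illustration, not a Torque tool: the function name, the default of five context lines, and the placeholder lines in the sample are my own assumptions. The only thing taken from your log is the "child failed in previous commit request for job ..." text. It assumes each log entry sits on a single line in the real file.

```python
import re
from collections import deque

# Message text copied from the errors quoted in this thread.
FAILURE = re.compile(r"child failed in previous commit request for job (\S+)")


def context_before_failures(lines, before=5):
    """Yield (job_id, preceding_lines, failure_line) for each failure message.

    `lines` is any iterable of log lines (an open file works).
    `before` is how many earlier lines to keep as context (hypothetical default).
    """
    window = deque(maxlen=before)  # rolling buffer of the most recent lines
    for line in lines:
        line = line.rstrip("\n")
        match = FAILURE.search(line)
        if match:
            yield match.group(1), list(window), line
        window.append(line)


# Sample input: the first two entries are made-up placeholders, not real
# Torque output; the last is the message format quoted in this thread.
sample = [
    "placeholder line 1 (hypothetical)",
    "placeholder line 2 (hypothetical)",
    "Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out (15085) in "
    "send_job_work, child failed in previous commit request for job 228.master",
]

for job_id, context, failure in context_before_failures(sample, before=2):
    print("job:", job_id)
    for earlier in context:
        print("  before:", earlier)
    print("  failure:", failure)
```

To run it on a real log, pass an open file object in place of `sample`; the lines printed under "before:" are where an earlier, more informative error would show up if there is one.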

As for the compatibility question: 4.2.6 will resolve the issue with
pbs_sched, and there is no intention to break compatibility with Maui.

At this point I'm not sure what the root issue is, or whether the problem
you're having is related to which scheduler you are using.


On Fri, Oct 11, 2013 at 10:22 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:

> Dear Torque experts
>
> I installed Torque 4.2.5 and Maui 3.3.1 in this cluster.
> For a few days it worked, but now I get jobs stalled in Q state
> that only run when forced by qrun.
>
> I get these syslog error messages on the server,
> repeated time and again:
>
> **************************************************************************
> Oct 11 04:19:24 master pbs_server: LOG_ERROR::Job not found (15086) in
> svr_dequejob, Job has no queue
> Oct 11 04:34:20 master pbs_server: LOG_ERROR::Time out (15085) in
> send_job_work, child failed in previous commit request for job 219.master
> Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out (15085) in
> send_job_work, child failed in previous commit request for job 228.master
>
> ...
>
> Oct 11 05:31:07 master pbs_server: LOG_ERROR::Batch protocol error
> (15033) in send_job_work, child failed in previous commit request for
> job 219.master
> Oct 11 05:53:07 master pbs_server: LOG_ERROR::Batch protocol error
> (15033) in send_job_work, child failed in previous commit request for
> job 228.master
> ...
> **************************************************************************
>
> And here are the jobs forever in Q state:
>
> qstat 219 228
> Job ID                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 219.master                 GC.Base.1981.01  ltmurray               0 Q production
> 228.master                 g1ms290_lg_1     sw2526                 0 Q production
>
> ************
>
> I already restarted pbs_mom and trqauthd on the nodes,
> restarted pbs_server, trqauthd and maui on the server,
> repeated the routine many times and nothing seems to help.
> I even rebooted the nodes, to no avail.
>
> At this point the machine is already in production, so
> playing hard ball this way with the nodes is a real pain
> for me and for the users and their jobs.
>
> Questions:
>
> 1) What is wrong?
>
> 2) Should I downgrade to the old (hopefully reliable) Torque 2.5.X?
>
> 3) We know that Torque 4.X.Y currently doesn't work with pbs_sched.
> Does it work with Maui at least?
> Or only with Moab these days?
>
> Thank you,
> Gus Correa



-- 
David Beer | Senior Software Engineer
Adaptive Computing