[torqueusers] Jobs in Q state forever (Torque 4.2.5, Maui 3.3.1)

Gus Correa gus at ldeo.columbia.edu
Fri Oct 11 17:10:38 MDT 2013


Thank you, David.

No, I am not moving jobs to another server.
We have two other clusters running Torque 2.4.11 and Maui,
but they are separate.

I think I found the reason for most of this trouble.
To my surprise, two nodes were each running three pbs_mom daemons.
I don't know how this odd situation came about,
probably during my attempts to fix things while the cluster was in operation.
It was totally unintended, of course (i.e. they are not multi-mom nodes).
However, it seems to have left the server very confused.
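
For future reference, a quick way to spot stray pbs_mom daemons on a
node is something like the following (the sample output is only an
illustration, reusing the PID and install prefix from the momctl
output further down):

   # pgrep -fl pbs_mom
   2528 /opt/torque/active/sbin/pbs_mom

A healthy node should show a single pbs_mom process; the two broken
nodes each showed three.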

I rebooted the two nodes (a hard reboot was needed).
After that, my test jobs run instead of getting stuck in the Q state.

However, the server keeps a sticky record of a zombie job
on one of those nodes, and it doesn't want to go away.
The job is not even in the queue anymore;
I purged it with qdel.
momctl doesn't show any job on that node (see below),
yet the server continues to list it in that node's record
in the output of pbsnodes.
See below, please.

I have put that node offline for now.
I tried to clean up the sticky job with
qdel -p and qdel -c, to no avail.
I rebooted the node, tried pbsnodes -r node34, etc.; nothing worked.
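
For the record, the cleanup attempts looked roughly like this, run as
root (229 is the stale job id and node34 the affected node, both as
shown in the output further down):

   # qdel -p 229              # force-purge the job record on the server
   # pbsnodes -r node34       # try to clear/reset the node's state
   # momctl -h node34 -d 3    # ask the mom itself what it is running

None of this removed the 229.master entries from the pbsnodes output.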

I am about to remove the node from the nodes file,
restart the server, then re-insert the node into the nodes file
and restart the server again, as a brute-force attempt to
make the server "forget" about that sticky job.
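
Concretely, the plan is something along these lines (I am assuming the
server's nodes file lives under /opt/torque/active/server_priv/,
mirroring the mom_priv path shown below, and that pbs_server can be
stopped with qterm and started again by hand; adjust as needed):

   # qterm -t quick                             # stop pbs_server
   # vi /opt/torque/active/server_priv/nodes    # delete the node34 line
   # pbs_server                                 # restart the server
   # pbsnodes -a | grep node34                  # confirm node34 is gone
   # qterm -t quick
   # vi /opt/torque/active/server_priv/nodes    # put the node34 line back
   # pbs_server                                 # restart once more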

Is there a simple/better way to get rid of that sticky job?

I enclose below how the server reports the node, along with related output.

Thank you for your help,
Gus Correa

*********************************************************
# pbsnodes node34
node34
      state = offline
      np = 32
      properties = MHz2300,prod
      ntype = cluster
      jobs = 
0/229.master,1/229.master,2/229.master,3/229.master,4/229.master,5/229.master,6/229.master,7/229.master,8/229.master,9/229.master,10/229.master,11/229.master,12/229.master,13/229.master,14/229.master,15/229.master,16/229.master,17/229.master,18/229.master,19/229.master,20/229.master,21/229.master,22/229.master,23/229.master,24/229.master,25/229.master,26/229.master,27/229.master,28/229.master,29/229.master,30/229.master,31/229.master
      status = 
rectime=1381531868,varattr=,jobs=,state=free,netload=1523770,gres=,loadave=0.04,ncpus=32,physmem=132137996kb,availmem=146532668kb,totmem=147513348kb,idletime=5446,nusers=0,nsessions=0,uname=Linux 
node34 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC 2013 
x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003

************************************************

[root at node34 ~]# /opt/torque/active/sbin/momctl -d 3

Host: node34/node34   Version: 4.2.5   PID: 2528
Server[0]: master (10.10.1.100:15001)
   Last Msg From Server:   6409 seconds (CLUSTER_ADDRS)
   Last Msg To Server:     6439 seconds
HomeDirectory:          /opt/torque/active/mom_priv
stdout/stderr spool directory: '/opt/torque/active/spool/' 
(3092039blocks available)
NOTE:  syslog enabled
MOM active:             6409 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    TCP
MemLocked:              TRUE  (mlock)
TCP Timeout:            60 seconds
Prolog:                 /opt/torque/active/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List: 
10.10.1.1:15003,10.10.1.2:15003,10.10.1.3:15003,10.10.1.4:15003,10.10.1.5:15003,10.10.1.6:15003,10.10.1.7:15003,10.10.1.8:15003,10.10.1.9:15003,10.10.1.10:15003,10.10.1.11:15003,10.10.1.12:15003,10.10.1.13:15003,10.10.1.14:15003,10.10.1.15:15003,10.10.1.16:15003,10.10.1.17:15003,10.10.1.18:15003,10.10.1.19:15003,10.10.1.20:15003,10.10.1.21:15003,10.10.1.22:15003,10.10.1.23:15003,10.10.1.24:15003,10.10.1.25:15003,10.10.1.26:15003,10.10.1.27:15003,10.10.1.28:15003,10.10.1.29:15003,10.10.1.30:15003,10.10.1.31:15003,10.10.1.32:15003,10.10.1.33:15003,10.10.1.34:0,10.10.1.34:15003,10.10.1.100:0,127.0.0.1:0: 
  0
Copy Command:           /usr/bin/scp -rpB
NOTE:  no local jobs detected

diagnostics complete


*****************************************

# qstat 229
qstat: Unknown Job Id Error 229.master

**********************************************

On 10/11/2013 01:41 PM, David Beer wrote:
> Gus,
>
> That is a really strange situation.
>
> The error
>
> Oct 11 04:19:24 master pbs_server: LOG_ERROR::Job not found (15086) in
> svr_dequejob, Job has no queue
>
> can't happen around running a job. This is related to a job getting
> routed or moved to a remote server. Are you doing this? Can you provide
> a sequence of events that lead to this error?
>
> The other errors:
> Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out (15085) in
> send_job_work, child failed in previous commit request for job 228.master
>
> can happen during any type of job move: running a job, routing it, or
> moving it to a remote server. However, in most cases there should be an
> error message before this that provides more information about what the
> failure was. Have you looked through the entire log file around these
> messages to try to find the root cause of the problem?
>
> As far as the question about compatibility - 4.2.6 will resolve the
> issue with pbs_sched and there is no intention to break compatibility
> with Maui.
>
> I'm not sure if the problem you're having is related to what kind of
> scheduler you are using or what the root issue is at this point.
>

I also don't know whether Maui plays any role in this;
I was just afraid it might.
Currently Maui has the standard boilerplate configuration;
I only added the maui user to the ADMIN1 line.

I just ran an interactive job as a regular user.
The job appeared in R state in qstat,
but I never received the prompt back from the node
until I forced it to run with qrun (as root, of course).
When I finished the job by logging out of the node,
I got two pairs of identical emails from Torque, each
duplicate carrying the same job number (229).
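
In case it helps to reproduce, the sequence was essentially the
following (the -q production part is just a guess based on the queue
shown in the qstat output above; 229 is the job id reported in the
emails):

   $ qsub -I -q production    # as a regular user; the session just hangs
   # qrun 229                 # as root; only then does the shell appear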

No, there is no duplicate pbs_server running, only one;
ps shows that.
So something is really wedged.
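
(For completeness, the check was simply something like

   # pgrep -fl pbs_server

which shows a single pbs_server process on the head node.)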

If there is any additional diagnostic information I can
provide, please let me know; I'll be happy to send it.

Thank you,
Gus


>
> On Fri, Oct 11, 2013 at 10:22 AM, Gus Correa <gus at ldeo.columbia.edu
> <mailto:gus at ldeo.columbia.edu>> wrote:
>
>     Dear Torque experts
>
>     I installed Torque 4.2.5 and Maui 3.3.1 in this cluster.
>     For a few days it worked, but now I get jobs stalled in Q state
>     that only run when forced by qrun.
>
>     I get these syslog error messages on the server,
>     repeated time and again:
>
>     **************************************************************************
>     Oct 11 04:19:24 master pbs_server: LOG_ERROR::Job not found (15086) in
>     svr_dequejob, Job has no queue
>     Oct 11 04:34:20 master pbs_server: LOG_ERROR::Time out (15085) in
>     send_job_work, child failed in previous commit request for job
>     219.master
>     Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out (15085) in
>     send_job_work, child failed in previous commit request for job
>     228.master
>
>     ...
>
>     Oct 11 05:31:07 master pbs_server: LOG_ERROR::Batch protocol error
>     (15033) in send_job_work, child failed in previous commit request for
>     job 219.master
>     Oct 11 05:53:07 master pbs_server: LOG_ERROR::Batch protocol error
>     (15033) in send_job_work, child failed in previous commit request for
>     job 228.master
>     ...
>     **************************************************************************
>
>     And here are the jobs forever in Q state:
>
>     qstat 219 228
>     Job ID                    Name             User            Time Use
>     S Queue
>     ------------------------- ---------------- --------------- --------
>     - -----
>     219.master                 GC.Base.1981.01  ltmurray               0 Q
>     production
>     228.master                 g1ms290_lg_1     sw2526                 0 Q
>     production
>
>     ************
>
>     I already restarted pbs_mom and trqauthd on the nodes,
>     restarted pbs_server, trqauthd and maui on the server,
>     repeated the routine many times and nothing seems to help.
>     I even rebooted the nodes, to no avail.
>
>     At this point the machine is already in production, so
>     playing hard ball this way with the nodes is a real pain
>     for me and for the users and their jobs.
>
>     Questions:
>
>     1) What is wrong?
>
>     2) Should I downgrade to the old (hopefully reliable) Torque 2.5.X?
>
>     3) We know that Torque 4.X.Y currently doesn't work with pbs_sched.
>     Does it work with Maui at least?
>     Or only with Moab these days?
>
>     Thank you,
>     Gus Correa
>     _______________________________________________
>     torqueusers mailing list
>     torqueusers at supercluster.org <mailto:torqueusers at supercluster.org>
>     http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


