[torqueusers] Jobs in Q state forever (Torque 4.2.5, Maui 3.3.1)

David Beer dbeer at adaptivecomputing.com
Mon Oct 14 09:49:35 MDT 2013


Gus,

I would try to qterm the server and then restart it, without editing the
nodes file, to see whether that clears it. My guess is that it will. It
would also be interesting to see the tracejob output for this stuck job.
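
Something along these lines (a rough sketch; the restart command varies by
installation, and the job ID is just the stuck job from the output below):

    qterm -t quick        # stop pbs_server cleanly; -t quick leaves running jobs alone
    pbs_server            # start the server again (or use your init script)
    tracejob 229.master   # gather server/MOM/accounting log entries for the job

(tracejob only searches the most recent day of logs by default; -n <days>
reaches further back.)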

David


On Fri, Oct 11, 2013 at 5:10 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:

> Thank you, David
>
> No, I am not moving jobs to another server.
> We have two other clusters running Torque 2.4.11 and Maui,
> but they are separate.
>
> I think I found the reason for most of this trouble.
> To my surprise, two nodes were running triplicate pbs_mom daemons.
> I don't know how this funny situation came to be,
> probably during my attempts to fix it while in operation.
> This was totally unintended, of course (i.e. they're not multi-mom nodes).
> However, it seems to have made the server very confused.
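>
> For the record, a quick way to spot that (just a sketch; any process
> lister will do):
>
>    pgrep -l -x pbs_mom         # more than one PID on a single-mom node is trouble
>    ps -ef | grep [p]bs_mom     # the same check via ps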
>
> I rebooted the two nodes (hard reboot was needed).
> After that my test jobs are running, not stuck in Q state.
>
> However, the server has a sticky record of a zombie
> job on one of those nodes that doesn't want to go away.
> The job is not even in the queue anymore.
> I purged it with qdel.
> Momctl doesn't show any job on that node (see below).
> However, the server continues to show it in that node's record,
> in the output of pbsnodes.
> See below, please.
>
> I put that node offline for now.
> I tried to clean up that sticky job with
> qdel -p and qdel -c, to no avail.
> I rebooted the node, tried pbsnodes -r node34, etc.; nothing worked.
>
> I am about to remove the node from the nodes file,
> restart the server, then insert the node in the nodes file again,
> and restart the server once more, as a brute-force attempt to
> make the server "forget" about that sticky job (sketched below).
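>
> Roughly this sequence (a sketch only; the server_priv path follows the
> /opt/torque/active layout shown below and may differ on other installs):
>
>    qterm -t quick                             # stop the server
>    vi /opt/torque/active/server_priv/nodes    # delete the node34 line
>    pbs_server                                 # restart without node34
>    qterm -t quick                             # stop it again
>    vi /opt/torque/active/server_priv/nodes    # put the node34 line back
>    pbs_server                                 # restart with node34 restored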
>
> Is there a simple/better way to get rid of that sticky job?
>
> I enclose below how the server shows the node, etc.
>
> Thank you for your help,
> Gus Correa
>
> *********************************************************
> # pbsnodes node34
> node34
>       state = offline
>       np = 32
>       properties = MHz2300,prod
>       ntype = cluster
>       jobs = 0/229.master,1/229.master,2/229.master,3/229.master,4/229.master,5/229.master,6/229.master,7/229.master,8/229.master,9/229.master,10/229.master,11/229.master,12/229.master,13/229.master,14/229.master,15/229.master,16/229.master,17/229.master,18/229.master,19/229.master,20/229.master,21/229.master,22/229.master,23/229.master,24/229.master,25/229.master,26/229.master,27/229.master,28/229.master,29/229.master,30/229.master,31/229.master
>       status = rectime=1381531868,varattr=,jobs=,state=free,netload=1523770,gres=,loadave=0.04,ncpus=32,physmem=132137996kb,availmem=146532668kb,totmem=147513348kb,idletime=5446,nusers=0,nsessions=0,uname=Linux node34 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC 2013 x86_64,opsys=linux
>       mom_service_port = 15002
>       mom_manager_port = 15003
>
> ************************************************
>
> [root at node34 ~]# /opt/torque/active/sbin/momctl -d 3
>
> Host: node34/node34   Version: 4.2.5   PID: 2528
> Server[0]: master (10.10.1.100:15001)
>    Last Msg From Server:   6409 seconds (CLUSTER_ADDRS)
>    Last Msg To Server:     6439 seconds
> HomeDirectory:          /opt/torque/active/mom_priv
> stdout/stderr spool directory: '/opt/torque/active/spool/' (3092039 blocks available)
> NOTE:  syslog enabled
> MOM active:             6409 seconds
> Check Poll Time:        45 seconds
> Server Update Interval: 45 seconds
> LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model:    TCP
> MemLocked:              TRUE  (mlock)
> TCP Timeout:            60 seconds
> Prolog:                 /opt/torque/active/mom_priv/prologue (disabled)
> Alarm Time:             0 of 10 seconds
> Trusted Client List:
> 10.10.1.1:15003,10.10.1.2:15003,10.10.1.3:15003,10.10.1.4:15003,
> 10.10.1.5:15003,10.10.1.6:15003,10.10.1.7:15003,10.10.1.8:15003,
> 10.10.1.9:15003,10.10.1.10:15003,10.10.1.11:15003,10.10.1.12:15003,
> 10.10.1.13:15003,10.10.1.14:15003,10.10.1.15:15003,10.10.1.16:15003,
> 10.10.1.17:15003,10.10.1.18:15003,10.10.1.19:15003,10.10.1.20:15003,
> 10.10.1.21:15003,10.10.1.22:15003,10.10.1.23:15003,10.10.1.24:15003,
> 10.10.1.25:15003,10.10.1.26:15003,10.10.1.27:15003,10.10.1.28:15003,
> 10.10.1.29:15003,10.10.1.30:15003,10.10.1.31:15003,10.10.1.32:15003,
> 10.10.1.33:15003,10.10.1.34:0,10.10.1.34:15003,10.10.1.100:0,127.0.0.1:0:
>   0
> Copy Command:           /usr/bin/scp -rpB
> NOTE:  no local jobs detected
>
> diagnostics complete
>
>
> *****************************************
>
> # qstat 229
> qstat: Unknown Job Id Error 229.master
>
> **********************************************
>
> On 10/11/2013 01:41 PM, David Beer wrote:
> > Gus,
> >
> > That is a really strange situation.
> >
> > The error
> >
> > Oct 11 04:19:24 master pbs_server: LOG_ERROR::Job not found (15086) in
> > svr_dequejob, Job has no queue
> >
> > does not occur in the course of simply running a job. It is related to a
> > job getting routed or moved to a remote server. Are you doing this? Can
> > you provide the sequence of events that led to this error?
> >
> > The other errors:
> > Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out (15085) in
> > send_job_work, child failed in previous commit request for job 228.master
> >
> > can happen during any type of job move: running a job, routing it, or
> > moving it to a remote server. However, in most cases there should be an
> > error message before this that provides more information about what the
> > failure was. Have you looked through the entire log file around these
> > messages to try to find the root cause of the problem?
> >
> > As far as the question about compatibility goes: 4.2.6 will resolve the
> > issue with pbs_sched, and there is no intention to break compatibility
> > with Maui.
> >
> > I'm not sure if the problem you're having is related to what kind of
> > scheduler you are using or what the root issue is at this point.
> >
>
> I also don't know if Maui plays any role in this.
> I was just afraid it might.
> Currently Maui has the standard boilerplate configuration;
> I only added the maui user to the ADMIN1 line.
>
> I just ran an interactive job as a regular user.
> The job appeared in R state on qstat,
> but I never received the prompt back from the node
> until I forced it to run with qrun (as root, of course).
> When I finished the job, logging out of the node,
> I got two pairs of identical emails from Torque, each
> duplicate carrying the same job number (229).
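>
> (For the record, the forced run was simply the following; the -H form,
> which names the execution host, is optional:
>
>    qrun 229.master
>    qrun -H node34 229.master
>
> qrun hands the job straight to the MOM, bypassing the scheduler, which
> is why it needs root/manager privileges.)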
>
> No, no, there is no duplicate pbs_server running, only one;
> ps shows that.
> So, something is really wedged.
>
> If there is any additional diagnostic information that I can
> provide, please let me know. I'll be happy to send it.
>
> Thank you,
> Gus
>
>
> >
> > On Fri, Oct 11, 2013 at 10:22 AM, Gus Correa <gus at ldeo.columbia.edu
> > <mailto:gus at ldeo.columbia.edu>> wrote:
> >
> >     Dear Torque experts
> >
> >     I installed Torque 4.2.5 and Maui 3.3.1 in this cluster.
> >     For a few days it worked, but now I get jobs stalled in Q state
> >     that only run when forced by qrun.
> >
> >     I get these syslog error messages on the server,
> >     repeated time and again:
> >
> >     **************************************************************************
> >     Oct 11 04:19:24 master pbs_server: LOG_ERROR::Job not found (15086) in
> >     svr_dequejob, Job has no queue
> >     Oct 11 04:34:20 master pbs_server: LOG_ERROR::Time out (15085) in
> >     send_job_work, child failed in previous commit request for job
> >     219.master
> >     Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out (15085) in
> >     send_job_work, child failed in previous commit request for job
> >     228.master
> >
> >     ...
> >
> >     Oct 11 05:31:07 master pbs_server: LOG_ERROR::Batch protocol error
> >     (15033) in send_job_work, child failed in previous commit request for
> >     job 219.master
> >     Oct 11 05:53:07 master pbs_server: LOG_ERROR::Batch protocol error
> >     (15033) in send_job_work, child failed in previous commit request for
> >     job 228.master
> >     ...
> >     **************************************************************************
> >
> >     And here are the jobs forever in Q state:
> >
> >     qstat 219 228
> >     Job ID                    Name             User            Time Use S Queue
> >     ------------------------- ---------------- --------------- -------- - -----
> >     219.master                 GC.Base.1981.01  ltmurray               0 Q production
> >     228.master                 g1ms290_lg_1     sw2526                 0 Q production
> >
> >     ************
> >
> >     I already restarted pbs_mom and trqauthd on the nodes,
> >     restarted pbs_server, trqauthd, and maui on the server,
> >     and repeated the routine many times; nothing seems to help.
> >     I even rebooted the nodes, to no avail.
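> >
> >     The routine, for reference (a sketch; the init-script names are an
> >     assumption and vary by installation):
> >
> >     # on each compute node
> >     service pbs_mom restart; service trqauthd restart
> >     # on the head node
> >     service pbs_server restart; service trqauthd restart; service maui restart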
> >
> >     At this point the machine is already in production, so
> >     playing hardball this way with the nodes is a real pain
> >     for me and for the users and their jobs.
> >
> >     Questions:
> >
> >     1) What is wrong?
> >
> >     2) Should I downgrade to the old (hopefully reliable) Torque 2.5.X?
> >
> >     3) We know that Torque 4.X.Y currently doesn't work with pbs_sched.
> >     Does it work with Maui at least?
> >     Or only with Moab these days?
> >
> >     Thank you,
> >     Gus Correa
> >     _______________________________________________
> >     torqueusers mailing list
> >     torqueusers at supercluster.org <mailto:torqueusers at supercluster.org>
> >     http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
> >
> >
> > --
> > David Beer | Senior Software Engineer
> > Adaptive Computing
> >
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>



-- 
David Beer | Senior Software Engineer
Adaptive Computing