[torqueusers] Jobs in Q state forever (Torque 4.2.5, Maui 3.3.1)

Gus Correa gus at ldeo.columbia.edu
Tue Oct 15 13:18:10 MDT 2013


Thank you, David!

The tracejob output for job 229 is enclosed.
However, what I found afterwards may be
more relevant.

The jobs were stuck in Q state again,
for the same reason as before: one node is once more
running three pbs_mom daemons.
See:

[root at master ~]# ssh node33 'service pbs_mom status'
pbs_mom (pid 12971 12969 2569) is running...

***

Strangely, two of those daemons are owned by a regular user.
Moreover, the PPID of the rogue pbs_mom daemons is that of the
legitimate (root-owned) daemon.
See:

[root at node33 ~]# ps -ef |grep pbs_mom
root      2569     1  0 Oct11 ?        00:14:42 
/opt/torque/active/sbin/pbs_mom -q -d /opt/torque/active
ltmurray 12969  2569  0 Oct14 ?        00:00:00 
/opt/torque/active/sbin/pbs_mom -q -d /opt/torque/active
ltmurray 12971  2569  0 Oct14 ?        00:00:00 
/opt/torque/active/sbin/pbs_mom -q -d /opt/torque/active
root     13206 13017  0 13:56 pts/0    00:00:00 grep pbs_mom

Note also the "-q" flag, which I didn't expect.

***

This user is submitting jobs with dependencies (qsub -W),
in case that matters.
His job scripts look legitimate, at first sight at least.
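
(For context, by "dependencies" I mean the usual qsub dependency
syntax, along the lines of the hypothetical example below; the
script names are made up.)

*************************
# Hypothetical example of a dependency chain (script names invented):
# stage2.sh is held until stage1.sh finishes with exit status 0.
FIRST=$(qsub stage1.sh)
qsub -W depend=afterok:${FIRST} stage2.sh
**************************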

***

Here are my guesses for possible causes of
multiple pbs_mom daemons.
However, you may have a better insight, of course:

1) Permissions:

Permissions in $TORQUE/sbin are 755 (including pbs_mom).
Should I remove execute permissions for regular users
(754, 750, 700 ?), or would this break something else in Torque?
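
(If tightening is the way to go, the change I have in mind is
something like the sketch below; not applied yet, in case Torque
needs world execute permission there for some reason I'm missing.)

*************************
# Sketch only (not applied): restrict the pbs_mom binary to root,
# leaving the rest of $TORQUE/sbin untouched.
chmod 700 /opt/torque/active/sbin/pbs_mom
ls -l /opt/torque/active/sbin/pbs_mom   # expect: -rwx------ root root
*************************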

2) The init.d/pbs_mom script:

My init.d/pbs_mom script (Red Hat/CentOS style) was copied
and edited from the Torque 4.2.5 "contrib/init.d/pbs_mom.in".
It has these (original) lines:

*************************
if [ -z "$previous" ];then
    # being run manually, don't disturb jobs
    args="$args -p"
else
    args="$args -q"
fi
**************************

What does the  "$previous" variable stand for?
There are NO further references to "$previous"
inside the init/pbs_mom script, so apparently it is undefined.
Note that the variable "args" is not initialized either.

In addition, my pbs_mom daemons end up running with the "-q" switch,
which is not what I expected to happen.
According to the pbs_mom man page,
the default after Torque version 2.4.0 is "-p".

Is something amiss, or is the man page wrong?
Is the contrib/init.d/pbs_mom.in script buggy perhaps?

On the other hand,
in an older cluster (Torque 2.4.11) I had something different
(and working correctly):

****************************
args=""
if [ -z "$previous" ];then
    # being run manually, don't disturb jobs
    args="-p"
fi
***********************************

Note that here "args" is initialized,
and "-q" is not even in the script.

Of course, I could use the second form in the init.d/pbs_mom script
to force pbs_mom to launch with "-p".
However, I wonder whether that would fix the problem of multiple
pbs_mom daemons.
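
(For the record, the simplest variant I'm considering would drop the
"$previous" test altogether and always preserve running jobs; a sketch,
not tested yet:)

****************************
# Sketch (untested): replace the if/else block from the 4.2.5 script
# with an unconditional default, so every start/restart uses -p.
args="-p"
# The rest of the script stays as-is; "$args" is presumably still
# appended to the pbs_mom command line further down.
****************************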

Thank you for your help,
Gus Correa



****** tracejob output *********************************


[root at master ~]# tracejob -n 7 229
/opt/torque/4.2.5/gnu-4.4.7/server_priv/accounting/20131015: No such 
file or directory
/opt/torque/4.2.5/gnu-4.4.7/server_logs/20131015: No matching job 
records located
/opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131015: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131015: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/server_priv/accounting/20131014: No matching 
job records located
/opt/torque/4.2.5/gnu-4.4.7/server_logs/20131014: No matching job 
records located
/opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131014: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131014: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/server_priv/accounting/20131013: No matching 
job records located
/opt/torque/4.2.5/gnu-4.4.7/server_logs/20131013: No matching job 
records located
/opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131013: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131013: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/server_priv/accounting/20131012: No matching 
job records located
/opt/torque/4.2.5/gnu-4.4.7/server_logs/20131012: No matching job 
records located
/opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131012: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131012: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131011: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131011: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/server_priv/accounting/20131010: No matching 
job records located
/opt/torque/4.2.5/gnu-4.4.7/server_logs/20131010: No matching job 
records located
/opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131010: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131010: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/server_priv/accounting/20131009: No matching 
job records located
/opt/torque/4.2.5/gnu-4.4.7/server_logs/20131009: No matching job 
records located
/opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131009: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131009: No such file or directory

Job: 229.master

10/11/2013 12:40:40  S    enqueuing into production, state 1 hop 1
10/11/2013 12:40:40  A    queue=production
10/11/2013 12:40:41  S    Job Run at request of maui at master
10/11/2013 12:41:36  S    Job Run at request of root at master
10/11/2013 12:41:37  A    user=gus group=gus jobname=STDIN 
queue=production ctime=1381509640 qtime=1381509640 etime=1381509640 
start=1381509697 owner=gus at master
 
exec_host=node01/0+node01/1+node01/2+node01/3+node01/4+node01/5+node01/6+node01/7+node01/8+node01/9+node01/10+node01/11+node01/12+node01/13+node01/14+node01/15+node01/16+node01/17+node01/18+node01/19+node01/20+node01/21+node01/22+node01/23+node01/24+node01/25+node01/26+node01/27+node01/28+node01/29+node01/30+node01/31
                           Resource_List.neednodes=1:ppn=32 
Resource_List.nodect=1 Resource_List.nodes=1:ppn=32 
Resource_List.walltime=12:00:00
10/11/2013 12:41:50  S    Exit_status=265 resources_used.cput=00:00:00 
resources_used.mem=1908kb resources_used.vmem=112864kb 
resources_used.walltime=00:00:13 Error_Path=/dev/pts/0 
Output_Path=/dev/pts/0
10/11/2013 12:41:50  S    on_job_exit valid pjob: 229.master (substate=50)
10/11/2013 12:41:50  A    user=gus group=gus jobname=STDIN 
queue=production ctime=1381509640 qtime=1381509640 etime=1381509640 
start=1381509697 owner=gus at master
 
exec_host=node01/0+node01/1+node01/2+node01/3+node01/4+node01/5+node01/6+node01/7+node01/8+node01/9+node01/10+node01/11+node01/12+node01/13+node01/14+node01/15+node01/16+node01/17+node01/18+node01/19+node01/20+node01/21+node01/22+node01/23+node01/24+node01/25+node01/26+node01/27+node01/28+node01/29+node01/30+node01/31
                           Resource_List.neednodes=1:ppn=32 
Resource_List.nodect=1 Resource_List.nodes=1:ppn=32 
Resource_List.walltime=12:00:00 session=4700 end=1381509710 
Exit_status=265 resources_used.cput=00:00:00 resources_used.mem=1908kb 
resources_used.vmem=112864kb
                           resources_used.walltime=00:00:13 
Error_Path=/dev/pts/0 Output_Path=/dev/pts/0
10/11/2013 12:42:23  S    send of job to node34 failed error = 15033
10/11/2013 12:42:23  S    unable to run job, MOM rejected/rc=-1
10/11/2013 12:42:23  S    unable to run job, send to MOM '168427810' failed
10/11/2013 12:42:24  S    Job Run at request of maui at master
10/11/2013 12:42:24  S    Exit_status=-1 resources_used.cput=00:00:00 
resources_used.mem=0kb resources_used.vmem=0kb 
resources_used.walltime=00:00:00 Error_Path=/dev/pts/0 
Output_Path=/dev/pts/0
10/11/2013 12:42:24  S    on_job_exit valid pjob: 229.master (substate=50)
10/11/2013 12:42:24  A    user=gus group=gus jobname=STDIN 
queue=production ctime=1381509640 qtime=1381509640 etime=1381509640 
start=1381509744 owner=gus at master
 
exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+node32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/14+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+node32/22+node32/23+node32/24+node32/25+node32/26+node32/27+node32/28+node32/29+node32/30+node32/31
                           Resource_List.neednodes=1:ppn=32 
Resource_List.nodect=1 Resource_List.nodes=1:ppn=32 
Resource_List.walltime=12:00:00
10/11/2013 12:42:24  A    user=gus group=gus jobname=STDIN 
queue=production ctime=1381509640 qtime=1381509640 etime=1381509640 
start=1381509744 owner=gus at master
 
exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+node32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/14+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+node32/22+node32/23+node32/24+node32/25+node32/26+node32/27+node32/28+node32/29+node32/30+node32/31
                           Resource_List.neednodes=1:ppn=32 
Resource_List.nodect=1 Resource_List.nodes=1:ppn=32 
Resource_List.walltime=12:00:00 session=0 end=1381509744 Exit_status=-1 
resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
                           resources_used.walltime=00:00:00 
Error_Path=/dev/pts/0 Output_Path=/dev/pts/0
10/11/2013 14:40:48  S    Request invalid for state of job COMPLETE
10/11/2013 14:41:26  S    purging job 229.master without checking MOM
10/11/2013 14:41:26  S    dequeuing from production, state COMPLETE
10/11/2013 14:44:48  S    Unknown Job Id Error
10/11/2013 15:03:54  S    Unknown Job Id Error

***************************************************************




On 10/14/2013 11:49 AM, David Beer wrote:
> Gus,
>
> I would try to qterm the server and then restart it without editing the
> nodes file to see if that clears it. My guess is it will. It might be
> interesting to see a tracejob output for this stuck job.
>
> David
>
>
> On Fri, Oct 11, 2013 at 5:10 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>
>     Thank you David
>
>     No, I am not moving jobs to another server.
>     We have two other clusters running Torque 2.4.11 and Maui
>     but they are separate.
>
>     I think I found the reason for most of this trouble.
>     To my surprise, two nodes were running triplicate pbs_mom daemons.
>     I don't know how this funny situation came to be,
>     probably during my attempts to fix-it-while-in-operation.
>     This was totally unintended of course (i.e. they're not multi-mom nodes).
>     However, this seems to have made the server veeery confused.
>
>     I rebooted the two nodes (hard reboot was needed).
>     After that my test jobs are running, not stuck in Q state.
>
>     However, the server has a sticky record of a zombie
>     job in one of those nodes that doesn't want to go away.
>     The job is not even in the queue anymore.
>     I purged it with qdel.
>     Momctl doesn't show any job on that node (see below).
>     However, the server continues to show it in that node record,
>     in the output of pbsnodes.
>     See below, please.
>
>     I put that node offline for now.
>     I tried to clean up that sticky job with
>     qdel -p and qdel -c to no avail.
>     I rebooted the node, tried pbsnodes -r node34, etc, nothing worked.
>
>     I am about to remove the node from the nodes file,
>     restart the server, then insert the node in the nodes file again,
>     and restart the server again, as a brute-force attempt to
>     make the server "forget" about that sticky job.
>
>     Is there a simple/better way to get rid of that sticky job?
>
>     I enclose below  how the server shows the node, etc.
>
>     Thank you for your help,
>     Gus Correa
>
>     *********************************************************
>     # pbsnodes node34
>     node34
>            state = offline
>            np = 32
>            properties = MHz2300,prod
>            ntype = cluster
>            jobs =
>     0/229.master,1/229.master,2/229.master,3/229.master,4/229.master,5/229.master,6/229.master,7/229.master,8/229.master,9/229.master,10/229.master,11/229.master,12/229.master,13/229.master,14/229.master,15/229.master,16/229.master,17/229.master,18/229.master,19/229.master,20/229.master,21/229.master,22/229.master,23/229.master,24/229.master,25/229.master,26/229.master,27/229.master,28/229.master,29/229.master,30/229.master,31/229.master
>            status =
>     rectime=1381531868,varattr=,jobs=,state=free,netload=1523770,gres=,loadave=0.04,ncpus=32,physmem=132137996kb,availmem=146532668kb,totmem=147513348kb,idletime=5446,nusers=0,nsessions=0,uname=Linux
>     node34 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC 2013
>     x86_64,opsys=linux
>            mom_service_port = 15002
>            mom_manager_port = 15003
>
>     ************************************************
>
>     [root at node34 ~]# /opt/torque/active/sbin/momctl -d 3
>
>     Host: node34/node34   Version: 4.2.5   PID: 2528
>     Server[0]: master (10.10.1.100:15001)
>         Last Msg From Server:   6409 seconds (CLUSTER_ADDRS)
>         Last Msg To Server:     6439 seconds
>     HomeDirectory:          /opt/torque/active/mom_priv
>     stdout/stderr spool directory: '/opt/torque/active/spool/'
>     (3092039blocks available)
>     NOTE:  syslog enabled
>     MOM active:             6409 seconds
>     Check Poll Time:        45 seconds
>     Server Update Interval: 45 seconds
>     LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
>     Communication Model:    TCP
>     MemLocked:              TRUE  (mlock)
>     TCP Timeout:            60 seconds
>     Prolog:                 /opt/torque/active/mom_priv/prologue (disabled)
>     Alarm Time:             0 of 10 seconds
>     Trusted Client List:
>     10.10.1.1:15003,10.10.1.2:15003,10.10.1.3:15003,10.10.1.4:15003,
>     10.10.1.5:15003,10.10.1.6:15003,10.10.1.7:15003,10.10.1.8:15003,
>     10.10.1.9:15003,10.10.1.10:15003,10.10.1.11:15003,10.10.1.12:15003,
>     10.10.1.13:15003,10.10.1.14:15003,10.10.1.15:15003,10.10.1.16:15003,
>     10.10.1.17:15003,10.10.1.18:15003,10.10.1.19:15003,10.10.1.20:15003,
>     10.10.1.21:15003,10.10.1.22:15003,10.10.1.23:15003,10.10.1.24:15003,
>     10.10.1.25:15003,10.10.1.26:15003,10.10.1.27:15003,10.10.1.28:15003,
>     10.10.1.29:15003,10.10.1.30:15003,10.10.1.31:15003,10.10.1.32:15003,
>     10.10.1.33:15003,10.10.1.34:0,10.10.1.34:15003,10.10.1.100:0,127.0.0.1:0:
>        0
>     Copy Command:           /usr/bin/scp -rpB
>     NOTE:  no local jobs detected
>
>     diagnostics complete
>
>
>     *****************************************
>
>     # qstat 229
>     qstat: Unknown Job Id Error 229.master
>
>     **********************************************
>
>     On 10/11/2013 01:41 PM, David Beer wrote:
>      > Gus,
>      >
>      > That is a really strange situation.
>      >
>      > The error
>      >
>      > Oct 11 04:19:24 master pbs_server: LOG_ERROR::Job not found
>     (15086) in
>      > svr_dequejob, Job has no queue
>      >
>      > can't happen around running a job. This is related to a job getting
>      > routed or moved to a remote server. Are you doing this? Can you
>     provide
>      > a sequence of events that lead to this error?
>      >
>      > The other errors:
>      > Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out (15085) in
>      > send_job_work, child failed in previous commit request for job
>     228.master
>      >
>      > can happen during any type of job move: running a job, routing it, or
>      > moving it to a remote server. However, in most cases there should
>     be an
>      > error message before this that provides more information about
>     what the
>      > failure was. Have you looked through the entire log file around these
>      > messages to try to find the root cause of the problem?
>      >
>      > As far as the question about compatibility - 4.2.6 will resolve the
>      > issue with pbs_sched and there is no intention to break compatibility
>      > with Maui.
>      >
>      > I'm not sure if the problem you're having is related to what kind of
>      > scheduler you are using or what the root issue is at this point.
>      >
>
>     I also don't know if Maui plays any role on this.
>     I was just afraid it might.
>     Currently Maui has the standard boilerplate configuration,
>     I only added the maui user to the ADMIN1 line.
>
>     I just ran an interactive job as a regular user.
>     The job appeared in R state on qstat,
>     but I never received the prompt back from the node,
>     until I forced it to run with qrun (as root, of course).
>     When I finished the job, logging out of the node,
>     I got two pairs of identical emails from Torque, each
>     duplicate numbered with the same job number (229).
>
>     No, no, there are no duplicate pbs_server running, only one,
>     ps shows that.
>     So, something is really wedged.
>
>     If there is any additional diagnostic information that I can
>     provide, please let me know.  I'll be happy to send.
>
>     Thank you,
>     Gus
>
>
>      >
>      > On Fri, Oct 11, 2013 at 10:22 AM, Gus Correa
>      > <gus at ldeo.columbia.edu> wrote:
>      >
>      >     Dear Torque experts
>      >
>      >     I installed Torque 4.2.5 and Maui 3.3.1 in this cluster.
>      >     For a few days it worked, but now I get jobs stalled in Q state
>      >     that only run when forced by qrun.
>      >
>      >     I get these syslog error messages on the server,
>      >     repeated time and again:
>      >
>      >
>     **************************************************************************
>      >     Oct 11 04:19:24 master pbs_server: LOG_ERROR::Job not found
>     (15086) in
>      >     svr_dequejob, Job has no queue
>      >     Oct 11 04:34:20 master pbs_server: LOG_ERROR::Time out (15085) in
>      >     send_job_work, child failed in previous commit request for job
>      >     219.master
>      >     Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out (15085) in
>      >     send_job_work, child failed in previous commit request for job
>      >     228.master
>      >
>      >     ...
>      >
>      >     Oct 11 05:31:07 master pbs_server: LOG_ERROR::Batch protocol
>     error
>      >     (15033) in send_job_work, child failed in previous commit
>     request for
>      >     job 219.master
>      >     Oct 11 05:53:07 master pbs_server: LOG_ERROR::Batch protocol
>     error
>      >     (15033) in send_job_work, child failed in previous commit
>     request for
>      >     job 228.master
>      >     ...
>      >
>     **************************************************************************
>      >
>      >     And here are the jobs forever in Q state:
>      >
>      >     qstat 219 228
>      >     Job ID                    Name             User
>       Time Use
>      >     S Queue
>      >     ------------------------- ---------------- ---------------
>     --------
>      >     - -----
>      >     219.master                 GC.Base.1981.01  ltmurray
>            0 Q
>      >     production
>      >     228.master                 g1ms290_lg_1     sw2526
>            0 Q
>      >     production
>      >
>      >     ************
>      >
>      >     I already restarted pbs_mom and trqauthd on the nodes,
>      >     restarted pbs_server, trquauthd and maui on the server,
>      >     repeated the routine many times and nothing seems to help.
>      >     I even rebooted the nodes, to no avail.
>      >
>      >     At this point the machine is already in production, so
>      >     playing hard ball this way with the nodes is a real pain
>      >     for me and for the users and their jobs.
>      >
>      >     Questions:
>      >
>      >     1) What is wrong?
>      >
>      >     2) Should I downgrade to the old (hopefully reliable) Torque
>     2.5.X?
>      >
>      >     3) We know that Torque 4.X.Y currently doesn't work with
>     pbs_sched.
>      >     Does it work with Maui at least?
>      >     Or only with Moab these days?
>      >
>      >     Thank you,
>      >     Gus Correa
>      >     _______________________________________________
>      >     torqueusers mailing list
>      > torqueusers at supercluster.org
>      > http://www.supercluster.org/mailman/listinfo/torqueusers
>      >
>      >
>      >
>      >
>      > --
>      > David Beer | Senior Software Engineer
>      > Adaptive Computing
>      >
>      >
>      > _______________________________________________
>      > torqueusers mailing list
>      > torqueusers at supercluster.org
>      > http://www.supercluster.org/mailman/listinfo/torqueusers
>
>     _______________________________________________
>     torqueusers mailing list
>     torqueusers at supercluster.org
>     http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


