[torqueusers] Jobs in Q state forever (Torque 4.2.5, Maui 3.3.1)

Gus Correa gus at ldeo.columbia.edu
Wed Oct 16 12:01:33 MDT 2013


Thank you, David

Please, see my comments inline below.

On 10/16/2013 12:58 PM, David Beer wrote:
> Gus,
>
> When there are multiple mom daemons but some are owned by the user, this
> is because there is a job on the node. pbs_mom forks to become the job
> and sets its privileges as the user. Before doing this, it stops
> listening on all ports and won't interfere with network activity for the
> main mom.
>

OK, but this should be a transient process, present only while
the job is being set up, right?
After the job reaches "steady state", the user-owned
forked pbs_mom finishes (or becomes the actual job executable),
right?

I can see on the active nodes here that
when jobs are running in "steady state",
only the main pbs_mom is at work.
Or am I missing something?

However, what I have been getting is jobs stuck in Q state,
along with not one but two user-owned forked pbs_mom processes.
Are two of them expected?
In addition, those user-owned pbs_mom processes also get stuck,
with time=00:00:00 and in D state, forever.
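
For reference, this is roughly how I have been spotting them on the
node (just a ps one-liner; the exact column layout may vary):

*********************************************
# list pbs_mom processes with owner, parent PID, state, and CPU time;
# the stuck children show up owned by the user, in D state, time 00:00:00
ps -eo user,pid,ppid,stat,time,args | grep '[p]bs_mom'
*********************************************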

Could this stuck state be related to the fact that the pbs_mom daemons
are being launched with the "-q" flag (which is presumably
obsolete) rather than the current default, "-p"?

It seems that the
init.d/pbs_mom script in the Torque contrib directory is
somehow letting the "-q" sneak in.
I will change that to "-p" on my side anyway.
However, it would be interesting
to know if the "-q" might be playing a role in this problem.
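
For what it's worth, the change I have in mind for my copy of the
contrib init.d/pbs_mom script is simply this (untested so far, just a
sketch):

*********************************************
# replace the "$previous" if/else with an explicit initialization,
# so pbs_mom is always (re)started with -p (leave running jobs alone)
# and never falls back to the apparently obsolete -q
args="-p"

# afterwards, verify which flags the running daemon actually got:
ps -C pbs_mom -o args=
*********************************************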



> As far as the privileges go, I don't believe that changing them will
> have any effect - disclaimer: I have never tried - because pbs_mom
> checks to verify that it is root before attempting to do anything of
> importance. I'd expect that these multiple daemons are simply forked
> jobs. I'm not sure if the same happened during your original problem,
> although I'd suspect it's different.
>

My concern is that a regular user,
whether inadvertently or with bad intent,
could launch pbs_mom daemon replicas
(or even pbs_server replicas)
and mess things up if the permissions in $TORQUE/sbin
stay at 755.
I may switch to 700 and see if it causes any harm.
The only problem is that the machine is in production,
and breaking things in production is not fun.
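
If/when I try it, the plan is something simple and reversible along
these lines (not tried yet; the paths are the ones on my nodes):

*********************************************
# record the current permissions, then restrict the daemons to root only
ls -l /opt/torque/active/sbin > /root/torque_sbin_perms.orig
chmod 700 /opt/torque/active/sbin/pbs_mom /opt/torque/active/sbin/pbs_server

# if anything breaks, revert to the packaged 755:
# chmod 755 /opt/torque/active/sbin/pbs_mom /opt/torque/active/sbin/pbs_server
*********************************************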

> Unfortunately, the tracejob output doesn't appear to shed any light on
> what exactly happened. I don't see anything really out of the ordinary
> there.
>
> David

For the time being things are working, after I restarted
node33, where the three pbs_mom replicas were stuck.
After that, the queued job eventually ran.

FYI, when the node chokes with those three pbs_mom replicas,
I cannot shut it down "softly", i.e. "shutdown -r now" and
"reboot" don't work.
I have to do a hard reboot.
This is awkward.
It is not clear what blocks the soft shutdown.
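Next time it happens, before the hard reboot I will try to see what
the D-state processes are actually waiting on, with something like:

*********************************************
# show the kernel wait channel of the stuck pbs_mom processes;
# D state usually means uninterruptible sleep (e.g. hung NFS or disk I/O)
ps -eo pid,stat,wchan:32,args | grep '[p]bs_mom'
*********************************************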

Anyway, I will also run qterm to try to clear that zombie job 229
on the server, as you suggested.
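I.e., on the server, something along these lines (if I remember the
qterm options correctly):

*********************************************
# stop pbs_server without killing running jobs, then bring it back up
qterm -t quick
service pbs_server start   # or however pbs_server is normally started here
*********************************************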

Thank you for your help,
Gus


>
>
> On Tue, Oct 15, 2013 at 1:18 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>
>     Thank you, David!
>
>     The tracejob output for job 229 is enclosed.
>     However, maybe there is more relevant
>     information in what I found later.
>
>     The jobs were stuck again in Q state.
>     The same reason as before: one node has triple pbs_mom
>     daemons running again.
>     See:
>
>     [root at master ~]# ssh node33 'service pbs_mom status'
>     pbs_mom (pid 12971 12969 2569) is running...
>
>     ***
>
>     Awkwardly, a regular user owns two of those daemons.
>     Moreover, the PPID of those rogue pbs_mom daemons is that of the
>     legitimate daemon.
>     See:
>
>     [root at node33 ~]# ps -ef |grep pbs_mom
>     root      2569     1  0 Oct11 ?        00:14:42
>     /opt/torque/active/sbin/pbs_mom -q -d /opt/torque/active
>     ltmurray 12969  2569  0 Oct14 ?        00:00:00
>     /opt/torque/active/sbin/pbs_mom -q -d /opt/torque/active
>     ltmurray 12971  2569  0 Oct14 ?        00:00:00
>     /opt/torque/active/sbin/pbs_mom -q -d /opt/torque/active
>     root     13206 13017  0 13:56 pts/0    00:00:00 grep pbs_mom
>
>     Note also the "-q" flag, which I didn't expect.
>
>     ***
>
>     This user is launching jobs with dependencies (-W),
>     in case this matters.
>     His job scripts look legit, at first sight at least.
>
>     ***
>
>     Here are my guesses for possible causes of
>     multiple pbs_mom daemons.
>     However, you may have a better insight, of course:
>
>     1) Permissions:
>
>     Permissions in $TORQUE/sbin are 755 (including pbs_mom).
>     Should I remove execute permissions for regular users
>     (754, 750, 700 ?), or would this break something else in Torque?
>
>     2) The init.d/pbs_mom script:
>
>     My init.d/pbs_mom script (Red Hat/CentOS style) was copied/edited
>     from the Torque 4.2.5 "contrib/pbs_mom.in".
>     It has these (original) lines:
>
>     *************************
>     if [ -z "$previous" ];then
>          # being run manually, don't disturb jobs
>          args="$args -p"
>     else
>          args="$args -q"
>     fi
>     **************************
>
>     What does the "$previous" variable stand for?
>     There are NO further references to "$previous"
>     inside the init.d/pbs_mom script, so apparently it is undefined.
>     Note that the variable "args" is not initialized either.
>
>     In addition, my pbs_mom daemons end up running with the "-q" switch,
>     which is not what I expected to happen.
>     According to the pbs_mom man page,
>     the default after Torque version 2.4.0 is "-p".
>
>     Is something amiss, or is the man page wrong?
>     Is the contrib/init.d/pbs_mom.in script buggy, perhaps?
>
>     On the other hand,
>     in an older cluster (Torque 2.4.11) I had something different
>     (and working correctly):
>
>     ****************************
>     args=""
>     if [ -z "$previous" ];then
>          # being run manually, don't disturb jobs
>          args="-p"
>     fi
>     ***********************************
>
>     Note that here "args" is initialized,
>     and "-q" is not even in the script.
>
>     Of course, I could use the second form in the init.d/pbs_mom script,
>     to force launching pbs_mom with "-p".
>     However, I wonder if it would fix the problem of multiple pbs_mom
>     daemons.
>
>     Thank you for your help,
>     Gus Correa
>
>
>
>     ****** tracejob output *********************************
>
>
>     [root at master ~]# tracejob -n 7 229
>     /opt/torque/4.2.5/gnu-4.4.7/server_priv/accounting/20131015: No such
>     file or directory
>     /opt/torque/4.2.5/gnu-4.4.7/server_logs/20131015: No matching job
>     records located
>     /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131015: No such file or directory
>     /opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131015: No such file or
>     directory
>     /opt/torque/4.2.5/gnu-4.4.7/server_priv/accounting/20131014: No matching
>     job records located
>     /opt/torque/4.2.5/gnu-4.4.7/server_logs/20131014: No matching job
>     records located
>     /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131014: No such file or directory
>     /opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131014: No such file or
>     directory
>     /opt/torque/4.2.5/gnu-4.4.7/server_priv/accounting/20131013: No matching
>     job records located
>     /opt/torque/4.2.5/gnu-4.4.7/server_logs/20131013: No matching job
>     records located
>     /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131013: No such file or directory
>     /opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131013: No such file or
>     directory
>     /opt/torque/4.2.5/gnu-4.4.7/server_priv/accounting/20131012: No matching
>     job records located
>     /opt/torque/4.2.5/gnu-4.4.7/server_logs/20131012: No matching job
>     records located
>     /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131012: No such file or directory
>     /opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131012: No such file or
>     directory
>     /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131011: No such file or directory
>     /opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131011: No such file or
>     directory
>     /opt/torque/4.2.5/gnu-4.4.7/server_priv/accounting/20131010: No matching
>     job records located
>     /opt/torque/4.2.5/gnu-4.4.7/server_logs/20131010: No matching job
>     records located
>     /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131010: No such file or directory
>     /opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131010: No such file or
>     directory
>     /opt/torque/4.2.5/gnu-4.4.7/server_priv/accounting/20131009: No matching
>     job records located
>     /opt/torque/4.2.5/gnu-4.4.7/server_logs/20131009: No matching job
>     records located
>     /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131009: No such file or directory
>     /opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131009: No such file or
>     directory
>
>     Job: 229.master
>
>     10/11/2013 12:40:40  S    enqueuing into production, state 1 hop 1
>     10/11/2013 12:40:40  A    queue=production
>     10/11/2013 12:40:41  S    Job Run at request of maui at master
>     10/11/2013 12:41:36  S    Job Run at request of root at master
>     10/11/2013 12:41:37  A    user=gus group=gus jobname=STDIN
>     queue=production ctime=1381509640 qtime=1381509640 etime=1381509640
>     start=1381509697 owner=gus at master
>
>     exec_host=node01/0+node01/1+node01/2+node01/3+node01/4+node01/5+node01/6+node01/7+node01/8+node01/9+node01/10+node01/11+node01/12+node01/13+node01/14+node01/15+node01/16+node01/17+node01/18+node01/19+node01/20+node01/21+node01/22+node01/23+node01/24+node01/25+node01/26+node01/27+node01/28+node01/29+node01/30+node01/31
>                                 Resource_List.neednodes=1:ppn=32
>     Resource_List.nodect=1 Resource_List.nodes=1:ppn=32
>     Resource_List.walltime=12:00:00
>     10/11/2013 12:41:50  S    Exit_status=265 resources_used.cput=00:00:00
>     resources_used.mem=1908kb resources_used.vmem=112864kb
>     resources_used.walltime=00:00:13 Error_Path=/dev/pts/0
>     Output_Path=/dev/pts/0
>     10/11/2013 12:41:50  S    on_job_exit valid pjob: 229.master
>     (substate=50)
>     10/11/2013 12:41:50  A    user=gus group=gus jobname=STDIN
>     queue=production ctime=1381509640 qtime=1381509640 etime=1381509640
>     start=1381509697 owner=gus at master
>
>     exec_host=node01/0+node01/1+node01/2+node01/3+node01/4+node01/5+node01/6+node01/7+node01/8+node01/9+node01/10+node01/11+node01/12+node01/13+node01/14+node01/15+node01/16+node01/17+node01/18+node01/19+node01/20+node01/21+node01/22+node01/23+node01/24+node01/25+node01/26+node01/27+node01/28+node01/29+node01/30+node01/31
>                                 Resource_List.neednodes=1:ppn=32
>     Resource_List.nodect=1 Resource_List.nodes=1:ppn=32
>     Resource_List.walltime=12:00:00 session=4700 end=1381509710
>     Exit_status=265 resources_used.cput=00:00:00 resources_used.mem=1908kb
>     resources_used.vmem=112864kb
>                                 resources_used.walltime=00:00:13
>     Error_Path=/dev/pts/0 Output_Path=/dev/pts/0
>     10/11/2013 12:42:23  S    send of job to node34 failed error = 15033
>     10/11/2013 12:42:23  S    unable to run job, MOM rejected/rc=-1
>     10/11/2013 12:42:23  S    unable to run job, send to MOM '168427810'
>     failed
>     10/11/2013 12:42:24  S    Job Run at request of maui at master
>     10/11/2013 12:42:24  S    Exit_status=-1 resources_used.cput=00:00:00
>     resources_used.mem=0kb resources_used.vmem=0kb
>     resources_used.walltime=00:00:00 Error_Path=/dev/pts/0
>     Output_Path=/dev/pts/0
>     10/11/2013 12:42:24  S    on_job_exit valid pjob: 229.master
>     (substate=50)
>     10/11/2013 12:42:24  A    user=gus group=gus jobname=STDIN
>     queue=production ctime=1381509640 qtime=1381509640 etime=1381509640
>     start=1381509744 owner=gus at master
>
>     exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+node32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/14+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+node32/22+node32/23+node32/24+node32/25+node32/26+node32/27+node32/28+node32/29+node32/30+node32/31
>                                 Resource_List.neednodes=1:ppn=32
>     Resource_List.nodect=1 Resource_List.nodes=1:ppn=32
>     Resource_List.walltime=12:00:00
>     10/11/2013 12:42:24  A    user=gus group=gus jobname=STDIN
>     queue=production ctime=1381509640 qtime=1381509640 etime=1381509640
>     start=1381509744 owner=gus at master
>
>     exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+node32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/14+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+node32/22+node32/23+node32/24+node32/25+node32/26+node32/27+node32/28+node32/29+node32/30+node32/31
>                                 Resource_List.neednodes=1:ppn=32
>     Resource_List.nodect=1 Resource_List.nodes=1:ppn=32
>     Resource_List.walltime=12:00:00 session=0 end=1381509744 Exit_status=-1
>     resources_used.cput=00:00:00 resources_used.mem=0kb
>     resources_used.vmem=0kb
>                                 resources_used.walltime=00:00:00
>     Error_Path=/dev/pts/0 Output_Path=/dev/pts/0
>     10/11/2013 14:40:48  S    Request invalid for state of job COMPLETE
>     10/11/2013 14:41:26  S    purging job 229.master without checking MOM
>     10/11/2013 14:41:26  S    dequeuing from production, state COMPLETE
>     10/11/2013 14:44:48  S    Unknown Job Id Error
>     10/11/2013 15:03:54  S    Unknown Job Id Error
>
>     ***************************************************************
>
>
>
>
>     On 10/14/2013 11:49 AM, David Beer wrote:
>      > Gus,
>      >
>      > I would try to qterm the server and then restart it without
>     editing the
>      > nodes file to see if that clears it. My guess is it will. It might be
>      > interesting to see a tracejob output for this stuck job.
>      >
>      > David
>      >
>      >
>      > On Fri, Oct 11, 2013 at 5:10 PM, Gus Correa
>      > <gus at ldeo.columbia.edu> wrote:
>      >
>      >     Thank you David
>      >
>      >     No, I am not moving jobs to another server.
>      >     We have two other clusters running Torque 2.4.11 and Maui
>      >     but they are separate.
>      >
>      >     I think I found the reason for most of this trouble.
>      >     To my surprise, two nodes were running triplicate pbs_mom
>     daemons.
>      >     I don't know how this funny situation came to be,
>      >     probably during my attempts to fix it while in operation.
>      >     This was totally unintended, of course (i.e., they're not
>      >     multi-mom nodes).
>      >     However, this seems to have made the server veeery confused.
>      >
>      >     I rebooted the two nodes (hard reboot was needed).
>      >     After that my test jobs are running, not stuck in Q state.
>      >
>      >     However, the server has a sticky record of a zombie
>      >     job in one of those nodes that doesn't want to go away.
>      >     The job is not even in the queue anymore.
>      >     I purged it with qdel.
>      >     Momctl doesn't show any job on that node (see below).
>      >     However, the server continues to show it in that node record,
>      >     in the output of pbsnodes.
>      >     See below, please.
>      >
>      >     I put that node offline for now.
>      >     I tried to clean up that sticky job with
>      >     qdel -p and qdel -c to no avail.
>      >     I rebooted the node, tried pbsnodes -r node34, etc, nothing
>     worked.
>      >
>      >     I am about to remove the node from the nodes file,
>      >     restart the server, then insert the node in the nodes file again,
>      >     and restart the server again, as a brute-force attempt to
>      >     make the server "forget" about that sticky job.
>      >
>      >     Is there a simple/better way to get rid of that sticky job?
>      >
>      >     I enclose below  how the server shows the node, etc.
>      >
>      >     Thank you for your help,
>      >     Gus Correa
>      >
>      >     *********************************************************
>      >     # pbsnodes node34
>      >     node34
>      >            state = offline
>      >            np = 32
>      >            properties = MHz2300,prod
>      >            ntype = cluster
>      >            jobs =
>      >
>     0/229.master,1/229.master,2/229.master,3/229.master,4/229.master,5/229.master,6/229.master,7/229.master,8/229.master,9/229.master,10/229.master,11/229.master,12/229.master,13/229.master,14/229.master,15/229.master,16/229.master,17/229.master,18/229.master,19/229.master,20/229.master,21/229.master,22/229.master,23/229.master,24/229.master,25/229.master,26/229.master,27/229.master,28/229.master,29/229.master,30/229.master,31/229.master
>      >            status =
>      >
>     rectime=1381531868,varattr=,jobs=,state=free,netload=1523770,gres=,loadave=0.04,ncpus=32,physmem=132137996kb,availmem=146532668kb,totmem=147513348kb,idletime=5446,nusers=0,nsessions=0,uname=Linux
>      >     node34 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49
>     UTC 2013
>      >     x86_64,opsys=linux
>      >            mom_service_port = 15002
>      >            mom_manager_port = 15003
>      >
>      >     ************************************************
>      >
>      >     [root at node34 ~]# /opt/torque/active/sbin/momctl -d 3
>      >
>      >     Host: node34/node34   Version: 4.2.5   PID: 2528
>      >     Server[0]: master (10.10.1.100:15001)
>      >         Last Msg From Server:   6409 seconds (CLUSTER_ADDRS)
>      >         Last Msg To Server:     6439 seconds
>      >     HomeDirectory:          /opt/torque/active/mom_priv
>      >     stdout/stderr spool directory: '/opt/torque/active/spool/'
>      >     (3092039 blocks available)
>      >     NOTE:  syslog enabled
>      >     MOM active:             6409 seconds
>      >     Check Poll Time:        45 seconds
>      >     Server Update Interval: 45 seconds
>      >     LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
>      >     Communication Model:    TCP
>      >     MemLocked:              TRUE  (mlock)
>      >     TCP Timeout:            60 seconds
>      >     Prolog:                 /opt/torque/active/mom_priv/prologue
>     (disabled)
>      >     Alarm Time:             0 of 10 seconds
>      >     Trusted Client List:
>      >     10.10.1.1:15003,10.10.1.2:15003,10.10.1.3:15003,10.10.1.4:15003,
>      >     10.10.1.5:15003,10.10.1.6:15003,10.10.1.7:15003,10.10.1.8:15003,
>      >     10.10.1.9:15003,10.10.1.10:15003,10.10.1.11:15003,10.10.1.12:15003,
>      >     10.10.1.13:15003,10.10.1.14:15003,10.10.1.15:15003,10.10.1.16:15003,
>      >     10.10.1.17:15003,10.10.1.18:15003,10.10.1.19:15003,10.10.1.20:15003,
>      >     10.10.1.21:15003,10.10.1.22:15003,10.10.1.23:15003,10.10.1.24:15003,
>      >     10.10.1.25:15003,10.10.1.26:15003,10.10.1.27:15003,10.10.1.28:15003,
>      >     10.10.1.29:15003,10.10.1.30:15003,10.10.1.31:15003,10.10.1.32:15003,
>      >     10.10.1.33:15003,10.10.1.34:0,10.10.1.34:15003,10.10.1.100:0,
>      >     127.0.0.1:0:  0
>      >     Copy Command:           /usr/bin/scp -rpB
>      >     NOTE:  no local jobs detected
>      >
>      >     diagnostics complete
>      >
>      >
>      >     *****************************************
>      >
>      >     # qstat 229
>      >     qstat: Unknown Job Id Error 229.master
>      >
>      >     **********************************************
>      >
>      >     On 10/11/2013 01:41 PM, David Beer wrote:
>      > > Gus,
>      > >
>      > > That is a really strange situation.
>      > >
>      > > The error
>      > >
>      > > Oct 11 04:19:24 master pbs_server: LOG_ERROR::Job not found
>      >     (15086) in
>      > > svr_dequejob, Job has no queue
>      > >
>      > > can't happen around running a job. This is related to a job getting
>      > > routed or moved to a remote server. Are you doing this? Can you
>      >     provide
>      > > a sequence of events that lead to this error?
>      > >
>      > > The other errors:
>      > > Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out (15085) in
>      > > send_job_work, child failed in previous commit request for job
>      >     228.master
>      > >
>      > > can happen during any type of job move: running a job, routing
>     it, or
>      > > moving it to a remote server. However, in most cases there should
>      >     be an
>      > > error message before this that provides more information about
>      >     what the
>      > > failure was. Have you looked through the entire log file around
>     these
>      > > messages to try to find the root cause of the problem?
>      > >
>      > > As far as the question about compatibility - 4.2.6 will resolve the
>      > > issue with pbs_sched and there is no intention to break
>     compatibility
>      > > with Maui.
>      > >
>      > > I'm not sure if the problem you're having is related to what
>     kind of
>      > > scheduler you are using or what the root issue is at this point.
>      > >
>      >
>      >     I also don't know if Maui plays any role on this.
>      >     I was just afraid it might.
>      >     Currently Maui has the standard boilerplate configuration;
>      >     I only added the maui user to the ADMIN1 line.
>      >
>      >     I just ran an interactive job as a regular user.
>      >     The job appeared in R state on qstat,
>      >     but I never received the prompt back from the node,
>      >     until I forced it to run with qrun (as root, of course).
>      >     When I finished the job, logging out of the node,
>      >     I got two pairs of identical emails from Torque, each
>      >     duplicate numbered with the same job number (229).
>      >
>      >     No, no, there are no duplicate pbs_server running, only one,
>      >     ps shows that.
>      >     So, something is really wedged.
>      >
>      >     If there is any additional diagnostic information that I can
>      >     provide, please let me know.  I'll be happy to send.
>      >
>      >     Thank you,
>      >     Gus
>      >
>      >
>      > >
>      > > On Fri, Oct 11, 2013 at 10:22 AM, Gus Correa
>      > > <gus at ldeo.columbia.edu> wrote:
>      > >
>      > >     Dear Torque experts
>      > >
>      > >     I installed Torque 4.2.5 and Maui 3.3.1 in this cluster.
>      > >     For a few days it worked, but now I get jobs stalled in Q state
>      > >     that only run when forced by qrun.
>      > >
>      > >     I get these syslog error messages on the server,
>      > >     repeated time and again:
>      > >
>      > >
>      >
>     **************************************************************************
>      > >     Oct 11 04:19:24 master pbs_server: LOG_ERROR::Job not found
>      >     (15086) in
>      > >     svr_dequejob, Job has no queue
>      > >     Oct 11 04:34:20 master pbs_server: LOG_ERROR::Time out
>     (15085) in
>      > >     send_job_work, child failed in previous commit request for job
>      > >     219.master
>      > >     Oct 11 04:55:55 master pbs_server: LOG_ERROR::Time out
>     (15085) in
>      > >     send_job_work, child failed in previous commit request for job
>      > >     228.master
>      > >
>      > >     ...
>      > >
>      > >     Oct 11 05:31:07 master pbs_server: LOG_ERROR::Batch protocol
>      >     error
>      > >     (15033) in send_job_work, child failed in previous commit
>      >     request for
>      > >     job 219.master
>      > >     Oct 11 05:53:07 master pbs_server: LOG_ERROR::Batch protocol
>      >     error
>      > >     (15033) in send_job_work, child failed in previous commit
>      >     request for
>      > >     job 228.master
>      > >     ...
>      > >
>      >
>     **************************************************************************
>      > >
>      > >     And here are the jobs forever in Q state:
>      > >
>      > >     qstat 219 228
>      > >     Job ID                    Name             User            Time Use S Queue
>      > >     ------------------------- ---------------- --------------- -------- - -----
>      > >     219.master                GC.Base.1981.01  ltmurray               0 Q production
>      > >     228.master                g1ms290_lg_1     sw2526                 0 Q production
>      > >
>      > >     ************
>      > >
>      > >     I already restarted pbs_mom and trqauthd on the nodes,
>      > >     restarted pbs_server, trqauthd and maui on the server,
>      > >     repeated the routine many times and nothing seems to help.
>      > >     I even rebooted the nodes, to no avail.
>      > >
>      > >     At this point the machine is already in production, so
>      > >     playing hard ball this way with the nodes is a real pain
>      > >     for me and for the users and their jobs.
>      > >
>      > >     Questions:
>      > >
>      > >     1) What is wrong?
>      > >
>      > >     2) Should I downgrade to the old (hopefully reliable) Torque
>      >     2.5.X?
>      > >
>      > >     3) We know that Torque 4.X.Y currently doesn't work with
>      >     pbs_sched.
>      > >     Does it work with Maui at least?
>      > >     Or only with Moab these days?
>      > >
>      > >     Thank you,
>      > >     Gus Correa
>      > >     _______________________________________________
>      > >     torqueusers mailing list
>      > > torqueusers at supercluster.org
>      > > http://www.supercluster.org/mailman/listinfo/torqueusers
>      > >
>      > >
>      > >
>      > >
>      > > --
>      > > David Beer | Senior Software Engineer
>      > > Adaptive Computing
>      > >
>      > >
>      > > _______________________________________________
>      > > torqueusers mailing list
>      > > torqueusers at supercluster.org
>      > > http://www.supercluster.org/mailman/listinfo/torqueusers
>      >
>      >     _______________________________________________
>      >     torqueusers mailing list
>      > torqueusers at supercluster.org
>      > http://www.supercluster.org/mailman/listinfo/torqueusers
>      >
>      >
>      >
>      >
>      > --
>      > David Beer | Senior Software Engineer
>      > Adaptive Computing
>      >
>      >
>      > _______________________________________________
>      > torqueusers mailing list
>      > torqueusers at supercluster.org
>      > http://www.supercluster.org/mailman/listinfo/torqueusers
>
>     _______________________________________________
>     torqueusers mailing list
>     torqueusers at supercluster.org
>     http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


