[torqueusers] Multiple mom processes, owned by regular user (Torque 4.2.5, Maui 3.3.1)

Gus Correa gus at ldeo.columbia.edu
Mon Nov 11 12:05:19 MST 2013


Hi Matthew

Somehow, on node30, "lsof -i | egrep 122520\|122524" (and even a plain
"lsof -i") hangs forever, though both obviously work on other nodes.
However, node30's eth0 interface (which pbs_mom uses) is up, and I can
ping across it, ssh through it, etc.


[root at node30 ~]# lsof -i | egrep 122520\|122524

No response.

And from another ssh session:

[root at node30 ~]# ps aux |grep lsof
root     30138  0.0  0.0 105504  1144 pts/0    D+   13:14   0:00 lsof -i
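
Side note: since lsof itself went into D state, it is probably blocked
in the kernel on whatever the stuck pbs_moms are holding. If the kernel
exposes these files (stock RHEL/CentOS 6 kernels should), /proc can show
what a D-state task is waiting on, without attaching to it:

# kernel symbol the task is sleeping in
cat /proc/30138/wchan; echo
# full kernel-side stack, where the kernel provides it
cat /proc/30138/stack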

Likewise, the gdb backtrace hangs:

[root at node30 ~]# gdb --batch -x /tmp/gdb_backtrace_command 
/opt/torque/active/sbin/pbs_mom 29185

No response.

And from another ssh session:

[root at node30 ~]# cat /tmp/gdb_backtrace_command
bt
[root at node30 ~]# ps aux |grep gdb
root     30176  0.0  0.0 133620  9136 pts/1    S+   13:25   0:00 gdb 
--batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom 29185
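
As far as I understand, gdb attaches via ptrace, which has to stop the
target first; a task in uninterruptible sleep cannot be stopped, so gdb
just waits forever. The kernel-side stacks can still be read without
attaching (again assuming the kernel provides /proc/<pid>/stack):

cat /proc/29185/stack
cat /proc/29187/stack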


******************************

Now the same routine on node32, with the socket inodes and pbs_mom
process numbers adjusted.

Socket 305590 was also shared by the root-owned pbs_mom (pid 20286) and
one of the user-owned pbs_moms (pid 2933) on node32; the other
user-owned pbs_mom (pid 2932) holds socket 305589.

[root at node32 ~]# lsof -i | egrep 305589\|305590

Empty output.

However ...

[root at node32 ~]# lsof -i | egrep pbs_mom
pbs_mom   20286   root    5u  IPv4  92105      0t0  TCP *:15002 (LISTEN)
pbs_mom   20286   root    6u  IPv4  92106      0t0  TCP *:15003 (LISTEN)
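
I suppose the empty egrep output makes sense: "lsof -i" only lists
sockets it can match against /proc/net/tcp (or udp), so an inode that
shows up in /proc/<pid>/fd but not there is presumably a socket that
was never bound, was already closed, or is not an INET socket at all.
One way to check directly (inodes appear in decimal in these files):

grep -l 305590 /proc/net/tcp /proc/net/udp /proc/net/unix 2>/dev/null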


The gdb backtrace seems to hang on node32 as well:

[root at node32 ~]# echo bt > /tmp/gdb_backtrace_command
[root at node32 ~]# gdb --batch -x /tmp/gdb_backtrace_command 
/opt/torque/active/sbin/pbs_mom 2932

No response.

[root at node32 ~]# gdb --batch -x /tmp/gdb_backtrace_command 
/opt/torque/active/sbin/pbs_mom 2933

No response.

[root at node32 ~]# ps aux |grep gdb
root      3400  0.0  0.0 133616  9132 pts/0    S+   13:47   0:00 gdb 
--batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom 2932
root      3434  0.3  0.0 133616  9136 pts/2    S+   13:57   0:00 gdb 
--batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom 2933

I didn't explicitly configure Torque with "--with-debug" (unless it is
enabled by default), so I wonder whether we can expect anything from gdb.
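
One quick check of whether the binary carries any symbols at all
(builds are usually stripped unless something like --with-debug is
used):

# "not stripped" means gdb would at least have the symbol table
file /opt/torque/active/sbin/pbs_mom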

Thank you again for all the help.
Gus Correa



On 11/11/2013 12:45 PM, Ezell, Matthew A. wrote:
> We can see that the second pbs_mom shares a socket with the master.
>
> Hopefully you have GDB available on the compute nodes.  Can you show me
> the output of the following commands on node30:
>
> lsof -i | egrep 122520\|122524
> echo bt > /tmp/gdb_backtrace_command
> gdb --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom
> 29185
> gdb --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom
> 29187
>
> gdb --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom
> 17941
>
>
> Thanks,
> ~Matt
>
> ---
> Matt Ezell
> HPC Systems Administrator
> Oak Ridge National Laboratory
>
>
>
>
> On 11/11/13 12:28 PM, "Gus Correa" <gus at ldeo.columbia.edu> wrote:
>
>> On 11/11/2013 10:56 AM, Ezell, Matthew A. wrote:
>>> Hi Gus-
>>>
>>> I've seen pbs_mom processes owned by users when interactive jobs are
>>> running. The root pbs_mom has to fork() and setuid() to the user to
>>> start up an X11 listener (if requested).  In my case they always seem
>>> to go away when the job completes.  I assume, given the age of these
>>> processes, that they did *not* go away when the jobs completed.
>>>
>>> I see that all of your non-root processes are in uninterruptible sleep
>>> (D).  Can you tell what they are waiting on?  Getting a GDB backtrace or
>>> at minimum the listing of /proc/<pid>/fd would be useful.
>>>
>>> Good luck,
>>> ~Matt
>>>
>>> ---
>>> Matt Ezell
>>> HPC Systems Administrator
>>> Oak Ridge National Laboratory
>>>
>>
>> Hi Matthew
>>
>> Thank you for the help!
>>
>> I am thinking of rolling back to Torque 2.5.X or 2.4.Y,
>> but I am not even sure it would solve this nasty problem.
>>
>> Yes, David Beer told me last time that pbs_mom forks into user-owned
>> processes, as you also said.
>>
>> Does this happen only on interactive jobs or also on batch jobs?
>> Only if the user requests X11?
>>
>> However, I thought this would happen very quickly, just until the
>> user-owned pbs_moms launched the actual user processes.
>> Under normal conditions I never had a chance to see these user-owned
>> pbs_moms with "ps".
>>
>> Nevertheless, in the cases I reported, the user-owned pbs_mom
>> processes stick around for some reason, don't go away after the job
>> ends or dies, and seem to break subsequent jobs (which continue to be
>> scheduled onto the affected nodes).
>>
>> I enclose below the listing of /proc/<pbs_mom_procid>/fd for the
>> two affected nodes.
>> I took those nodes offline, but eventually I need to reboot them
>> and put them back into production.
>>
>> The last process listed on each node is the root-owned pbs_mom, which
>> seems to still hold open the cpuset files of jobs 324 and 537,
>> respectively.
>> Those are the jobs the user was running while he also owned the
>> pbs_mom processes, on Oct/23 and Nov/06 respectively.
>> Both jobs' stderr and stdout remained stuck in the nodes' spool
>> directories and were never moved to the user's work directory.
>> Job 324 exceeded the walltime (12h).
>> Job 537 seems to have stopped very quickly (2 seconds cputime),
>> and its stderr is empty.
>>
>> [root at node30 ~]# ls -l /opt/torque/active/spool/324.master.ER
>> -rw------- 1 ltmurray ltmurray 3026 Oct 23 02:55
>> /opt/torque/active/spool/324.master.ER
>>
>> [root at node30 ~]# tail -n 2 /opt/torque/active/spool/324.master.ER
>> % Compiled module: STRREPL.
>> =>>  PBS: job killed: walltime 43204 exceeded limit 43200
>>
>>
>> [root at node32 ~]# ls -l /opt/torque/active/spool/537.master.ER
>> -rw------- 1 ltmurray ltmurray 0 Nov  6 12:49
>> /opt/torque/active/spool/537.master.ER
>>
>> Their tracejob is enclosed.
>>
>> This user sometimes runs interactive jobs, probably exporting X
>> to get the IDL or Matlab GUI.
>> However, these two jobs (324 and 537) appear to be batch jobs [and the
>> executable is parallelized with OpenMP (not MPI), but that probably
>> doesn't matter].
>>
>> Thank you for your help,
>> Gus Correa
>>
>> *******************************node30**********************************
>> [root at node30 ~]# ps aux |grep pbs_mom
>> root     17941  0.0  0.0  88160 34780 ?        SLsl Oct17  19:03
>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>> ltmurray 29185  0.0  0.0  88156 26024 ?        D    Oct23   0:00
>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>> ltmurray 29187  0.0  0.0  88156 26024 ?        D    Oct23   0:00
>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>> root     30099  0.0  0.0 103236   892 pts/0    S+   11:39   0:00 grep
>> pbs_mom
>>
>> [root at node30 ~]# ll /proc/29185/fd
>> total 0
>> lr-x------ 1 root root 64 Nov 11 11:38 0 ->  /dev/null
>> l-wx------ 1 root root 64 Nov 11 11:38 1 ->  /dev/null
>> l-wx------ 1 root root 64 Nov 11 11:38 2 ->  /dev/null
>> lr-x------ 1 root root 64 Nov 11 11:38 7 ->  /
>> lrwx------ 1 root root 64 Nov 11 11:38 9 ->  socket:[122520]
>>
>> [root at node30 ~]# ll /proc/29187/fd
>> total 0
>> lr-x------ 1 root root 64 Nov 11 11:39 0 ->  /dev/null
>> l-wx------ 1 root root 64 Nov 11 11:39 1 ->  /dev/null
>> lrwx------ 1 root root 64 Nov 11 11:39 10 ->  socket:[122524]
>> l-wx------ 1 root root 64 Nov 11 11:39 2 ->  /dev/null
>> lr-x------ 1 root root 64 Nov 11 11:39 7 ->  /
>>
>> [root at node30 ~]# ll /proc/17941/fd
>> total 0
>> lr-x------ 1 root root 64 Nov 11 11:52 0 ->  /dev/null
>> l-wx------ 1 root root 64 Nov 11 11:52 1 ->  /dev/null
>> lrwx------ 1 root root 64 Nov 11 11:52 10 ->  socket:[122524]
>> lr-x------ 1 root root 64 Nov 11 11:52 11 ->  /dev/cpuset/torque/324.master
>> l-wx------ 1 root root 64 Nov 11 11:52 2 ->  /dev/null
>> l-wx------ 1 root root 64 Nov 11 11:52 3 ->
>> /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131023
>> l-wx------ 1 root root 64 Nov 11 11:52 4 ->
>> /opt/torque/4.2.5/gnu-4.4.7/mom_priv/mom.lock
>> lrwx------ 1 root root 64 Nov 11 11:52 5 ->  socket:[81628]
>> lrwx------ 1 root root 64 Nov 11 11:52 6 ->  socket:[81629]
>> lr-x------ 1 root root 64 Nov 11 11:52 7 ->  /
>> lr-x------ 1 root root 64 Nov 11 11:52 8 ->  /proc
>> lrwx------ 1 root root 64 Nov 11 11:52 9 ->  socket:[122546]
>>
>> ***********************************************************************
>>
>> *******************************node32**********************************
>> [root at node32 ~]# ps aux |grep pbs_mom
>> ltmurray  2932  0.0  0.0  87932 25804 ?        D    Nov06   0:00
>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>> ltmurray  2933  0.0  0.0  87932 25804 ?        D    Nov06   0:00
>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>> root      3267  0.0  0.0 103236   892 pts/0    S+   11:59   0:00 grep
>> pbs_mom
>> root     20286  0.3  0.0  87936 34556 ?        SLsl Oct17 109:36
>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>> [root at node32 ~]# ll /proc/2932/fd
>> total 0
>> lr-x------ 1 root root 64 Nov 11 11:58 0 ->  /dev/null
>> l-wx------ 1 root root 64 Nov 11 11:58 1 ->  /dev/null
>> l-wx------ 1 root root 64 Nov 11 11:58 2 ->  /dev/null
>> lr-x------ 1 root root 64 Nov 11 11:58 7 ->  /
>> lrwx------ 1 root root 64 Nov 11 11:58 9 ->  socket:[305589]
>> [root at node32 ~]# ll /proc/2933/fd
>> total 0
>> lr-x------ 1 root root 64 Nov 11 11:58 0 ->  /dev/null
>> l-wx------ 1 root root 64 Nov 11 11:58 1 ->  /dev/null
>> lrwx------ 1 root root 64 Nov 11 11:58 10 ->  socket:[305590]
>> l-wx------ 1 root root 64 Nov 11 11:58 2 ->  /dev/null
>> lr-x------ 1 root root 64 Nov 11 11:58 7 ->  /
>> [root at node32 ~]# ll /proc/20286/fd
>> total 0
>> lr-x------ 1 root root 64 Nov 11 11:58 0 ->  /dev/null
>> l-wx------ 1 root root 64 Nov 11 11:58 1 ->  /dev/null
>> lrwx------ 1 root root 64 Nov 11 11:58 10 ->  socket:[305590]
>> lrwx------ 1 root root 64 Nov 11 11:58 11 ->  socket:[305612]
>> l-wx------ 1 root root 64 Nov 11 11:58 2 ->  /dev/null
>> l-wx------ 1 root root 64 Nov 11 11:58 3 ->
>> /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131106
>> l-wx------ 1 root root 64 Nov 11 11:58 4 ->
>> /opt/torque/4.2.5/gnu-4.4.7/mom_priv/mom.lock
>> lrwx------ 1 root root 64 Nov 11 11:58 5 ->  socket:[92105]
>> lrwx------ 1 root root 64 Nov 11 11:58 6 ->  socket:[92106]
>> lr-x------ 1 root root 64 Nov 11 11:58 7 ->  /
>> lr-x------ 1 root root 64 Nov 11 11:58 8 ->  /proc
>> lr-x------ 1 root root 64 Nov 11 11:58 9 ->  /dev/cpuset/torque/537.master
>> ***********************************************************************
>>
>> ****************************tracejob 324********************************
>> Job: 324.master
>>
>> 10/22/2013 10:48:41  S    enqueuing into production, state 1 hop 1
>> 10/22/2013 10:48:41  A    queue=production
>> 10/22/2013 14:55:35  S    Dependency on job 323.master released.
>> 10/22/2013 14:55:36  S    Job Run at request of maui at master
>> 10/22/2013 14:55:36  A    user=ltmurray group=ltmurray
>> jobname=GC.Base.1984.03 queue=production ctime=1382453321
>> qtime=1382453321 etime=1382468135 start=1382468136 owner=ltmurray at master
>>
>> exec_host=node30/0+node30/1+node30/2+node30/3+node30/4+node30/5+node30/6+n
>> ode30/7+node30/8+node30/9+node30/10+node30/11+node30/12+node30/13+node30/1
>> 4+node30/15+node30/16+node30/17+node30/18+node30/19+node30/20+node30/21+no
>> de30/22+node30/23
>>                            Resource_List.neednodes=1:ppn=24
>> Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>> Resource_List.walltime=12:00:00
>> 10/23/2013 02:55:46  S    Exit_status=-11 resources_used.cput=66:14:47
>> resources_used.mem=4800720kb resources_used.vmem=6672108kb
>> resources_used.walltime=12:00:10
>> 10/23/2013 02:55:46  S    on_job_exit valid pjob: 324.master (substate=50)
>> 10/23/2013 02:55:46  A    user=ltmurray group=ltmurray
>> jobname=GC.Base.1984.03 queue=production ctime=1382453321
>> qtime=1382453321 etime=1382468135 start=1382468136 owner=ltmurray at master
>>
>> exec_host=node30/0+node30/1+node30/2+node30/3+node30/4+node30/5+node30/6+n
>> ode30/7+node30/8+node30/9+node30/10+node30/11+node30/12+node30/13+node30/1
>> 4+node30/15+node30/16+node30/17+node30/18+node30/19+node30/20+node30/21+no
>> de30/22+node30/23
>>                            Resource_List.neednodes=1:ppn=24
>> Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>> Resource_List.walltime=12:00:00 session=28077 end=1382511346
>> Exit_status=-11 resources_used.cput=66:14:47 resources_used.mem=4800720kb
>>                            resources_used.vmem=6672108kb
>> resources_used.walltime=12:00:10
>> 10/23/2013 02:56:13  S    on_job_exit valid pjob: 324.master (substate=52)
>> 10/23/2013 03:05:39  S    Request invalid for state of job EXITING
>> 10/23/2013 03:05:43  S    Request invalid for state of job EXITING
>> 10/23/2013 07:05:52  S    dequeuing from production, state COMPLETE
>>
>> ************************************************************************
>>
>>
>> *******************************tracejob 537****************************
>> Job: 537.master
>>
>> 11/06/2013 12:48:58  S    enqueuing into production, state 1 hop 1
>> 11/06/2013 12:48:58  A    queue=production
>> 11/06/2013 12:49:00  S    Job Run at request of maui at master
>> 11/06/2013 12:49:00  A    user=ltmurray group=ltmurray
>> jobname=GC.AQAST.2005.01.01 queue=production ctime=1383760138
>> qtime=1383760138 etime=1383760138 start=1383760140 owner=ltmurray at master
>>
>> exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+n
>> ode32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/1
>> 4+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+no
>> de32/22+node32/23
>>                            Resource_List.neednodes=1:ppn=24
>> Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>> Resource_List.walltime=12:00:00
>> 11/06/2013 13:09:21  S    Job deleted at request of ltmurray at master
>> 11/06/2013 13:09:21  S    Job sent signal SIGTERM on delete
>> 11/06/2013 13:09:21  A    requestor=ltmurray at master
>> 11/06/2013 13:11:06  S    Job sent signal SIGKILL on delete
>> 11/06/2013 13:11:06  S    Exit_status=271 resources_used.cput=00:00:02
>> resources_used.mem=7480172kb resources_used.vmem=9157780kb
>> resources_used.walltime=00:22:06
>> 11/06/2013 13:11:06  S    on_job_exit valid pjob: 537.master (substate=50)
>> 11/06/2013 13:11:06  A    user=ltmurray group=ltmurray
>> jobname=GC.AQAST.2005.01.01 queue=production ctime=1383760138
>> qtime=1383760138 etime=1383760138 start=1383760140 owner=ltmurray at master
>>
>> exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+n
>> ode32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/1
>> 4+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+no
>> de32/22+node32/23
>>                            Resource_List.neednodes=1:ppn=24
>> Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>> Resource_List.walltime=12:00:00 session=2857 end=1383761466
>> Exit_status=271 resources_used.cput=00:00:02 resources_used.mem=7480172kb
>>                            resources_used.vmem=9157780kb
>> resources_used.walltime=00:22:06
>> 11/06/2013 13:11:29  S    on_job_exit valid pjob: 537.master (substate=52)
>> 11/06/2013 17:21:09  S    dequeuing from production, state COMPLETE
>> ***********************************************************************
>>
>>
>>
>>> On 11/8/13 3:12 PM, "Gus Correa" <gus at ldeo.columbia.edu> wrote:
>>>
>>>> Dear Torque experts
>>>>
>>>> I am using Torque 4.2.5 and Maui 3.3.1 on a 34-node cluster.
>>>>
>>>> I reported a similar problem a few weeks ago.
>>>> It disappeared after the nodes were rebooted, but it is back.
>>>>
>>>> Some nodes seem to "develop" or create *multiple pbs_mom processes*
>>>> after they run for a while. Those are the more heavily used nodes
>>>> (higher node numbers).
>>>> See the printout below, especially nodes 30-32.
>>>>
>>>> The duplicate pbs_mom processes are owned by a regular user, which is
>>>> rather awkward.
>>>>
>>>> As a result, multi-node jobs that land on the affected nodes fail
>>>> to run.
>>>> I enclose below a tracejob sample of one of those jobs.
>>>>
>>>> What could be causing this problem, and how can it be solved?
>>>>
>>>> Thank you,
>>>> Gus Correa


