[torqueusers] Multiple mom processes, owned by regular user (Torque 4.2.5, Maui 3.3.1)

Ezell, Matthew A. ezellma at ornl.gov
Mon Nov 11 10:45:19 MST 2013


We can see that the second user-owned pbs_mom shares a socket with the
master (root-owned) pbs_mom.

Hopefully you have GDB available on the compute nodes.  Can you show me
the output of the following commands on node30:

lsof -i | egrep '122520|122524'
echo bt > /tmp/gdb_backtrace_command
gdb --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom 29185
gdb --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom 29187
gdb --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom 17941
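
If it is easier, a loop along these lines should grab backtraces from
every pbs_mom on the node in one pass (just a sketch, assuming pgrep is
available; adjust the binary path if yours differs):

for pid in $(pgrep pbs_mom); do
    echo "=== pbs_mom PID $pid ==="
    # attach, dump backtraces for all threads, then detach
    gdb --batch -ex 'thread apply all bt' /opt/torque/active/sbin/pbs_mom "$pid"
done

Note that attaching to a process stuck in D state can itself hang; if it
does, 'cat /proc/<pid>/stack' (when the kernel exposes it) will at least
show the kernel-side stack.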


Thanks,
~Matt

---
Matt Ezell
HPC Systems Administrator
Oak Ridge National Laboratory




On 11/11/13 12:28 PM, "Gus Correa" <gus at ldeo.columbia.edu> wrote:

>On 11/11/2013 10:56 AM, Ezell, Matthew A. wrote:
>> Hi Gus-
>>
>> I've seen pbs_mom processes owned by users when interactive jobs are
>> running. The root pbs_mom has to fork() and setuid() to the user to
>> start up an X11 listener (if requested).  In my case they always seem
>> to go away when the job completes.  Given the age of these processes,
>> I assume they did *not* go away when the jobs completed.
>>
>> I see that all of your non-root processes are in uninterruptible sleep
>> (D).  Can you tell what they are waiting on?  Getting a GDB backtrace,
>> or at a minimum a listing of /proc/<pid>/fd, would be useful.
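>>
>> For example, something like this should show which kernel wait channel
>> each pbs_mom is blocked in (a sketch; the wchan names depend on your
>> kernel):
>>
>> ps -o pid,user,stat,wchan:30 -C pbs_mom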
>>
>> Good luck,
>> ~Matt
>>
>> ---
>> Matt Ezell
>> HPC Systems Administrator
>> Oak Ridge National Laboratory
>>
>
>Hi Matthew
>
>Thank you for the help!
>
>I am thinking of rolling back to Torque 2.5.X or 2.4.Y,
>but I am not even sure it would solve this nasty problem.
>
>Yes, David Beer told me last time that pbs_mom forks into user-owned
>processes, as you also said.
>
>Does this happen only on interactive jobs or also on batch jobs?
>Only if the user requests X11?
>
>However, I thought this would happen very quickly, just until the
>user-owned pbs_moms launched the actual user processes.
>Under normal conditions I never had a chance to see these user-owned
>pbs_moms with "ps".
>
>Nevertheless, in the cases I reported, the user-owned pbs_mom
>processes stick around for some reason: they don't go away after the job
>ends or dies, and they seem to break subsequent jobs (which continue to
>be scheduled on the affected node).
>
>I enclose below the listing of /proc/<pbs_mom_procid>/fd for the
>two affected nodes.
>I took those nodes offline,
>but eventually I will need to reboot them and put them back into production.
>
>The last process listed is the root-owned pbs_mom, which still seems to
>have the cpuset files of jobs 324 and 537, respectively, open.
>Those are the jobs the user was running while he also owned the
>pbs_mom processes, on Oct/23 and Nov/06 respectively.
>Both jobs' stderr and stdout remained in the nodes' spool directories
>and were never copied back to the user's work directory.
>Job 324 exceeded its walltime (12h).
>Job 537 seems to have stopped very quickly (2 seconds of CPU time),
>and its stderr is empty.
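>
>If it helps, I can also check whether anything is still attached to those
>stale cpusets, along these lines (not run yet; the paths assume the cpuset
>filesystem is mounted on /dev/cpuset, as in the fd listings below):
>
>cat /dev/cpuset/torque/324.master/tasks   # on node30
>cat /dev/cpuset/torque/537.master/tasks   # on node32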
>
>[root at node30 ~]# ls -l /opt/torque/active/spool/324.master.ER
>-rw------- 1 ltmurray ltmurray 3026 Oct 23 02:55
>/opt/torque/active/spool/324.master.ER
>
>[root at node30 ~]# tail -n 2 /opt/torque/active/spool/324.master.ER
>% Compiled module: STRREPL.
>=>> PBS: job killed: walltime 43204 exceeded limit 43200
>
>
>[root at node32 ~]# ls -l /opt/torque/active/spool/537.master.ER
>-rw------- 1 ltmurray ltmurray 0 Nov  6 12:49
>/opt/torque/active/spool/537.master.ER
>
>Their tracejob is enclosed.
>
>This user sometimes runs interactive jobs, probably exporting X
>to get the IDL or Matlab GUI.
>However, these two jobs (324 and 537) seem to be batch jobs [and the
>executable is parallelized with OpenMP (not MPI),
>but that probably doesn't matter].
>
>Thank you for your help,
>Gus Correa
>
>*******************************node30**********************************
>[root at node30 ~]# ps aux |grep pbs_mom
>root     17941  0.0  0.0  88160 34780 ?        SLsl Oct17  19:03
>/opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>ltmurray 29185  0.0  0.0  88156 26024 ?        D    Oct23   0:00
>/opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>ltmurray 29187  0.0  0.0  88156 26024 ?        D    Oct23   0:00
>/opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>root     30099  0.0  0.0 103236   892 pts/0    S+   11:39   0:00 grep
>pbs_mom
>
>[root at node30 ~]# ll /proc/29185/fd
>total 0
>lr-x------ 1 root root 64 Nov 11 11:38 0 -> /dev/null
>l-wx------ 1 root root 64 Nov 11 11:38 1 -> /dev/null
>l-wx------ 1 root root 64 Nov 11 11:38 2 -> /dev/null
>lr-x------ 1 root root 64 Nov 11 11:38 7 -> /
>lrwx------ 1 root root 64 Nov 11 11:38 9 -> socket:[122520]
>
>[root at node30 ~]# ll /proc/29187/fd
>total 0
>lr-x------ 1 root root 64 Nov 11 11:39 0 -> /dev/null
>l-wx------ 1 root root 64 Nov 11 11:39 1 -> /dev/null
>lrwx------ 1 root root 64 Nov 11 11:39 10 -> socket:[122524]
>l-wx------ 1 root root 64 Nov 11 11:39 2 -> /dev/null
>lr-x------ 1 root root 64 Nov 11 11:39 7 -> /
>
>[root at node30 ~]# ll /proc/17941/fd
>total 0
>lr-x------ 1 root root 64 Nov 11 11:52 0 -> /dev/null
>l-wx------ 1 root root 64 Nov 11 11:52 1 -> /dev/null
>lrwx------ 1 root root 64 Nov 11 11:52 10 -> socket:[122524]
>lr-x------ 1 root root 64 Nov 11 11:52 11 -> /dev/cpuset/torque/324.master
>l-wx------ 1 root root 64 Nov 11 11:52 2 -> /dev/null
>l-wx------ 1 root root 64 Nov 11 11:52 3 ->
>/opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131023
>l-wx------ 1 root root 64 Nov 11 11:52 4 ->
>/opt/torque/4.2.5/gnu-4.4.7/mom_priv/mom.lock
>lrwx------ 1 root root 64 Nov 11 11:52 5 -> socket:[81628]
>lrwx------ 1 root root 64 Nov 11 11:52 6 -> socket:[81629]
>lr-x------ 1 root root 64 Nov 11 11:52 7 -> /
>lr-x------ 1 root root 64 Nov 11 11:52 8 -> /proc
>lrwx------ 1 root root 64 Nov 11 11:52 9 -> socket:[122546]
>
>***********************************************************************
>
>*******************************node32**********************************
>[root at node32 ~]# ps aux |grep pbs_mom
>ltmurray  2932  0.0  0.0  87932 25804 ?        D    Nov06   0:00
>/opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>ltmurray  2933  0.0  0.0  87932 25804 ?        D    Nov06   0:00
>/opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>root      3267  0.0  0.0 103236   892 pts/0    S+   11:59   0:00 grep
>pbs_mom
>root     20286  0.3  0.0  87936 34556 ?        SLsl Oct17 109:36
>/opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>[root at node32 ~]# ll /proc/2932/fd
>total 0
>lr-x------ 1 root root 64 Nov 11 11:58 0 -> /dev/null
>l-wx------ 1 root root 64 Nov 11 11:58 1 -> /dev/null
>l-wx------ 1 root root 64 Nov 11 11:58 2 -> /dev/null
>lr-x------ 1 root root 64 Nov 11 11:58 7 -> /
>lrwx------ 1 root root 64 Nov 11 11:58 9 -> socket:[305589]
>[root at node32 ~]# ll /proc/2933/fd
>total 0
>lr-x------ 1 root root 64 Nov 11 11:58 0 -> /dev/null
>l-wx------ 1 root root 64 Nov 11 11:58 1 -> /dev/null
>lrwx------ 1 root root 64 Nov 11 11:58 10 -> socket:[305590]
>l-wx------ 1 root root 64 Nov 11 11:58 2 -> /dev/null
>lr-x------ 1 root root 64 Nov 11 11:58 7 -> /
>[root at node32 ~]# ll /proc/20286/fd
>total 0
>lr-x------ 1 root root 64 Nov 11 11:58 0 -> /dev/null
>l-wx------ 1 root root 64 Nov 11 11:58 1 -> /dev/null
>lrwx------ 1 root root 64 Nov 11 11:58 10 -> socket:[305590]
>lrwx------ 1 root root 64 Nov 11 11:58 11 -> socket:[305612]
>l-wx------ 1 root root 64 Nov 11 11:58 2 -> /dev/null
>l-wx------ 1 root root 64 Nov 11 11:58 3 ->
>/opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131106
>l-wx------ 1 root root 64 Nov 11 11:58 4 ->
>/opt/torque/4.2.5/gnu-4.4.7/mom_priv/mom.lock
>lrwx------ 1 root root 64 Nov 11 11:58 5 -> socket:[92105]
>lrwx------ 1 root root 64 Nov 11 11:58 6 -> socket:[92106]
>lr-x------ 1 root root 64 Nov 11 11:58 7 -> /
>lr-x------ 1 root root 64 Nov 11 11:58 8 -> /proc
>lr-x------ 1 root root 64 Nov 11 11:58 9 -> /dev/cpuset/torque/537.master
>***********************************************************************
>
>****************************tracejob 324********************************
>Job: 324.master
>
>10/22/2013 10:48:41  S    enqueuing into production, state 1 hop 1
>10/22/2013 10:48:41  A    queue=production
>10/22/2013 14:55:35  S    Dependency on job 323.master released.
>10/22/2013 14:55:36  S    Job Run at request of maui at master
>10/22/2013 14:55:36  A    user=ltmurray group=ltmurray
>jobname=GC.Base.1984.03 queue=production ctime=1382453321
>qtime=1382453321 etime=1382468135 start=1382468136 owner=ltmurray at master
> 
>exec_host=node30/0+node30/1+node30/2+node30/3+node30/4+node30/5+node30/6+n
>ode30/7+node30/8+node30/9+node30/10+node30/11+node30/12+node30/13+node30/1
>4+node30/15+node30/16+node30/17+node30/18+node30/19+node30/20+node30/21+no
>de30/22+node30/23
>                           Resource_List.neednodes=1:ppn=24
>Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>Resource_List.walltime=12:00:00
>10/23/2013 02:55:46  S    Exit_status=-11 resources_used.cput=66:14:47
>resources_used.mem=4800720kb resources_used.vmem=6672108kb
>resources_used.walltime=12:00:10
>10/23/2013 02:55:46  S    on_job_exit valid pjob: 324.master (substate=50)
>10/23/2013 02:55:46  A    user=ltmurray group=ltmurray
>jobname=GC.Base.1984.03 queue=production ctime=1382453321
>qtime=1382453321 etime=1382468135 start=1382468136 owner=ltmurray at master
> 
>exec_host=node30/0+node30/1+node30/2+node30/3+node30/4+node30/5+node30/6+n
>ode30/7+node30/8+node30/9+node30/10+node30/11+node30/12+node30/13+node30/1
>4+node30/15+node30/16+node30/17+node30/18+node30/19+node30/20+node30/21+no
>de30/22+node30/23
>                           Resource_List.neednodes=1:ppn=24
>Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>Resource_List.walltime=12:00:00 session=28077 end=1382511346
>Exit_status=-11 resources_used.cput=66:14:47 resources_used.mem=4800720kb
>                           resources_used.vmem=6672108kb
>resources_used.walltime=12:00:10
>10/23/2013 02:56:13  S    on_job_exit valid pjob: 324.master (substate=52)
>10/23/2013 03:05:39  S    Request invalid for state of job EXITING
>10/23/2013 03:05:43  S    Request invalid for state of job EXITING
>10/23/2013 07:05:52  S    dequeuing from production, state COMPLETE
>
>************************************************************************
>
>
>*******************************tracejob 537****************************
>Job: 537.master
>
>11/06/2013 12:48:58  S    enqueuing into production, state 1 hop 1
>11/06/2013 12:48:58  A    queue=production
>11/06/2013 12:49:00  S    Job Run at request of maui at master
>11/06/2013 12:49:00  A    user=ltmurray group=ltmurray
>jobname=GC.AQAST.2005.01.01 queue=production ctime=1383760138
>qtime=1383760138 etime=1383760138 start=1383760140 owner=ltmurray at master
> 
>exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+n
>ode32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/1
>4+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+no
>de32/22+node32/23
>                           Resource_List.neednodes=1:ppn=24
>Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>Resource_List.walltime=12:00:00
>11/06/2013 13:09:21  S    Job deleted at request of ltmurray at master
>11/06/2013 13:09:21  S    Job sent signal SIGTERM on delete
>11/06/2013 13:09:21  A    requestor=ltmurray at master
>11/06/2013 13:11:06  S    Job sent signal SIGKILL on delete
>11/06/2013 13:11:06  S    Exit_status=271 resources_used.cput=00:00:02
>resources_used.mem=7480172kb resources_used.vmem=9157780kb
>resources_used.walltime=00:22:06
>11/06/2013 13:11:06  S    on_job_exit valid pjob: 537.master (substate=50)
>11/06/2013 13:11:06  A    user=ltmurray group=ltmurray
>jobname=GC.AQAST.2005.01.01 queue=production ctime=1383760138
>qtime=1383760138 etime=1383760138 start=1383760140 owner=ltmurray at master
> 
>exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+n
>ode32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/1
>4+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+no
>de32/22+node32/23
>                           Resource_List.neednodes=1:ppn=24
>Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>Resource_List.walltime=12:00:00 session=2857 end=1383761466
>Exit_status=271 resources_used.cput=00:00:02 resources_used.mem=7480172kb
>                           resources_used.vmem=9157780kb
>resources_used.walltime=00:22:06
>11/06/2013 13:11:29  S    on_job_exit valid pjob: 537.master (substate=52)
>11/06/2013 17:21:09  S    dequeuing from production, state COMPLETE
>***********************************************************************
>
>
>
>> On 11/8/13 3:12 PM, "Gus Correa" <gus at ldeo.columbia.edu> wrote:
>>
>>> Dear Torque experts
>>>
>>> I am using Torque 4.2.5 and Maui 3.3.1 on a 34-node cluster.
>>>
>>> I reported a similar problem a few weeks ago.
>>> It disappeared after the nodes were rebooted, but it is back.
>>>
>>> Some nodes seem to "develop" or create *multiple pbs_mom processes*
>>> after they run for a while.
>>> Those are nodes that are used more heavily
>>> (higher node numbers).
>>> See the printout below, especially nodes 30-32.
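>>>
>>> For what it's worth, I spot them with something like this (a sketch,
>>> assuming pdsh is available and adjusting the node range to your naming;
>>> any parallel shell would do):
>>>
>>> pdsh -w node[01-34] 'ps -C pbs_mom -o user,pid,stat,lstart --no-headers'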
>>>
>>> The duplicate pbs_mom processes are owned by a regular user, which is
>>> rather awkward.
>>>
>>> As a result, multi-node jobs that land on the affected nodes fail to run.
>>> I enclose below a tracejob sample of one of those jobs.
>>>
>>> What could be causing this problem, and how can it be solved?
>>>
>>> Thank you,
>>> Gus Correa
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers


