[torqueusers] Multiple mom processes, owned by regular user (Torque 4.2.5, Maui 3.3.1)

Ezell, Matthew A. ezellma at ornl.gov
Mon Nov 11 08:56:35 MST 2013


Hi Gus-

I've seen pbs_mom processes owned by users when interactive jobs are
running. The root pbs_mom has to fork() and setuid() to the user to
start up an X11 listener (if requested).  In my experience they always
go away when the job completes.  Given the age of these processes, I
assume they did *not* go away when the jobs completed.
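
One quick way to tell a legitimate forked X11 listener from a leftover
is to check the parent PID and start time of the user-owned moms.
Something like the following should do it (untested sketch; the [p]
just keeps grep from matching itself):

  # list every pbs_mom with its parent PID, owner, state, and start time
  ps -eo pid,ppid,user,stat,lstart,args | grep '[p]bs_mom'

A live X11 listener should report the root pbs_mom's PID as its PPID,
while an orphaned leftover will usually show PPID 1 and a start time
matching a long-finished job.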

I see that all of your non-root processes are in uninterruptible sleep
(D).  Can you tell what they are waiting on?  A GDB backtrace, or at a
minimum a listing of /proc/<pid>/fd, would be useful.
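
Something along these lines should capture the relevant bits (replace
<pid> with the PID of one of the stuck moms; note that gdb may itself
hang when attaching to a process that never leaves D state):

  # open file descriptors of the stuck mom
  ls -l /proc/<pid>/fd

  # kernel symbol the process is blocked in (no trailing newline, hence the echo)
  cat /proc/<pid>/wchan; echo

  # backtrace of all threads; may block on a D-state process
  gdb -p <pid> -batch -ex 'thread apply all bt'

If the wchan points at an NFS or other storage call, that would at
least explain why these processes never exit.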

Good luck,
~Matt

---
Matt Ezell
HPC Systems Administrator
Oak Ridge National Laboratory




On 11/8/13 3:12 PM, "Gus Correa" <gus at ldeo.columbia.edu> wrote:

>Dear Torque experts
>
>I am using Torque 4.2.5 and Maui 3.3.1 on a 34-node cluster.
>
>I reported a similar problem a few weeks ago.
>It disappeared after the nodes were rebooted, but it is back.
>
>Some nodes seem to "develop" or create *multiple pbs_mom processes*
>after they run for a while.
>Those are nodes that are used more heavily
>(higher node numbers).
>See the printout below, especially nodes 30-32.
>
>The duplicate pbs_mom processes are owned by a regular user, which is
>rather odd.
>
>As a result, multi-node jobs that land on the affected nodes fail to run.
>I enclose below a tracejob sample of one of those jobs.
>
>What could be causing this problem, and how can it be solved?
>
>Thank you,
>Gus Correa
>
>###################################################################
>
>node29: 	root     17789  0.2  0.0  88192 34816 ?        SLsl Oct17
>68:28 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>node29: 	root     28902  0.0  0.0 106056  1476 ?        Ss   15:00
>0:00 bash -c ps aux |grep pbs_mom
>node29: 	root     28918  0.0  0.0 103236   888 ?        S    15:00
>0:00 grep pbs_mom
>node30: 	root     17941  0.0  0.0  88160 34780 ?        SLsl Oct17
>18:51 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>node30: 	ltmurray 29185  0.0  0.0  88156 26024 ?        D    Oct23
>0:00 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>node30: 	ltmurray 29187  0.0  0.0  88156 26024 ?        D    Oct23
>0:00 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>node30: 	root     29696  0.0  0.0 106056  1476 ?        Ss   15:00
>0:00 bash -c ps aux |grep pbs_mom
>node30: 	root     29712  0.0  0.0 103236   892 ?        S    15:00
>0:00 grep pbs_mom
>node31: 	root       908  0.0  0.0 106056  1476 ?        Ss   15:00
>0:00 bash -c ps aux |grep pbs_mom
>node31: 	root       925  0.0  0.0 103236   892 ?        R    15:00
>0:00 grep pbs_mom
>node31: 	root     18675  0.2  0.0  88192 34812 ?        SLsl Oct17
>82:53 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>node32: 	ltmurray  2932  0.0  0.0  87932 25804 ?        D    Nov06
>0:00 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>node32: 	ltmurray  2933  0.0  0.0  87932 25804 ?        D    Nov06
>0:00 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>node32: 	root      3006  0.0  0.0 106056  1472 ?        Ss   15:00
>0:00 bash -c ps aux |grep pbs_mom
>node32: 	root      3022  0.0  0.0 103236   888 ?        S    15:00
>0:00 grep pbs_mom
>node32: 	root     20286  0.3  0.0  87936 34556 ?        SLsl Oct17
>109:23 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>node33: 	root      8487  0.2  0.0  87940 34560 ?        SLsl Oct17
>88:30 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>node33: 	root     31855  0.0  0.0 106056  1476 ?        Ss   15:00
>0:00 bash -c ps aux |grep pbs_mom
>node33: 	root     31871  0.0  0.0 103236   892 ?        S    15:00
>0:00 grep pbs_mom
>node34: 	root     13010  0.0  0.0 106056  1476 ?        Ss   15:00
>0:00 bash -c ps aux |grep pbs_mom
>node34: 	root     13026  0.0  0.0 103236   888 ?        S    15:00
>0:00 grep pbs_mom
>node34: 	root     17071  0.3  0.0  88160 34784 ?        SLsl Oct17
>110:01 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>
>
>
>
>###################################################################
>
>
># tracejob 575
>
>/opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131108: No such file or directory
>/opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131108: No such file or directory
>
>Job: 575.master
>
>11/08/2013 12:18:09  S    enqueuing into production, state 1 hop 1
>11/08/2013 12:18:09  A    queue=production
>11/08/2013 12:18:10  S    Job Run at request of maui at master
>11/08/2013 12:18:10  S    Not sending email: User does not want mail of
>this type.
>11/08/2013 12:18:10  A    user=sw2526 group=sw2526 jobname=test1
>queue=production ctime=1383931089 qtime=1383931089 etime=1383931089
>start=1383931090 owner=sw2526 at brewer.ldeo.columbia.edu
> 
>exec_host=node33/0+node33/1+node33/2+node33/3+node33/4+node33/5+node33/6+n
>ode33/7+node33/8+node33/9+node33/10+node33/11+node33/12+node33/13+node33/1
>4+node33/15+node33/16+node33/17+node33/18+node33/19+node33/20+node33/21+no
>de33/22+node33/23+node33/24+node33/25+node33/26+node33/27+node33/28+node33
>/29+node33/30+node33/31+node32/0+node32/1+node32/2+node32/3+node32/4+node3
>2/5+node32/6+node32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node
>32/13+node32/14+node32/15+node32/16+node32/17+node32/18+node32/19+node32/2
>0+node32/21+node32/22+node32/23+node32/24+node32/25+node32/26+node32/27+no
>de32/28+node32/29+node32/30+node32/31
>                           Resource_List.neednodes=2:ppn=32
>Resource_List.nodect=2 Resource_List.nodes=2:ppn=32
>Resource_List.walltime=12:00:00
>11/08/2013 12:27:03  S    DeleteJob issue_Drequest failure, rc = 15033
>11/08/2013 12:27:04  S    Job Run at request of maui at master
>11/08/2013 12:27:06  S    Not sending email: User does not want mail of
>this type.
>11/08/2013 12:27:06  A    user=sw2526 group=sw2526 jobname=test1
>queue=production ctime=1383931089 qtime=1383931089 etime=1383931089
>start=1383931626 owner=sw2526 at brewer.ldeo.columbia.edu
> 
>exec_host=node33/0+node33/1+node33/2+node33/3+node33/4+node33/5+node33/6+n
>ode33/7+node33/8+node33/9+node33/10+node33/11+node33/12+node33/13+node33/1
>4+node33/15+node33/16+node33/17+node33/18+node33/19+node33/20+node33/21+no
>de33/22+node33/23+node33/24+node33/25+node33/26+node33/27+node33/28+node33
>/29+node33/30+node33/31+node32/0+node32/1+node32/2+node32/3+node32/4+node3
>2/5+node32/6+node32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node
>32/13+node32/14+node32/15+node32/16+node32/17+node32/18+node32/19+node32/2
>0+node32/21+node32/22+node32/23+node32/24+node32/25+node32/26+node32/27+no
>de32/28+node32/29+node32/30+node32/31
>                           Resource_List.neednodes=2:ppn=32
>Resource_List.nodect=2 Resource_List.nodes=2:ppn=32
>Resource_List.walltime=12:00:00
>11/08/2013 12:32:03  S    Exit_status=0 resources_used.cput=00:00:00
>resources_used.mem=0kb resources_used.vmem=0kb
>resources_used.walltime=00:00:00
>11/08/2013 12:32:03  S    Not sending email: User does not want mail of
>this type.
>11/08/2013 12:32:03  S    on_job_exit valid pjob: 575.master (substate=50)
>11/08/2013 12:32:03  A    user=sw2526 group=sw2526 jobname=test1
>queue=production ctime=1383931089 qtime=1383931089 etime=1383931089
>start=1383931626 owner=sw2526 at brewer.ldeo.columbia.edu
> 
>exec_host=node33/0+node33/1+node33/2+node33/3+node33/4+node33/5+node33/6+n
>ode33/7+node33/8+node33/9+node33/10+node33/11+node33/12+node33/13+node33/1
>4+node33/15+node33/16+node33/17+node33/18+node33/19+node33/20+node33/21+no
>de33/22+node33/23+node33/24+node33/25+node33/26+node33/27+node33/28+node33
>/29+node33/30+node33/31+node32/0+node32/1+node32/2+node32/3+node32/4+node3
>2/5+node32/6+node32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node
>32/13+node32/14+node32/15+node32/16+node32/17+node32/18+node32/19+node32/2
>0+node32/21+node32/22+node32/23+node32/24+node32/25+node32/26+node32/27+no
>de32/28+node32/29+node32/30+node32/31
>                           Resource_List.neednodes=2:ppn=32
>Resource_List.nodect=2 Resource_List.nodes=2:ppn=32
>Resource_List.walltime=12:00:00 session=0 end=1383931923 Exit_status=0
>resources_used.cput=00:00:00 resources_used.mem=0kb
>resources_used.vmem=0kb
>                           resources_used.walltime=00:00:00
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers


