[torqueusers] Multiple mom processes, owned by regular user (Torque 4.2.5, Maui 3.3.1)

Gus Correa gus at ldeo.columbia.edu
Mon Nov 11 10:28:43 MST 2013


On 11/11/2013 10:56 AM, Ezell, Matthew A. wrote:
> Hi Gus-
>
> I've seen pbs_mom processes owned by users when interactive jobs are
> running. The root pbs_mom has to fork() and setuid() to the user to
> start up an X11 listener (if requested).  In my case they always seem to go
> away when the job completes.  I assume with the age of these processes,
> they did *not* go away when the jobs completed.
>
> I see that all of your non-root processes are in uninterruptible sleep
> (D).  Can you tell what they are waiting on?  Getting a GDB backtrace or
> at minimum the listing of /proc/<pid>/fd would be useful.
>
> Good luck,
> ~Matt
>
> ---
> Matt Ezell
> HPC Systems Administrator
> Oak Ridge National Laboratory
>

Hi Matthew

Thank you for the help!

I am thinking of rolling back to Torque 2.5.X or 2.4.Y,
but I am not even sure it would solve this nasty problem.

Yes, David Beer told me last time that pbs_mom forks into user-owned
processes, as you also said.

Does this happen only with interactive jobs, or with batch jobs too?
And only if the user requests X11 forwarding?

However, I thought this would happen very quickly, lasting only until the
user-owned pbs_moms launched the actual user processes.
Under normal conditions I never had a chance to catch these user-owned
pbs_moms with "ps".
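
If it helps anybody reproduce this, a crude way to catch them in the act
is a watch loop like the one below, run on a node while such a job starts
(just a sketch; nothing is assumed beyond the pbs_mom process name):

# List any pbs_mom not owned by root, once per second
while sleep 1; do
    ps -C pbs_mom -o user,pid,stat,lstart,args --no-headers | awk '$1 != "root"'
done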

Nevertheless, in the cases I reported the user-owned pbs_mom
processes stick around for some reason: they don't go away after the job
ends or dies, and they seem to break subsequent jobs (which continue to be
scheduled onto the affected node).
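
Since they are in uninterruptible sleep, attaching a debugger may hang,
but reading the kernel wchan field should be safe, and /proc/<pid>/stack
should work too if the kernel is recent enough (2.6.29+, if I recall).
For example, on node30:

# Show the kernel function each pbs_mom is sleeping in
ps -C pbs_mom -o pid,user,stat,wchan:32

# Kernel-side stack of one of the stuck processes
cat /proc/29185/stack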

I enclose below the listing of /proc/<pbs_mom_procid>/fd for the
two affected nodes.
I took those nodes offline, but eventually I will need to reboot them
and put them back into production.
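
Before rebooting them I can also try to collect the GDB backtrace you
asked for. Since gdb itself may hang when attaching to a D-state process,
I suppose something along these lines is the safe way to try it
(untested here):

# Batch-mode backtrace; give up after 30s in case the attach blocks
timeout 30 gdb -p 29185 -batch -ex 'thread apply all bt' -ex detach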

The last process listed on each node is the root-owned pbs_mom, and it
seems to still have the cpuset file of job 324 or 537, respectively, open.
Those are the jobs the user was running when he also came to own the
pbs_mom processes, on Oct/23 and Nov/06 respectively.
Both jobs' stderr and stdout got stuck in the nodes' spool directories
and were never copied back to the user's work directory.
Job 324 exceeded its walltime (12h).
Job 537 seems to have stopped very quickly (2 seconds of cputime),
and its stderr is empty.
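
In case the stale cpusets matter, this is how I plan to check them
(using the paths that show up in the fd listings below; the rmdir is only
a guess at a manual cleanup, and I don't know how it interacts with the
fd that pbs_mom still holds open):

# Does the job cpuset still exist, and does it still list any tasks?
ls -ld /dev/cpuset/torque/324.master
cat /dev/cpuset/torque/324.master/tasks

# If the tasks file is empty, the cpuset should be removable:
# rmdir /dev/cpuset/torque/324.master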

[root@node30 ~]# ls -l /opt/torque/active/spool/324.master.ER
-rw------- 1 ltmurray ltmurray 3026 Oct 23 02:55 /opt/torque/active/spool/324.master.ER

[root@node30 ~]# tail -n 2 /opt/torque/active/spool/324.master.ER
% Compiled module: STRREPL.
=>> PBS: job killed: walltime 43204 exceeded limit 43200


[root@node32 ~]# ls -l /opt/torque/active/spool/537.master.ER
-rw------- 1 ltmurray ltmurray 0 Nov  6 12:49 /opt/torque/active/spool/537.master.ER

The tracejob output for both jobs is enclosed at the end.

This user sometimes runs interactive jobs, probably exporting X
to get the IDL or Matlab GUI.
However, these two jobs (324 and 537) seem to be batch jobs [and the
executable is parallelized with OpenMP (not MPI),
but this probably doesn't matter].
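
To double-check, I can grep the mom logs for those two jobs and see
whether any X11/interactive setup was attempted (these are the same log
files that show up in the fd listings below):

grep 324.master /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131023 | head -n 20
grep 537.master /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131106 | head -n 20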

Thank you for your help,
Gus Correa

*******************************node30**********************************
[root@node30 ~]# ps aux |grep pbs_mom
root     17941  0.0  0.0  88160 34780 ?        SLsl Oct17  19:03 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
ltmurray 29185  0.0  0.0  88156 26024 ?        D    Oct23   0:00 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
ltmurray 29187  0.0  0.0  88156 26024 ?        D    Oct23   0:00 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
root     30099  0.0  0.0 103236   892 pts/0    S+   11:39   0:00 grep pbs_mom

[root@node30 ~]# ll /proc/29185/fd
total 0
lr-x------ 1 root root 64 Nov 11 11:38 0 -> /dev/null
l-wx------ 1 root root 64 Nov 11 11:38 1 -> /dev/null
l-wx------ 1 root root 64 Nov 11 11:38 2 -> /dev/null
lr-x------ 1 root root 64 Nov 11 11:38 7 -> /
lrwx------ 1 root root 64 Nov 11 11:38 9 -> socket:[122520]

[root@node30 ~]# ll /proc/29187/fd
total 0
lr-x------ 1 root root 64 Nov 11 11:39 0 -> /dev/null
l-wx------ 1 root root 64 Nov 11 11:39 1 -> /dev/null
lrwx------ 1 root root 64 Nov 11 11:39 10 -> socket:[122524]
l-wx------ 1 root root 64 Nov 11 11:39 2 -> /dev/null
lr-x------ 1 root root 64 Nov 11 11:39 7 -> /

[root@node30 ~]# ll /proc/17941/fd
total 0
lr-x------ 1 root root 64 Nov 11 11:52 0 -> /dev/null
l-wx------ 1 root root 64 Nov 11 11:52 1 -> /dev/null
lrwx------ 1 root root 64 Nov 11 11:52 10 -> socket:[122524]
lr-x------ 1 root root 64 Nov 11 11:52 11 -> /dev/cpuset/torque/324.master
l-wx------ 1 root root 64 Nov 11 11:52 2 -> /dev/null
l-wx------ 1 root root 64 Nov 11 11:52 3 -> /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131023
l-wx------ 1 root root 64 Nov 11 11:52 4 -> /opt/torque/4.2.5/gnu-4.4.7/mom_priv/mom.lock
lrwx------ 1 root root 64 Nov 11 11:52 5 -> socket:[81628]
lrwx------ 1 root root 64 Nov 11 11:52 6 -> socket:[81629]
lr-x------ 1 root root 64 Nov 11 11:52 7 -> /
lr-x------ 1 root root 64 Nov 11 11:52 8 -> /proc
lrwx------ 1 root root 64 Nov 11 11:52 9 -> socket:[122546]

***********************************************************************

*******************************node32**********************************
[root@node32 ~]# ps aux |grep pbs_mom
ltmurray  2932  0.0  0.0  87932 25804 ?        D    Nov06   0:00 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
ltmurray  2933  0.0  0.0  87932 25804 ?        D    Nov06   0:00 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
root      3267  0.0  0.0 103236   892 pts/0    S+   11:59   0:00 grep pbs_mom
root     20286  0.3  0.0  87936 34556 ?        SLsl Oct17 109:36 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
[root@node32 ~]# ll /proc/2932/fd
total 0
lr-x------ 1 root root 64 Nov 11 11:58 0 -> /dev/null
l-wx------ 1 root root 64 Nov 11 11:58 1 -> /dev/null
l-wx------ 1 root root 64 Nov 11 11:58 2 -> /dev/null
lr-x------ 1 root root 64 Nov 11 11:58 7 -> /
lrwx------ 1 root root 64 Nov 11 11:58 9 -> socket:[305589]
[root@node32 ~]# ll /proc/2933/fd
total 0
lr-x------ 1 root root 64 Nov 11 11:58 0 -> /dev/null
l-wx------ 1 root root 64 Nov 11 11:58 1 -> /dev/null
lrwx------ 1 root root 64 Nov 11 11:58 10 -> socket:[305590]
l-wx------ 1 root root 64 Nov 11 11:58 2 -> /dev/null
lr-x------ 1 root root 64 Nov 11 11:58 7 -> /
[root@node32 ~]# ll /proc/20286/fd
total 0
lr-x------ 1 root root 64 Nov 11 11:58 0 -> /dev/null
l-wx------ 1 root root 64 Nov 11 11:58 1 -> /dev/null
lrwx------ 1 root root 64 Nov 11 11:58 10 -> socket:[305590]
lrwx------ 1 root root 64 Nov 11 11:58 11 -> socket:[305612]
l-wx------ 1 root root 64 Nov 11 11:58 2 -> /dev/null
l-wx------ 1 root root 64 Nov 11 11:58 3 -> /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131106
l-wx------ 1 root root 64 Nov 11 11:58 4 -> /opt/torque/4.2.5/gnu-4.4.7/mom_priv/mom.lock
lrwx------ 1 root root 64 Nov 11 11:58 5 -> socket:[92105]
lrwx------ 1 root root 64 Nov 11 11:58 6 -> socket:[92106]
lr-x------ 1 root root 64 Nov 11 11:58 7 -> /
lr-x------ 1 root root 64 Nov 11 11:58 8 -> /proc
lr-x------ 1 root root 64 Nov 11 11:58 9 -> /dev/cpuset/torque/537.master
***********************************************************************

****************************tracejob 324********************************
Job: 324.master

10/22/2013 10:48:41  S    enqueuing into production, state 1 hop 1
10/22/2013 10:48:41  A    queue=production
10/22/2013 14:55:35  S    Dependency on job 323.master released.
10/22/2013 14:55:36  S    Job Run at request of maui@master
10/22/2013 14:55:36  A    user=ltmurray group=ltmurray jobname=GC.Base.1984.03 queue=production ctime=1382453321 qtime=1382453321 etime=1382468135 start=1382468136 owner=ltmurray@master
                          exec_host=node30/0+node30/1+node30/2+node30/3+node30/4+node30/5+node30/6+node30/7+node30/8+node30/9+node30/10+node30/11+node30/12+node30/13+node30/14+node30/15+node30/16+node30/17+node30/18+node30/19+node30/20+node30/21+node30/22+node30/23
                          Resource_List.neednodes=1:ppn=24 Resource_List.nodect=1 Resource_List.nodes=1:ppn=24 Resource_List.walltime=12:00:00
10/23/2013 02:55:46  S    Exit_status=-11 resources_used.cput=66:14:47 resources_used.mem=4800720kb resources_used.vmem=6672108kb resources_used.walltime=12:00:10
10/23/2013 02:55:46  S    on_job_exit valid pjob: 324.master (substate=50)
10/23/2013 02:55:46  A    user=ltmurray group=ltmurray jobname=GC.Base.1984.03 queue=production ctime=1382453321 qtime=1382453321 etime=1382468135 start=1382468136 owner=ltmurray@master
                          exec_host=node30/0+node30/1+node30/2+node30/3+node30/4+node30/5+node30/6+node30/7+node30/8+node30/9+node30/10+node30/11+node30/12+node30/13+node30/14+node30/15+node30/16+node30/17+node30/18+node30/19+node30/20+node30/21+node30/22+node30/23
                          Resource_List.neednodes=1:ppn=24 Resource_List.nodect=1 Resource_List.nodes=1:ppn=24 Resource_List.walltime=12:00:00 session=28077 end=1382511346 Exit_status=-11 resources_used.cput=66:14:47 resources_used.mem=4800720kb resources_used.vmem=6672108kb resources_used.walltime=12:00:10
10/23/2013 02:56:13  S    on_job_exit valid pjob: 324.master (substate=52)
10/23/2013 03:05:39  S    Request invalid for state of job EXITING
10/23/2013 03:05:43  S    Request invalid for state of job EXITING
10/23/2013 07:05:52  S    dequeuing from production, state COMPLETE

************************************************************************


*******************************tracejob 537****************************
Job: 537.master

11/06/2013 12:48:58  S    enqueuing into production, state 1 hop 1
11/06/2013 12:48:58  A    queue=production
11/06/2013 12:49:00  S    Job Run at request of maui@master
11/06/2013 12:49:00  A    user=ltmurray group=ltmurray jobname=GC.AQAST.2005.01.01 queue=production ctime=1383760138 qtime=1383760138 etime=1383760138 start=1383760140 owner=ltmurray@master
                          exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+node32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/14+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+node32/22+node32/23
                          Resource_List.neednodes=1:ppn=24 Resource_List.nodect=1 Resource_List.nodes=1:ppn=24 Resource_List.walltime=12:00:00
11/06/2013 13:09:21  S    Job deleted at request of ltmurray@master
11/06/2013 13:09:21  S    Job sent signal SIGTERM on delete
11/06/2013 13:09:21  A    requestor=ltmurray@master
11/06/2013 13:11:06  S    Job sent signal SIGKILL on delete
11/06/2013 13:11:06  S    Exit_status=271 resources_used.cput=00:00:02 resources_used.mem=7480172kb resources_used.vmem=9157780kb resources_used.walltime=00:22:06
11/06/2013 13:11:06  S    on_job_exit valid pjob: 537.master (substate=50)
11/06/2013 13:11:06  A    user=ltmurray group=ltmurray jobname=GC.AQAST.2005.01.01 queue=production ctime=1383760138 qtime=1383760138 etime=1383760138 start=1383760140 owner=ltmurray@master
                          exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+node32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/14+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+node32/22+node32/23
                          Resource_List.neednodes=1:ppn=24 Resource_List.nodect=1 Resource_List.nodes=1:ppn=24 Resource_List.walltime=12:00:00 session=2857 end=1383761466 Exit_status=271 resources_used.cput=00:00:02 resources_used.mem=7480172kb resources_used.vmem=9157780kb resources_used.walltime=00:22:06
11/06/2013 13:11:29  S    on_job_exit valid pjob: 537.master (substate=52)
11/06/2013 17:21:09  S    dequeuing from production, state COMPLETE
***********************************************************************
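
For what it's worth, if I remember the PBS convention correctly,
Exit_status values above 256 encode "killed by signal", i.e.
Exit_status - 256 = signal number, which here matches the SIGTERM sent
on delete:

echo $((271 - 256))   # prints 15
kill -l 15            # prints TERM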



> On 11/8/13 3:12 PM, "Gus Correa" <gus at ldeo.columbia.edu> wrote:
>
>> Dear Torque experts
>>
>> I am using Torque 4.2.5 and Maui 3.3.1 on a 34-node cluster.
>>
>> I reported a similar problem a few weeks ago.
>> It disappeared after the nodes were rebooted, but it is back.
>>
>> Some nodes seem to "develop" or create *multiple pbs_mom processes*
>> after they run for a while.
>> Those are nodes that are used more heavily
>> (higher node numbers).
>> See the printout below, especially nodes 30-32.
>>
>> The duplicate pbs_mom processes are owned by a regular user, which is
>> rather awkward.
>>
>> As a result, multi-node jobs that land on the affected nodes fail to run.
>> I enclose below a tracejob sample of one of those jobs.
>>
>> What could be causing this problem, and how can it be solved?
>>
>> Thank you,
>> Gus Correa

