[torqueusers] Multiple mom processes, owned by regular user (Torque 4.2.5, Maui 3.3.1)

Gus Correa gus at ldeo.columbia.edu
Mon Nov 11 13:02:38 MST 2013


PS - In case it may shed some light, I enclose below
the gdb backtrace of the **root-owned** pbs_mom process
on nodes 30 and 32.

Although delete_cpuset was called,
the cpuset directories of those two problematic
jobs were somehow not deleted. See the listing below, please.

Thank you again,
Gus Correa

**************cpuset directories not deleted*********************

[root at node30 ~]# ll -d  /dev/cpuset/torque/324.master/
drwxr-xr-x 2 root root 0 Oct 22 14:55 /dev/cpuset/torque/324.master/

[root at node30 ~]# ls  /dev/cpuset/torque/324.master/
cgroup.event_control  mem_hardwall        mems
cgroup.procs          memory_migrate      notify_on_release
cpu_exclusive         memory_pressure     sched_load_balance
cpus                  memory_spread_page  sched_relax_domain_level
mem_exclusive         memory_spread_slab  tasks

***

[root at node32 ~]# ll -d /dev/cpuset/torque/537.master/
drwxr-xr-x 2 root root 0 Nov  6 12:49 /dev/cpuset/torque/537.master/

[root at node32 ~]# ls /dev/cpuset/torque/537.master/
cgroup.event_control  mem_hardwall        mems
cgroup.procs          memory_migrate      notify_on_release
cpu_exclusive         memory_pressure     sched_load_balance
cpus                  memory_spread_page  sched_relax_domain_level
mem_exclusive         memory_spread_slab  tasks


**********************************************************************
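
If these stale cpusets eventually need to be cleaned up by hand, my
understanding is that a cpuset directory cannot be removed with a plain
"rm -rf"; it has to be empty of tasks and then removed with rmdir.
A minimal sketch of what I would try on node30 (same idea on node32,
with 537.master):

cat /dev/cpuset/torque/324.master/tasks   # should print nothing if no
                                          # processes are still attached
rmdir /dev/cpuset/torque/324.master       # only works once "tasks" is empty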


******************gdb backtrace pbs_mom, node30 **********

[root at node30 ~]# gdb --batch -x /tmp/gdb_backtrace_command 
/opt/torque/active/sbin/pbs_mom 17941
[New LWP 17944]
[New LWP 17943]
[Thread debugging using libthread_db enabled]
0x00002ac6cda34b8d in nanosleep () from /lib64/libc.so.6
#0  0x00002ac6cda34b8d in nanosleep () from /lib64/libc.so.6
#1  0x00002ac6cda34a00 in sleep () from /lib64/libc.so.6
#2  0x00000000004573ec in delete_cpuset (name=0x32748f0 "324.master") at 
../../../../src/resmom/linux/cpuset.c:894
#3  0x00000000004475af in delete_job_files (vp=0x32748f0) at 
../../../src/resmom/mom_job_func.c:682
#4  0x0000000000447b38 in mom_job_purge (pjob=0x3272ff0) at 
../../../src/resmom/mom_job_func.c:859
#5  0x000000000040b57d in mom_deljob (pjob=0x3272ff0) at 
../../../src/resmom/catch_child.c:1741
#6  0x000000000042fe23 in req_deletejob (preq=0x3279430) at 
../../../src/resmom/requests.c:1134
#7  0x0000000000448392 in mom_dispatch_request (sfds=9, 
request=0x3279430) at ../../../src/resmom/mom_process_request.c:363
#8  0x000000000044827b in mom_process_request (sock_num=0x7ffff18a1e60) 
at ../../../src/resmom/mom_process_request.c:293
#9  0x00002ac6cb998a22 in wait_request (waittime=1, SState=0x0) at 
../../../../src/lib/Libpbs/../Libnet/net_server.c:678
#10 0x00000000004286e7 in main_loop () at 
../../../src/resmom/mom_main.c:9134
#11 0x0000000000428974 in main (argc=4, argv=0x7ffff18a2058) at 
../../../src/resmom/mom_main.c:9582


******************gdb backtrace pbs_mom, node 32 **********

[root at node32 ~]# gdb --batch -x /tmp/gdb_backtrace_command 
/opt/torque/active/sbin/pbs_mom 20286
[New LWP 20289]
[New LWP 20288]
[Thread debugging using libthread_db enabled]
0x00002b8b3de8cb8d in nanosleep () from /lib64/libc.so.6
#0  0x00002b8b3de8cb8d in nanosleep () from /lib64/libc.so.6
#1  0x00002b8b3de8ca00 in sleep () from /lib64/libc.so.6
#2  0x00000000004573ec in delete_cpuset (name=0x208fd60 "537.master") at 
../../../../src/resmom/linux/cpuset.c:894
#3  0x00000000004475af in delete_job_files (vp=0x208fd60) at 
../../../src/resmom/mom_job_func.c:682
#4  0x0000000000447b38 in mom_job_purge (pjob=0x20d2480) at 
../../../src/resmom/mom_job_func.c:859
#5  0x000000000040b57d in mom_deljob (pjob=0x20d2480) at 
../../../src/resmom/catch_child.c:1741
#6  0x000000000042fe23 in req_deletejob (preq=0x20d40e0) at 
../../../src/resmom/requests.c:1134
#7  0x0000000000448392 in mom_dispatch_request (sfds=11, 
request=0x20d40e0) at ../../../src/resmom/mom_process_request.c:363
#8  0x000000000044827b in mom_process_request (sock_num=0x7fff7d472970) 
at ../../../src/resmom/mom_process_request.c:293
#9  0x00002b8b3bdf0a22 in wait_request (waittime=1, SState=0x0) at 
../../../../src/lib/Libpbs/../Libnet/net_server.c:678
#10 0x00000000004286e7 in main_loop () at 
../../../src/resmom/mom_main.c:9134
#11 0x0000000000428974 in main (argc=4, argv=0x7fff7d472b68) at 
../../../src/resmom/mom_main.c:9582
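
Both backtraces stop inside delete_cpuset, apparently sleeping and
retrying. If per-thread detail would help, I could also run gdb with
"thread apply all bt" against the same root-owned moms, along these
lines:

echo "thread apply all bt" > /tmp/gdb_backtrace_all
gdb --batch -x /tmp/gdb_backtrace_all /opt/torque/active/sbin/pbs_mom 17941   # node30
gdb --batch -x /tmp/gdb_backtrace_all /opt/torque/active/sbin/pbs_mom 20286   # node32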

On 11/11/2013 02:05 PM, Gus Correa wrote:
> Hi Matthew
>
> Somehow on node30 "lsof -i | egrep 122520\|122524" (or even "lsof -i")
> hangs forever (though it works fine on other nodes).
> However, node30's eth0 interface (which pbs_mom uses) is up, and I can
> ping across it, ssh over it, etc.
>
>
> [root at node30 ~]# lsof -i | egrep 122520\|122524
>
> No response.
>
> And from another ssh session:
>
> [root at node30 ~]# ps aux |grep lsof
> root     30138  0.0  0.0 105504  1144 pts/0    D+   13:14   0:00 lsof -i
>
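> For what it is worth, for these D-state processes I could also look at
> /proc/<pid>/wchan (and /proc/<pid>/stack, if this kernel exposes it)
> to see which kernel function they are blocked in, e.g. for the hung
> lsof and one of the stuck user-owned pbs_moms:
>
> cat /proc/30138/wchan; echo     # wchan prints no trailing newline
> cat /proc/29185/stack           # full kernel stack, if available
>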
> Likewise, the gdb backtrace also hangs:
>
> [root at node30 ~]# gdb --batch -x /tmp/gdb_backtrace_command
> /opt/torque/active/sbin/pbs_mom 29185
>
> No response.
>
> And from another ssh session:
>
> [root at node30 ~]# cat /tmp/gdb_backtrace_command
> bt
> [root at node30 ~]# ps aux |grep gdb
> root     30176  0.0  0.0 133620  9136 pts/1    S+   13:25   0:00 gdb
> --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom 29185
>
>
> ******************************
>
> Now a similar routine on node32, adjusting the socket and pbs_mom
> process numbers.
>
> Socket 305590 was also shared by the root-owned pbs_mom (pid 20286)
> and one of the user-owned pbs_moms (pid 2933) on node32; the other
> user-owned pbs_mom (pid 2932) holds socket 305589.
>
> [root at node32 ~]# lsof -i | egrep 305589\|305590
>
> Empty output.
>
> However ...
>
> [root at node32 ~]# lsof -i | egrep pbs_mom
> pbs_mom   20286   root    5u  IPv4  92105      0t0  TCP *:15002 (LISTEN)
> pbs_mom   20286   root    6u  IPv4  92106      0t0  TCP *:15003 (LISTEN)
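>
> Those inode numbers may simply not belong to IP sockets, which would
> explain why "lsof -i" doesn't list them. Assuming the sockets are
> still open somewhere, grepping the inode columns under /proc/net
> should tell:
>
> egrep '305589|305590' /proc/net/tcp /proc/net/unix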
>
>
> The gdb backtrace seems to hang on node32 also:
>
> [root at node32 ~]# echo bt>  /tmp/gdb_backtrace_command
> [root at node32 ~]# gdb --batch -x /tmp/gdb_backtrace_command
> /opt/torque/active/sbin/pbs_mom 2932
>
> No response.
>
> [root at node32 ~]# gdb --batch -x /tmp/gdb_backtrace_command
> /opt/torque/active/sbin/pbs_mom 2933
>
> No response.
>
> [root at node32 ~]# ps aux |grep gdb
> root      3400  0.0  0.0 133616  9132 pts/0    S+   13:47   0:00 gdb
> --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom 2932
> root      3434  0.3  0.0 133616  9136 pts/2    S+   13:57   0:00 gdb
> --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom 2933
>
> I didn't explicitly configure Torque with "--with-debug" (unless it is
> enabled by default), so I wonder whether we can expect anything useful
> from gdb.
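>
> I suppose running "file" on the binary would at least tell whether it
> was stripped; if it reports "not stripped", gdb should still show
> function names even without --with-debug:
>
> file /opt/torque/active/sbin/pbs_mom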
>
> Thank you again for all the help.
> Gus Correa
>
>
>
> On 11/11/2013 12:45 PM, Ezell, Matthew A. wrote:
>> We can see that the second pbs_mom shares a socket with the master.
>>
>> Hopefully you have GDB available on the compute nodes.  Can you show me
>> the output of the following commands on node30:
>>
>> lsof -i | egrep 122520\|122524
>> echo bt>   /tmp/gdb_backtrace_command
>> gdb --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom
>> 29185
>> gdb --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom
>> 29187
>>
>> gdb --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom
>> 17941
>>
>>
>> Thanks,
>> ~Matt
>>
>> ---
>> Matt Ezell
>> HPC Systems Administrator
>> Oak Ridge National Laboratory
>>
>>
>>
>>
>> On 11/11/13 12:28 PM, "Gus Correa"<gus at ldeo.columbia.edu>   wrote:
>>
>>> On 11/11/2013 10:56 AM, Ezell, Matthew A. wrote:
>>>> Hi Gus-
>>>>
>>>> I've seen pbs_mom processes owned by users when interactive jobs are
>>>> running. The root pbs_mom has to fork() and setuid() to the user to
>>>> start up an X11 listener (if requested).  In my case they always seem
>>>> to go away when the job completes.  I assume, given the age of these
>>>> processes, that they did *not* go away when the jobs completed.
>>>>
>>>> I see that all of your non-root processes are in uninterruptible sleep
>>>> (D).  Can you tell what they are waiting on?  Getting a GDB backtrace or
>>>> at minimum the listing of /proc/<pid>/fd would be useful.
>>>>
>>>> Good luck,
>>>> ~Matt
>>>>
>>>> ---
>>>> Matt Ezell
>>>> HPC Systems Administrator
>>>> Oak Ridge National Laboratory
>>>>
>>>
>>> Hi Matthew
>>>
>>> Thank you for the help!
>>>
>>> I am thinking of rolling back to Torque 2.5.X or 2.4.Y,
>>> but I am not even sure it would solve this nasty problem.
>>>
>>> Yes, David Beer told me last time that pbs_mom forks into user-owned
>>> processes, as you also said.
>>>
>>> Does this happen only on interactive jobs or also on batch jobs?
>>> Only if the user requests X11?
>>>
>>> However, I thought this would happen very fast, just until the
>>> user-owned pbs_moms launched the actual user processes.
>>> Under normal conditions I never had a chance to see these user-owned
>>> pbs_moms with "ps".
>>>
>>> Nevertheless, in the cases I reported the user-owned pbs_mom
>>> processes stick around for some reason, don't go away after the job
>>> ends or dies, and seem to break subsequent jobs (which continue to be
>>> scheduled onto the affected node).
>>>
>>> I enclose below the listing of /proc/<pbs_mom_procid>/fd for the
>>> two affected nodes.
>>> I took those nodes offline, but eventually I will need to reboot them
>>> and put them back in production.
>>>
>>> The last process listed is the root-owned pbs_mom, and it still seems
>>> to have the cpuset directories of jobs 324 and 537 open, respectively.
>>> Those are the jobs the user was running while he also owned the
>>> pbs_mom processes, on Oct/23 and Nov/06 respectively.
>>> Both jobs' stderr and stdout remained in the nodes' spool directories
>>> and were not moved to the user's work directory.
>>> Job 324 exceeded its walltime limit (12h).
>>> Job 537 seems to have stopped very quickly (2 seconds of cputime),
>>> and its stderr is empty.
>>>
>>> [root at node30 ~]# ls -l /opt/torque/active/spool/324.master.ER
>>> -rw------- 1 ltmurray ltmurray 3026 Oct 23 02:55
>>> /opt/torque/active/spool/324.master.ER
>>>
>>> [root at node30 ~]# tail -n 2 /opt/torque/active/spool/324.master.ER
>>> % Compiled module: STRREPL.
>>> =>>   PBS: job killed: walltime 43204 exceeded limit 43200
>>>
>>>
>>> [root at node32 ~]# ls -l /opt/torque/active/spool/537.master.ER
>>> -rw------- 1 ltmurray ltmurray 0 Nov  6 12:49
>>> /opt/torque/active/spool/537.master.ER
>>>
>>> Their tracejob is enclosed.
>>>
>>> This user sometimes runs interactive jobs, probably exporting X
>>> to get the IDL or Matlab GUI.
>>> However, these two jobs (324 and 537) seem to be batch jobs [and the
>>> executable is parallelized with OpenMP (not MPI),
>>> but this probably doesn't matter].
>>>
>>> Thank you for your help,
>>> Gus Correa
>>>
>>> *******************************node30**********************************
>>> [root at node30 ~]# ps aux |grep pbs_mom
>>> root     17941  0.0  0.0  88160 34780 ?        SLsl Oct17  19:03
>>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>>> ltmurray 29185  0.0  0.0  88156 26024 ?        D    Oct23   0:00
>>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>>> ltmurray 29187  0.0  0.0  88156 26024 ?        D    Oct23   0:00
>>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>>> root     30099  0.0  0.0 103236   892 pts/0    S+   11:39   0:00 grep
>>> pbs_mom
>>>
>>> [root at node30 ~]# ll /proc/29185/fd
>>> total 0
>>> lr-x------ 1 root root 64 Nov 11 11:38 0 ->   /dev/null
>>> l-wx------ 1 root root 64 Nov 11 11:38 1 ->   /dev/null
>>> l-wx------ 1 root root 64 Nov 11 11:38 2 ->   /dev/null
>>> lr-x------ 1 root root 64 Nov 11 11:38 7 ->   /
>>> lrwx------ 1 root root 64 Nov 11 11:38 9 ->   socket:[122520]
>>>
>>> [root at node30 ~]# ll /proc/29187/fd
>>> total 0
>>> lr-x------ 1 root root 64 Nov 11 11:39 0 ->   /dev/null
>>> l-wx------ 1 root root 64 Nov 11 11:39 1 ->   /dev/null
>>> lrwx------ 1 root root 64 Nov 11 11:39 10 ->   socket:[122524]
>>> l-wx------ 1 root root 64 Nov 11 11:39 2 ->   /dev/null
>>> lr-x------ 1 root root 64 Nov 11 11:39 7 ->   /
>>>
>>> [root at node30 ~]# ll /proc/17941/fd
>>> total 0
>>> lr-x------ 1 root root 64 Nov 11 11:52 0 ->   /dev/null
>>> l-wx------ 1 root root 64 Nov 11 11:52 1 ->   /dev/null
>>> lrwx------ 1 root root 64 Nov 11 11:52 10 ->   socket:[122524]
>>> lr-x------ 1 root root 64 Nov 11 11:52 11 ->   /dev/cpuset/torque/324.master
>>> l-wx------ 1 root root 64 Nov 11 11:52 2 ->   /dev/null
>>> l-wx------ 1 root root 64 Nov 11 11:52 3 ->
>>> /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131023
>>> l-wx------ 1 root root 64 Nov 11 11:52 4 ->
>>> /opt/torque/4.2.5/gnu-4.4.7/mom_priv/mom.lock
>>> lrwx------ 1 root root 64 Nov 11 11:52 5 ->   socket:[81628]
>>> lrwx------ 1 root root 64 Nov 11 11:52 6 ->   socket:[81629]
>>> lr-x------ 1 root root 64 Nov 11 11:52 7 ->   /
>>> lr-x------ 1 root root 64 Nov 11 11:52 8 ->   /proc
>>> lrwx------ 1 root root 64 Nov 11 11:52 9 ->   socket:[122546]
>>>
>>> ***********************************************************************
>>>
>>> *******************************node32**********************************
>>> [root at node32 ~]# ps aux |grep pbs_mom
>>> ltmurray  2932  0.0  0.0  87932 25804 ?        D    Nov06   0:00
>>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>>> ltmurray  2933  0.0  0.0  87932 25804 ?        D    Nov06   0:00
>>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>>> root      3267  0.0  0.0 103236   892 pts/0    S+   11:59   0:00 grep
>>> pbs_mom
>>> root     20286  0.3  0.0  87936 34556 ?        SLsl Oct17 109:36
>>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>>> [root at node32 ~]# ll /proc/2932/fd
>>> total 0
>>> lr-x------ 1 root root 64 Nov 11 11:58 0 ->   /dev/null
>>> l-wx------ 1 root root 64 Nov 11 11:58 1 ->   /dev/null
>>> l-wx------ 1 root root 64 Nov 11 11:58 2 ->   /dev/null
>>> lr-x------ 1 root root 64 Nov 11 11:58 7 ->   /
>>> lrwx------ 1 root root 64 Nov 11 11:58 9 ->   socket:[305589]
>>> [root at node32 ~]# ll /proc/2933/fd
>>> total 0
>>> lr-x------ 1 root root 64 Nov 11 11:58 0 ->   /dev/null
>>> l-wx------ 1 root root 64 Nov 11 11:58 1 ->   /dev/null
>>> lrwx------ 1 root root 64 Nov 11 11:58 10 ->   socket:[305590]
>>> l-wx------ 1 root root 64 Nov 11 11:58 2 ->   /dev/null
>>> lr-x------ 1 root root 64 Nov 11 11:58 7 ->   /
>>> [root at node32 ~]# ll /proc/20286/fd
>>> total 0
>>> lr-x------ 1 root root 64 Nov 11 11:58 0 ->   /dev/null
>>> l-wx------ 1 root root 64 Nov 11 11:58 1 ->   /dev/null
>>> lrwx------ 1 root root 64 Nov 11 11:58 10 ->   socket:[305590]
>>> lrwx------ 1 root root 64 Nov 11 11:58 11 ->   socket:[305612]
>>> l-wx------ 1 root root 64 Nov 11 11:58 2 ->   /dev/null
>>> l-wx------ 1 root root 64 Nov 11 11:58 3 ->
>>> /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131106
>>> l-wx------ 1 root root 64 Nov 11 11:58 4 ->
>>> /opt/torque/4.2.5/gnu-4.4.7/mom_priv/mom.lock
>>> lrwx------ 1 root root 64 Nov 11 11:58 5 ->   socket:[92105]
>>> lrwx------ 1 root root 64 Nov 11 11:58 6 ->   socket:[92106]
>>> lr-x------ 1 root root 64 Nov 11 11:58 7 ->   /
>>> lr-x------ 1 root root 64 Nov 11 11:58 8 ->   /proc
>>> lr-x------ 1 root root 64 Nov 11 11:58 9 ->   /dev/cpuset/torque/537.master
>>> ***********************************************************************
>>>
>>> ****************************tracejob 324********************************
>>> Job: 324.master
>>>
>>> 10/22/2013 10:48:41  S    enqueuing into production, state 1 hop 1
>>> 10/22/2013 10:48:41  A    queue=production
>>> 10/22/2013 14:55:35  S    Dependency on job 323.master released.
>>> 10/22/2013 14:55:36  S    Job Run at request of maui at master
>>> 10/22/2013 14:55:36  A    user=ltmurray group=ltmurray
>>> jobname=GC.Base.1984.03 queue=production ctime=1382453321
>>> qtime=1382453321 etime=1382468135 start=1382468136 owner=ltmurray at master
>>>
>>> exec_host=node30/0+node30/1+node30/2+node30/3+node30/4+node30/5+node30/6+n
>>> ode30/7+node30/8+node30/9+node30/10+node30/11+node30/12+node30/13+node30/1
>>> 4+node30/15+node30/16+node30/17+node30/18+node30/19+node30/20+node30/21+no
>>> de30/22+node30/23
>>>                             Resource_List.neednodes=1:ppn=24
>>> Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>>> Resource_List.walltime=12:00:00
>>> 10/23/2013 02:55:46  S    Exit_status=-11 resources_used.cput=66:14:47
>>> resources_used.mem=4800720kb resources_used.vmem=6672108kb
>>> resources_used.walltime=12:00:10
>>> 10/23/2013 02:55:46  S    on_job_exit valid pjob: 324.master (substate=50)
>>> 10/23/2013 02:55:46  A    user=ltmurray group=ltmurray
>>> jobname=GC.Base.1984.03 queue=production ctime=1382453321
>>> qtime=1382453321 etime=1382468135 start=1382468136 owner=ltmurray at master
>>>
>>> exec_host=node30/0+node30/1+node30/2+node30/3+node30/4+node30/5+node30/6+n
>>> ode30/7+node30/8+node30/9+node30/10+node30/11+node30/12+node30/13+node30/1
>>> 4+node30/15+node30/16+node30/17+node30/18+node30/19+node30/20+node30/21+no
>>> de30/22+node30/23
>>>                             Resource_List.neednodes=1:ppn=24
>>> Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>>> Resource_List.walltime=12:00:00 session=28077 end=1382511346
>>> Exit_status=-11 resources_used.cput=66:14:47 resources_used.mem=4800720kb
>>>                             resources_used.vmem=6672108kb
>>> resources_used.walltime=12:00:10
>>> 10/23/2013 02:56:13  S    on_job_exit valid pjob: 324.master (substate=52)
>>> 10/23/2013 03:05:39  S    Request invalid for state of job EXITING
>>> 10/23/2013 03:05:43  S    Request invalid for state of job EXITING
>>> 10/23/2013 07:05:52  S    dequeuing from production, state COMPLETE
>>>
>>> ************************************************************************
>>>
>>>
>>> *******************************tracejob 537****************************
>>> Job: 537.master
>>>
>>> 11/06/2013 12:48:58  S    enqueuing into production, state 1 hop 1
>>> 11/06/2013 12:48:58  A    queue=production
>>> 11/06/2013 12:49:00  S    Job Run at request of maui at master
>>> 11/06/2013 12:49:00  A    user=ltmurray group=ltmurray
>>> jobname=GC.AQAST.2005.01.01 queue=production ctime=1383760138
>>> qtime=1383760138 etime=1383760138 start=1383760140 owner=ltmurray at master
>>>
>>> exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+n
>>> ode32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/1
>>> 4+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+no
>>> de32/22+node32/23
>>>                             Resource_List.neednodes=1:ppn=24
>>> Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>>> Resource_List.walltime=12:00:00
>>> 11/06/2013 13:09:21  S    Job deleted at request of ltmurray at master
>>> 11/06/2013 13:09:21  S    Job sent signal SIGTERM on delete
>>> 11/06/2013 13:09:21  A    requestor=ltmurray at master
>>> 11/06/2013 13:11:06  S    Job sent signal SIGKILL on delete
>>> 11/06/2013 13:11:06  S    Exit_status=271 resources_used.cput=00:00:02
>>> resources_used.mem=7480172kb resources_used.vmem=9157780kb
>>> resources_used.walltime=00:22:06
>>> 11/06/2013 13:11:06  S    on_job_exit valid pjob: 537.master (substate=50)
>>> 11/06/2013 13:11:06  A    user=ltmurray group=ltmurray
>>> jobname=GC.AQAST.2005.01.01 queue=production ctime=1383760138
>>> qtime=1383760138 etime=1383760138 start=1383760140 owner=ltmurray at master
>>>
>>> exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+n
>>> ode32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/1
>>> 4+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+no
>>> de32/22+node32/23
>>>                             Resource_List.neednodes=1:ppn=24
>>> Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>>> Resource_List.walltime=12:00:00 session=2857 end=1383761466
>>> Exit_status=271 resources_used.cput=00:00:02 resources_used.mem=7480172kb
>>>                             resources_used.vmem=9157780kb
>>> resources_used.walltime=00:22:06
>>> 11/06/2013 13:11:29  S    on_job_exit valid pjob: 537.master (substate=52)
>>> 11/06/2013 17:21:09  S    dequeuing from production, state COMPLETE
>>> ***********************************************************************
>>>
>>>
>>>
>>>> On 11/8/13 3:12 PM, "Gus Correa"<gus at ldeo.columbia.edu>    wrote:
>>>>
>>>>> Dear Torque experts
>>>>>
>>>>> I am using Torque 4.2.5 and Maui 3.3.1 on a 34-node cluster.
>>>>>
>>>>> I reported a similar problem a few weeks ago.
>>>>> It disappeared after the nodes were rebooted, but it is back.
>>>>>
>>>>> Some nodes seem to "develop" or create *multiple pbs_mom processes*
>>>>> after they run for a while.
>>>>> Those are nodes that are used more heavily
>>>>> (higher node numbers).
>>>>> See the printout below, especially nodes 30-32.
>>>>>
>>>>> The duplicate pbs_mom processes are owned by a regular user, which is
>>>>> rather awkward.
>>>>>
>>>>> As a result, multi-node jobs that land on the affected nodes fail to run.
>>>>> I enclose below a tracejob sample of one of those jobs.
>>>>>
>>>>> What could be causing this problem, and how can it be solved?
>>>>>
>>>>> Thank you,
>>>>> Gus Correa


