[torqueusers] Multiple mom processes, owned by regular user (Torque 4.2.5, Maui 3.3.1)

Gus Correa gus at ldeo.columbia.edu
Mon Nov 11 15:58:26 MST 2013


PS-2: The job termination by pbs_mom was incomplete in both cases.
The mom_logs show each job (324 and 537) as terminated.
However, at least one process from each job is still hanging around
(PIDs 28390 and 2899, respectively), in addition to the user-owned
pbs_mom processes reported earlier.
A listing is enclosed below.


Thank you,
Gus Correa

node30 ----------------

[root at node30 ~]# tail  /opt/torque/active/mom_logs/20131023
10/23/2013 02:55:46;0008;   pbs_mom.17941;Job;324.master;kill_task: 
killing pid 28390 task 1 with sig 9
10/23/2013 02:55:46;0008;   pbs_mom.17941;Job;324.master;kill_task: not 
killing process (pid=28390/state=Z) with sig 9
10/23/2013 02:55:46;0080; 
pbs_mom.17941;Job;324.master;scan_for_terminated: job 324.master task 1 
terminated, sid=28077
10/23/2013 02:55:46;0008;   pbs_mom.17941;Job;324.master;job was terminated
10/23/2013 02:55:46;0080;   pbs_mom.17941;Svr;preobit_preparation;top
10/23/2013 02:55:46;0080;   pbs_mom.17941;Job;324.master;obit sent to server
10/23/2013 02:57:32;0002;   pbs_mom.17941;Svr;pbs_mom;Torque Mom Version 
= 4.2.5, loglevel = 0
10/23/2013 03:00:46;0080;   pbs_mom.17941;Job;324.master;obit sent to server
10/23/2013 03:00:46;0001;   pbs_mom.17941;Job;324.master;Job already in 
exit state on server. Setting to exited
10/23/2013 03:00:46;0001;   pbs_mom.17941;Req;obit_reply;Job not found 
for obit reply

[root at node30 ~]#  ps -f 28390
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
ltmurray 28390     1  2 Oct22 ?        Zl   692:15 [geos.base] <defunct>



node32 ----------------

[root at node32 ~]# ps -f 2899
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
ltmurray  2899     1  0 Nov06 ?        D      0:03 [geos.AQAST]
[root at node32 ~]# tail /opt/torque/active/mom_logs/20131106
11/06/2013 13:11:05;0008;   pbs_mom.20286;Job;537.master;kill_task: 
process (pid=2899/state=D) after sig 15
11/06/2013 13:11:06;0008;   pbs_mom.20286;Job;537.master;kill_task: 
killing pid 2899 task 1 with sig 9
11/06/2013 13:11:06;0080; 
pbs_mom.20286;Job;537.master;scan_for_terminated: job 537.master task 1 
terminated, sid=2857
11/06/2013 13:11:06;0008;   pbs_mom.20286;Job;537.master;job was terminated
11/06/2013 13:11:06;0080;   pbs_mom.20286;Svr;preobit_preparation;top
11/06/2013 13:11:06;0080;   pbs_mom.20286;Job;537.master;obit sent to server
11/06/2013 13:12:33;0002;   pbs_mom.20286;Svr;pbs_mom;Torque Mom Version 
= 4.2.5, loglevel = 0
11/06/2013 13:16:06;0080;   pbs_mom.20286;Job;537.master;obit sent to server
11/06/2013 13:16:06;0001;   pbs_mom.20286;Job;537.master;Job already in 
exit state on server. Setting to exited
11/06/2013 13:16:06;0001;   pbs_mom.20286;Req;obit_reply;Job not found 
for obit reply

[root at node32 ~]# ps -f 2899
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
ltmurray  2899     1  0 Nov06 ?        D      0:03 [geos.AQAST]
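
For reference, a quick way to scan a node for this kind of leftover
would be something along these lines (just a sketch; adjust the
username to the job owner):

ps -u ltmurray -o pid,ppid,stat,etime,cmd | awk 'NR==1 || $3 ~ /^[ZD]/'

Anything still showing state Z or D after the job has been purged is a
leftover like the ones above.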


On 11/11/2013 03:02 PM, Gus Correa wrote:
> PS - In case it may shed some light, I enclose below
> the gdb backtrace of the **root-owned** pbs_mom process
> on nodes 30 & 32.
>
> Although delete_cpuset was called,
> somehow the cpuset directories of those two problematic
> jobs were not deleted. See listing below, please.
>
> Thank you again,
> Gus Correa
>
> **************cpuset directories not deleted*********************
>
> [root at node30 ~]# ll -d  /dev/cpuset/torque/324.master/
> drwxr-xr-x 2 root root 0 Oct 22 14:55 /dev/cpuset/torque/324.master/
>
> [root at node30 ~]# ls  /dev/cpuset/torque/324.master/
> cgroup.event_control  mem_hardwall        mems
> cgroup.procs          memory_migrate      notify_on_release
> cpu_exclusive         memory_pressure     sched_load_balance
> cpus                  memory_spread_page  sched_relax_domain_level
> mem_exclusive         memory_spread_slab  tasks
>
> ***
>
> [root at node32 ~]# ll -d /dev/cpuset/torque/537.master/
> drwxr-xr-x 2 root root 0 Nov  6 12:49 /dev/cpuset/torque/537.master/
>
> [root at node32 ~]# ls /dev/cpuset/torque/537.master/
> cgroup.event_control  mem_hardwall        mems
> cgroup.procs          memory_migrate      notify_on_release
> cpu_exclusive         memory_pressure     sched_load_balance
> cpus                  memory_spread_page  sched_relax_domain_level
> mem_exclusive         memory_spread_slab  tasks
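>
> Presumably delete_cpuset keeps sleeping because the leftover process
> is still listed in the cpuset's tasks file. If so, once that process
> finally goes away the empty directory should be removable with a plain
> rmdir (the cpuset pseudo-files themselves cannot be deleted with rm);
> a sketch, not yet tried here:
>
> cat /dev/cpuset/torque/537.master/tasks    # must come back empty first
> rmdir /dev/cpuset/torque/537.master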
>
>
> **********************************************************************
>
>
> ******************gdb backtrace pbs_mom, node30 **********
>
> [root at node30 ~]# gdb --batch -x /tmp/gdb_backtrace_command
> /opt/torque/active/sbin/pbs_mom 17941
> [New LWP 17944]
> [New LWP 17943]
> [Thread debugging using libthread_db enabled]
> 0x00002ac6cda34b8d in nanosleep () from /lib64/libc.so.6
> #0  0x00002ac6cda34b8d in nanosleep () from /lib64/libc.so.6
> #1  0x00002ac6cda34a00 in sleep () from /lib64/libc.so.6
> #2  0x00000000004573ec in delete_cpuset (name=0x32748f0 "324.master") at
> ../../../../src/resmom/linux/cpuset.c:894
> #3  0x00000000004475af in delete_job_files (vp=0x32748f0) at
> ../../../src/resmom/mom_job_func.c:682
> #4  0x0000000000447b38 in mom_job_purge (pjob=0x3272ff0) at
> ../../../src/resmom/mom_job_func.c:859
> #5  0x000000000040b57d in mom_deljob (pjob=0x3272ff0) at
> ../../../src/resmom/catch_child.c:1741
> #6  0x000000000042fe23 in req_deletejob (preq=0x3279430) at
> ../../../src/resmom/requests.c:1134
> #7  0x0000000000448392 in mom_dispatch_request (sfds=9,
> request=0x3279430) at ../../../src/resmom/mom_process_request.c:363
> #8  0x000000000044827b in mom_process_request (sock_num=0x7ffff18a1e60)
> at ../../../src/resmom/mom_process_request.c:293
> #9  0x00002ac6cb998a22 in wait_request (waittime=1, SState=0x0) at
> ../../../../src/lib/Libpbs/../Libnet/net_server.c:678
> #10 0x00000000004286e7 in main_loop () at
> ../../../src/resmom/mom_main.c:9134
> #11 0x0000000000428974 in main (argc=4, argv=0x7ffff18a2058) at
> ../../../src/resmom/mom_main.c:9582
>
>
> ******************gdb backtrace pbs_mom, node 32 **********
>
> [root at node32 ~]# gdb --batch -x /tmp/gdb_backtrace_command
> /opt/torque/active/sbin/pbs_mom 20286
> [New LWP 20289]
> [New LWP 20288]
> [Thread debugging using libthread_db enabled]
> 0x00002b8b3de8cb8d in nanosleep () from /lib64/libc.so.6
> #0  0x00002b8b3de8cb8d in nanosleep () from /lib64/libc.so.6
> #1  0x00002b8b3de8ca00 in sleep () from /lib64/libc.so.6
> #2  0x00000000004573ec in delete_cpuset (name=0x208fd60 "537.master") at
> ../../../../src/resmom/linux/cpuset.c:894
> #3  0x00000000004475af in delete_job_files (vp=0x208fd60) at
> ../../../src/resmom/mom_job_func.c:682
> #4  0x0000000000447b38 in mom_job_purge (pjob=0x20d2480) at
> ../../../src/resmom/mom_job_func.c:859
> #5  0x000000000040b57d in mom_deljob (pjob=0x20d2480) at
> ../../../src/resmom/catch_child.c:1741
> #6  0x000000000042fe23 in req_deletejob (preq=0x20d40e0) at
> ../../../src/resmom/requests.c:1134
> #7  0x0000000000448392 in mom_dispatch_request (sfds=11,
> request=0x20d40e0) at ../../../src/resmom/mom_process_request.c:363
> #8  0x000000000044827b in mom_process_request (sock_num=0x7fff7d472970)
> at ../../../src/resmom/mom_process_request.c:293
> #9  0x00002b8b3bdf0a22 in wait_request (waittime=1, SState=0x0) at
> ../../../../src/lib/Libpbs/../Libnet/net_server.c:678
> #10 0x00000000004286e7 in main_loop () at
> ../../../src/resmom/mom_main.c:9134
> #11 0x0000000000428974 in main (argc=4, argv=0x7fff7d472b68) at
> ../../../src/resmom/mom_main.c:9582
>
> On 11/11/2013 02:05 PM, Gus Correa wrote:
>> Hi Matthew
>>
>> Somehow, on node30 "lsof -i | egrep 122520\|122524" (or even a plain
>> "lsof -i") hangs forever, although it works fine on the other nodes.
>> However, node30's eth0 interface (which pbs_mom uses) is healthy: I can
>> ping across it, ssh over it, etc.
>>
>>
>> [root at node30 ~]# lsof -i | egrep 122520\|122524
>>
>> No response.
>>
>> And from another ssh session:
>>
>> [root at node30 ~]# ps aux |grep lsof
>> root     30138  0.0  0.0 105504  1144 pts/0    D+   13:14   0:00 lsof -i
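>>
>> If lsof stays wedged like this, something along these lines might give
>> the same socket information without blocking (just a guess, not tried
>> on node30):
>>
>> ss -tnp | grep '1500[23]'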
>>
>> Likewise, the gdb backtrace also hangs:
>>
>> [root at node30 ~]# gdb --batch -x /tmp/gdb_backtrace_command
>> /opt/torque/active/sbin/pbs_mom 29185
>>
>> No response.
>>
>> And from another ssh session:
>>
>> [root at node30 ~]# cat /tmp/gdb_backtrace_command
>> bt
>> [root at node30 ~]# ps aux |grep gdb
>> root     30176  0.0  0.0 133620  9136 pts/1    S+   13:25   0:00 gdb
>> --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom 29185
>>
>>
>> ******************************
>>
>> Now a similar routine on node32, adjusting the socket and pbs_mom
>> process numbers.
>>
>> On node32, socket 305590 is likewise shared by the root-owned pbs_mom
>> (pid 20286) and one of the user-owned ones (pid 2933); the other
>> user-owned pbs_mom (pid 2932) holds socket 305589.
>>
>> [root at node32 ~]# lsof -i | egrep 305589\|305590
>>
>> Empty output.
>>
>> However ...
>>
>> [root at node32 ~]# lsof -i | egrep pbs_mom
>> pbs_mom   20286   root    5u  IPv4  92105      0t0  TCP *:15002 (LISTEN)
>> pbs_mom   20286   root    6u  IPv4  92106      0t0  TCP *:15003 (LISTEN)
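>>
>> The socket inodes can also be matched to their owners straight from
>> /proc, which avoids lsof altogether; a rough one-liner, using the
>> inode numbers above:
>>
>> ls -l /proc/[0-9]*/fd/* 2>/dev/null | grep 'socket:\[305590\]'
>>
>> Each matching line is a /proc/<pid>/fd entry, so the pid in the path
>> is an owner of that socket.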
>>
>>
>> The gdb backtrace seems to hang on node32 also:
>>
>> [root at node32 ~]# echo bt > /tmp/gdb_backtrace_command
>> [root at node32 ~]# gdb --batch -x /tmp/gdb_backtrace_command
>> /opt/torque/active/sbin/pbs_mom 2932
>>
>> No response.
>>
>> [root at node32 ~]# gdb --batch -x /tmp/gdb_backtrace_command
>> /opt/torque/active/sbin/pbs_mom 2933
>>
>> No response.
>>
>> [root at node32 ~]# ps aux |grep gdb
>> root      3400  0.0  0.0 133616  9132 pts/0    S+   13:47   0:00 gdb
>> --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom 2932
>> root      3434  0.3  0.0 133616  9136 pts/2    S+   13:57   0:00 gdb
>> --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom 2933
>>
>> I didn't explicitly configure Torque with "--with-debug" (unless it is
>> on by default), so I wonder whether we can expect anything useful from gdb.
>>
>> Thank you again for all the help.
>> Gus Correa
>>
>>
>>
>> On 11/11/2013 12:45 PM, Ezell, Matthew A. wrote:
>>> We can see that the second pbs_mom shares a socket with the master.
>>>
>>> Hopefully you have GDB available on the compute nodes.  Can you show me
>>> the output of the following commands on node30:
>>>
>>> lsof -i | egrep 122520\|122524
>>> echo bt > /tmp/gdb_backtrace_command
>>> gdb --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom
>>> 29185
>>> gdb --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom
>>> 29187
>>>
>>> gdb --batch -x /tmp/gdb_backtrace_command /opt/torque/active/sbin/pbs_mom
>>> 17941
>>>
>>>
>>> Thanks,
>>> ~Matt
>>>
>>> ---
>>> Matt Ezell
>>> HPC Systems Administrator
>>> Oak Ridge National Laboratory
>>>
>>>
>>>
>>>
>>> On 11/11/13 12:28 PM, "Gus Correa" <gus at ldeo.columbia.edu> wrote:
>>>
>>>> On 11/11/2013 10:56 AM, Ezell, Matthew A. wrote:
>>>>> Hi Gus-
>>>>>
>>>>> I've seen pbs_mom processes owned by users when interactive jobs are
>>>>> running. The root pbs_mom has to fork() and setuid() to the user to
>>>>> start up an X11 listener (if requested).  In my case they always seem to
>>>>> go
>>>>> away when the job completes.  I assume with the age of these processes,
>>>>> they did *not* go away when the jobs completed.
>>>>>
>>>>> I see that all of your non-root processes are in uninterruptible sleep
>>>>> (D).  Can you tell what they are waiting on?  Getting a GDB backtrace or
>>>>> at minimum the listing of /proc/<pid>/fd would be useful.
>>>>>
>>>>> Good luck,
>>>>> ~Matt
>>>>>
>>>>> ---
>>>>> Matt Ezell
>>>>> HPC Systems Administrator
>>>>> Oak Ridge National Laboratory
>>>>>
>>>>
>>>> Hi Matthew
>>>>
>>>> Thank you for the help!
>>>>
>>>> I am thinking of rolling back to Torque 2.5.X or 2.4.Y,
>>>> but I am not even sure it would solve this nasty problem.
>>>>
>>>> Yes, David Beer told me last time that pbs_mom forks into user-owned
>>>> processes, as you also said.
>>>>
>>>> Does this happen only on interactive jobs or also on batch jobs?
>>>> Only if the user requests X11?
>>>>
>>>> However, I thought this would happen very quickly, only until the
>>>> user-owned pbs_moms launched the actual user processes.
>>>> Under normal conditions I never had a chance to catch these user-owned
>>>> pbs_moms with "ps".
>>>>
>>>> Nevertheless, in the cases I reported, the user-owned pbs_mom
>>>> processes stick around for some reason, don't go away after the job ends
>>>> or dies, and seem to break subsequent jobs (which continue to be
>>>> scheduled for the affected node).
>>>>
>>>> I enclose below the listing of /proc/<pbs_mom_procid>/fd for the
>>>> two affected nodes.
>>>> I took those nodes offline,
>>>> but eventually I need to reboot them and put back in production.
>>>>
>>>> The last process listed is the root-owned pbs_mom, which still seems
>>>> to have the cpuset directories of jobs 324 and 537 open, respectively.
>>>> Those are the jobs the user was running while he also owned the extra
>>>> pbs_mom processes, on Oct 23 and Nov 6, respectively.
>>>> Both jobs' stderr and stdout got stuck in the nodes' spool
>>>> and were never copied back to the user's work directory.
>>>> Job 324 exceeded the walltime (12h).
>>>> Job 537 seems to have stopped very quickly (2 seconds cputime),
>>>> and its stderr is empty.
>>>>
>>>> [root at node30 ~]# ls -l /opt/torque/active/spool/324.master.ER
>>>> -rw------- 1 ltmurray ltmurray 3026 Oct 23 02:55
>>>> /opt/torque/active/spool/324.master.ER
>>>>
>>>> [root at node30 ~]# tail -n 2 /opt/torque/active/spool/324.master.ER
>>>> % Compiled module: STRREPL.
>>>> =>>    PBS: job killed: walltime 43204 exceeded limit 43200
>>>>
>>>>
>>>> [root at node32 ~]# ls -l /opt/torque/active/spool/537.master.ER
>>>> -rw------- 1 ltmurray ltmurray 0 Nov  6 12:49
>>>> /opt/torque/active/spool/537.master.ER
>>>>
>>>> Their tracejob is enclosed.
>>>>
>>>> This user sometimes runs interactive jobs, probably exporting X
>>>> to get the IDL or Matlab GUI.
>>>> However, these two jobs (324 and 537) seem to be batch jobs [and the
>>>> executable is parallelized with OpenMP (not MPI),
>>>> but this probably doesn't matter].
>>>>
>>>> Thank you for your help,
>>>> Gus Correa
>>>>
>>>> *******************************node30**********************************
>>>> [root at node30 ~]# ps aux |grep pbs_mom
>>>> root     17941  0.0  0.0  88160 34780 ?        SLsl Oct17  19:03
>>>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>>>> ltmurray 29185  0.0  0.0  88156 26024 ?        D    Oct23   0:00
>>>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>>>> ltmurray 29187  0.0  0.0  88156 26024 ?        D    Oct23   0:00
>>>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>>>> root     30099  0.0  0.0 103236   892 pts/0    S+   11:39   0:00 grep
>>>> pbs_mom
>>>>
>>>> [root at node30 ~]# ll /proc/29185/fd
>>>> total 0
>>>> lr-x------ 1 root root 64 Nov 11 11:38 0 ->    /dev/null
>>>> l-wx------ 1 root root 64 Nov 11 11:38 1 ->    /dev/null
>>>> l-wx------ 1 root root 64 Nov 11 11:38 2 ->    /dev/null
>>>> lr-x------ 1 root root 64 Nov 11 11:38 7 ->    /
>>>> lrwx------ 1 root root 64 Nov 11 11:38 9 ->    socket:[122520]
>>>>
>>>> [root at node30 ~]# ll /proc/29187/fd
>>>> total 0
>>>> lr-x------ 1 root root 64 Nov 11 11:39 0 ->    /dev/null
>>>> l-wx------ 1 root root 64 Nov 11 11:39 1 ->    /dev/null
>>>> lrwx------ 1 root root 64 Nov 11 11:39 10 ->    socket:[122524]
>>>> l-wx------ 1 root root 64 Nov 11 11:39 2 ->    /dev/null
>>>> lr-x------ 1 root root 64 Nov 11 11:39 7 ->    /
>>>>
>>>> [root at node30 ~]# ll /proc/17941/fd
>>>> total 0
>>>> lr-x------ 1 root root 64 Nov 11 11:52 0 ->    /dev/null
>>>> l-wx------ 1 root root 64 Nov 11 11:52 1 ->    /dev/null
>>>> lrwx------ 1 root root 64 Nov 11 11:52 10 ->    socket:[122524]
>>>> lr-x------ 1 root root 64 Nov 11 11:52 11 ->    /dev/cpuset/torque/324.master
>>>> l-wx------ 1 root root 64 Nov 11 11:52 2 ->    /dev/null
>>>> l-wx------ 1 root root 64 Nov 11 11:52 3 ->
>>>> /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131023
>>>> l-wx------ 1 root root 64 Nov 11 11:52 4 ->
>>>> /opt/torque/4.2.5/gnu-4.4.7/mom_priv/mom.lock
>>>> lrwx------ 1 root root 64 Nov 11 11:52 5 ->    socket:[81628]
>>>> lrwx------ 1 root root 64 Nov 11 11:52 6 ->    socket:[81629]
>>>> lr-x------ 1 root root 64 Nov 11 11:52 7 ->    /
>>>> lr-x------ 1 root root 64 Nov 11 11:52 8 ->    /proc
>>>> lrwx------ 1 root root 64 Nov 11 11:52 9 ->    socket:[122546]
>>>>
>>>> ***********************************************************************
>>>>
>>>> *******************************node32**********************************
>>>> [root at node32 ~]# ps aux |grep pbs_mom
>>>> ltmurray  2932  0.0  0.0  87932 25804 ?        D    Nov06   0:00
>>>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>>>> ltmurray  2933  0.0  0.0  87932 25804 ?        D    Nov06   0:00
>>>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>>>> root      3267  0.0  0.0 103236   892 pts/0    S+   11:59   0:00 grep
>>>> pbs_mom
>>>> root     20286  0.3  0.0  87936 34556 ?        SLsl Oct17 109:36
>>>> /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
>>>> [root at node32 ~]# ll /proc/2932/fd
>>>> total 0
>>>> lr-x------ 1 root root 64 Nov 11 11:58 0 ->    /dev/null
>>>> l-wx------ 1 root root 64 Nov 11 11:58 1 ->    /dev/null
>>>> l-wx------ 1 root root 64 Nov 11 11:58 2 ->    /dev/null
>>>> lr-x------ 1 root root 64 Nov 11 11:58 7 ->    /
>>>> lrwx------ 1 root root 64 Nov 11 11:58 9 ->    socket:[305589]
>>>> [root at node32 ~]# ll /proc/2933/fd
>>>> total 0
>>>> lr-x------ 1 root root 64 Nov 11 11:58 0 ->    /dev/null
>>>> l-wx------ 1 root root 64 Nov 11 11:58 1 ->    /dev/null
>>>> lrwx------ 1 root root 64 Nov 11 11:58 10 ->    socket:[305590]
>>>> l-wx------ 1 root root 64 Nov 11 11:58 2 ->    /dev/null
>>>> lr-x------ 1 root root 64 Nov 11 11:58 7 ->    /
>>>> [root at node32 ~]# ll /proc/20286/fd
>>>> total 0
>>>> lr-x------ 1 root root 64 Nov 11 11:58 0 ->    /dev/null
>>>> l-wx------ 1 root root 64 Nov 11 11:58 1 ->    /dev/null
>>>> lrwx------ 1 root root 64 Nov 11 11:58 10 ->    socket:[305590]
>>>> lrwx------ 1 root root 64 Nov 11 11:58 11 ->    socket:[305612]
>>>> l-wx------ 1 root root 64 Nov 11 11:58 2 ->    /dev/null
>>>> l-wx------ 1 root root 64 Nov 11 11:58 3 ->
>>>> /opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131106
>>>> l-wx------ 1 root root 64 Nov 11 11:58 4 ->
>>>> /opt/torque/4.2.5/gnu-4.4.7/mom_priv/mom.lock
>>>> lrwx------ 1 root root 64 Nov 11 11:58 5 ->    socket:[92105]
>>>> lrwx------ 1 root root 64 Nov 11 11:58 6 ->    socket:[92106]
>>>> lr-x------ 1 root root 64 Nov 11 11:58 7 ->    /
>>>> lr-x------ 1 root root 64 Nov 11 11:58 8 ->    /proc
>>>> lr-x------ 1 root root 64 Nov 11 11:58 9 ->    /dev/cpuset/torque/537.master
>>>> ***********************************************************************
>>>>
>>>> ****************************tracejob 324********************************
>>>> Job: 324.master
>>>>
>>>> 10/22/2013 10:48:41  S    enqueuing into production, state 1 hop 1
>>>> 10/22/2013 10:48:41  A    queue=production
>>>> 10/22/2013 14:55:35  S    Dependency on job 323.master released.
>>>> 10/22/2013 14:55:36  S    Job Run at request of maui at master
>>>> 10/22/2013 14:55:36  A    user=ltmurray group=ltmurray
>>>> jobname=GC.Base.1984.03 queue=production ctime=1382453321
>>>> qtime=1382453321 etime=1382468135 start=1382468136 owner=ltmurray at master
>>>>
>>>> exec_host=node30/0+node30/1+node30/2+node30/3+node30/4+node30/5+node30/6+n
>>>> ode30/7+node30/8+node30/9+node30/10+node30/11+node30/12+node30/13+node30/1
>>>> 4+node30/15+node30/16+node30/17+node30/18+node30/19+node30/20+node30/21+no
>>>> de30/22+node30/23
>>>>                              Resource_List.neednodes=1:ppn=24
>>>> Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>>>> Resource_List.walltime=12:00:00
>>>> 10/23/2013 02:55:46  S    Exit_status=-11 resources_used.cput=66:14:47
>>>> resources_used.mem=4800720kb resources_used.vmem=6672108kb
>>>> resources_used.walltime=12:00:10
>>>> 10/23/2013 02:55:46  S    on_job_exit valid pjob: 324.master (substate=50)
>>>> 10/23/2013 02:55:46  A    user=ltmurray group=ltmurray
>>>> jobname=GC.Base.1984.03 queue=production ctime=1382453321
>>>> qtime=1382453321 etime=1382468135 start=1382468136 owner=ltmurray at master
>>>>
>>>> exec_host=node30/0+node30/1+node30/2+node30/3+node30/4+node30/5+node30/6+n
>>>> ode30/7+node30/8+node30/9+node30/10+node30/11+node30/12+node30/13+node30/1
>>>> 4+node30/15+node30/16+node30/17+node30/18+node30/19+node30/20+node30/21+no
>>>> de30/22+node30/23
>>>>                              Resource_List.neednodes=1:ppn=24
>>>> Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>>>> Resource_List.walltime=12:00:00 session=28077 end=1382511346
>>>> Exit_status=-11 resources_used.cput=66:14:47 resources_used.mem=4800720kb
>>>>                              resources_used.vmem=6672108kb
>>>> resources_used.walltime=12:00:10
>>>> 10/23/2013 02:56:13  S    on_job_exit valid pjob: 324.master (substate=52)
>>>> 10/23/2013 03:05:39  S    Request invalid for state of job EXITING
>>>> 10/23/2013 03:05:43  S    Request invalid for state of job EXITING
>>>> 10/23/2013 07:05:52  S    dequeuing from production, state COMPLETE
>>>>
>>>> ************************************************************************
>>>>
>>>>
>>>> *******************************tracejob 537****************************
>>>> Job: 537.master
>>>>
>>>> 11/06/2013 12:48:58  S    enqueuing into production, state 1 hop 1
>>>> 11/06/2013 12:48:58  A    queue=production
>>>> 11/06/2013 12:49:00  S    Job Run at request of maui at master
>>>> 11/06/2013 12:49:00  A    user=ltmurray group=ltmurray
>>>> jobname=GC.AQAST.2005.01.01 queue=production ctime=1383760138
>>>> qtime=1383760138 etime=1383760138 start=1383760140 owner=ltmurray at master
>>>>
>>>> exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+n
>>>> ode32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/1
>>>> 4+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+no
>>>> de32/22+node32/23
>>>>                              Resource_List.neednodes=1:ppn=24
>>>> Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>>>> Resource_List.walltime=12:00:00
>>>> 11/06/2013 13:09:21  S    Job deleted at request of ltmurray at master
>>>> 11/06/2013 13:09:21  S    Job sent signal SIGTERM on delete
>>>> 11/06/2013 13:09:21  A    requestor=ltmurray at master
>>>> 11/06/2013 13:11:06  S    Job sent signal SIGKILL on delete
>>>> 11/06/2013 13:11:06  S    Exit_status=271 resources_used.cput=00:00:02
>>>> resources_used.mem=7480172kb resources_used.vmem=9157780kb
>>>> resources_used.walltime=00:22:06
>>>> 11/06/2013 13:11:06  S    on_job_exit valid pjob: 537.master (substate=50)
>>>> 11/06/2013 13:11:06  A    user=ltmurray group=ltmurray
>>>> jobname=GC.AQAST.2005.01.01 queue=production ctime=1383760138
>>>> qtime=1383760138 etime=1383760138 start=1383760140 owner=ltmurray at master
>>>>
>>>> exec_host=node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+n
>>>> ode32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/1
>>>> 4+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+no
>>>> de32/22+node32/23
>>>>                              Resource_List.neednodes=1:ppn=24
>>>> Resource_List.nodect=1 Resource_List.nodes=1:ppn=24
>>>> Resource_List.walltime=12:00:00 session=2857 end=1383761466
>>>> Exit_status=271 resources_used.cput=00:00:02 resources_used.mem=7480172kb
>>>>                              resources_used.vmem=9157780kb
>>>> resources_used.walltime=00:22:06
>>>> 11/06/2013 13:11:29  S    on_job_exit valid pjob: 537.master (substate=52)
>>>> 11/06/2013 17:21:09  S    dequeuing from production, state COMPLETE
>>>> ***********************************************************************
>>>>
>>>>
>>>>
>>>>> On 11/8/13 3:12 PM, "Gus Correa" <gus at ldeo.columbia.edu> wrote:
>>>>>
>>>>>> Dear Torque experts
>>>>>>
>>>>>> I am using Torque 4.2.5 and Maui 3.3.1 on a 34-node cluster.
>>>>>>
>>>>>> I reported a similar problem a few weeks ago.
>>>>>> It disappeared after the nodes were rebooted, but it is back.
>>>>>>
>>>>>> Some nodes seem to "develop" or create *multiple pbs_mom processes*
>>>>>> after they have been running for a while.
>>>>>> Those are the more heavily used nodes
>>>>>> (the higher node numbers).
>>>>>> See the printout below, especially nodes 30-32.
>>>>>>
>>>>>> The duplicate pbs_mom processes are owned by a regular user, which is
>>>>>> rather awkward.
>>>>>>
>>>>>> As a result, multi-node jobs that land on the affected nodes fail to run.
>>>>>> I enclose below a tracejob sample of one of those jobs.
>>>>>>
>>>>>> What could be causing this problem, and how can it be solved?
>>>>>>
>>>>>> Thank you,
>>>>>> Gus Correa