[torqueusers] Multiple mom processes, owned by regular user (Torque 4.2.5, Maui 3.3.1)

Gus Correa gus at ldeo.columbia.edu
Fri Nov 8 13:12:19 MST 2013


Dear Torque experts,

I am using Torque 4.2.5 and Maui 3.3.1 on a 34-node cluster.

I reported a similar problem a few weeks ago.
It disappeared after the nodes were rebooted, but it is back.

Some nodes seem to "develop" or spawn *multiple pbs_mom processes*
after they have been running for a while.
These are the more heavily used nodes
(the higher node numbers).
See the printout below, especially nodes 30-32.

The duplicate pbs_mom processes are owned by a regular user, which is
rather odd.

As a result, multi-node jobs that land on the affected nodes fail to run.
A tracejob sample from one such job is enclosed below.

What could be causing this problem, and how can it be solved?

Thank you,
Gus Correa

###################################################################

node29: 	root     17789  0.2  0.0  88192 34816 ?        SLsl Oct17  68:28 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
node29: 	root     28902  0.0  0.0 106056  1476 ?        Ss   15:00   0:00 bash -c ps aux |grep pbs_mom
node29: 	root     28918  0.0  0.0 103236   888 ?        S    15:00   0:00 grep pbs_mom
node30: 	root     17941  0.0  0.0  88160 34780 ?        SLsl Oct17  18:51 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
node30: 	ltmurray 29185  0.0  0.0  88156 26024 ?        D    Oct23   0:00 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
node30: 	ltmurray 29187  0.0  0.0  88156 26024 ?        D    Oct23   0:00 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
node30: 	root     29696  0.0  0.0 106056  1476 ?        Ss   15:00   0:00 bash -c ps aux |grep pbs_mom
node30: 	root     29712  0.0  0.0 103236   892 ?        S    15:00   0:00 grep pbs_mom
node31: 	root       908  0.0  0.0 106056  1476 ?        Ss   15:00   0:00 bash -c ps aux |grep pbs_mom
node31: 	root       925  0.0  0.0 103236   892 ?        R    15:00   0:00 grep pbs_mom
node31: 	root     18675  0.2  0.0  88192 34812 ?        SLsl Oct17  82:53 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
node32: 	ltmurray  2932  0.0  0.0  87932 25804 ?        D    Nov06   0:00 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
node32: 	ltmurray  2933  0.0  0.0  87932 25804 ?        D    Nov06   0:00 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
node32: 	root      3006  0.0  0.0 106056  1472 ?        Ss   15:00   0:00 bash -c ps aux |grep pbs_mom
node32: 	root      3022  0.0  0.0 103236   888 ?        S    15:00   0:00 grep pbs_mom
node32: 	root     20286  0.3  0.0  87936 34556 ?        SLsl Oct17 109:23 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
node33: 	root      8487  0.2  0.0  87940 34560 ?        SLsl Oct17  88:30 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active
node33: 	root     31855  0.0  0.0 106056  1476 ?        Ss   15:00   0:00 bash -c ps aux |grep pbs_mom
node33: 	root     31871  0.0  0.0 103236   892 ?        S    15:00   0:00 grep pbs_mom
node34: 	root     13010  0.0  0.0 106056  1476 ?        Ss   15:00   0:00 bash -c ps aux |grep pbs_mom
node34: 	root     13026  0.0  0.0 103236   888 ?        S    15:00   0:00 grep pbs_mom
node34: 	root     17071  0.3  0.0  88160 34784 ?        SLsl Oct17 110:01 /opt/torque/active/sbin/pbs_mom -p -d /opt/torque/active




###################################################################
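For what it is worth, the printout above was gathered by running "ps aux | grep pbs_mom" on each node. A small filter like the one below could flag the user-owned pbs_mom copies automatically; the function name is made up for this sketch, and it assumes each input line is prefixed with "nodeNN:" as in the printout, so the owner is the second field.

```shell
# Print only pbs_mom lines whose owner (field 2, after the "nodeNN:" prefix)
# is not root. The bracketed [p] keeps the grep/awk processes themselves from
# matching their own command line.
flag_rogue_moms() {
    awk '/[p]bs_mom/ && $2 != "root" { print }'
}
```

One could then feed it the output of a parallel shell, e.g. "pdsh -w node[01-34] 'ps aux | grep [p]bs_mom' | flag_rogue_moms", assuming pdsh or similar is available.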


# tracejob 575

/opt/torque/4.2.5/gnu-4.4.7/mom_logs/20131108: No such file or directory
/opt/torque/4.2.5/gnu-4.4.7/sched_logs/20131108: No such file or directory

Job: 575.master

11/08/2013 12:18:09  S    enqueuing into production, state 1 hop 1
11/08/2013 12:18:09  A    queue=production
11/08/2013 12:18:10  S    Job Run at request of maui at master
11/08/2013 12:18:10  S    Not sending email: User does not want mail of this type.
11/08/2013 12:18:10  A    user=sw2526 group=sw2526 jobname=test1 
queue=production ctime=1383931089 qtime=1383931089 etime=1383931089 
start=1383931090 owner=sw2526 at brewer.ldeo.columbia.edu
 
exec_host=node33/0+node33/1+node33/2+node33/3+node33/4+node33/5+node33/6+node33/7+node33/8+node33/9+node33/10+node33/11+node33/12+node33/13+node33/14+node33/15+node33/16+node33/17+node33/18+node33/19+node33/20+node33/21+node33/22+node33/23+node33/24+node33/25+node33/26+node33/27+node33/28+node33/29+node33/30+node33/31+node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+node32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/14+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+node32/22+node32/23+node32/24+node32/25+node32/26+node32/27+node32/28+node32/29+node32/30+node32/31
                           Resource_List.neednodes=2:ppn=32 
Resource_List.nodect=2 Resource_List.nodes=2:ppn=32 
Resource_List.walltime=12:00:00
11/08/2013 12:27:03  S    DeleteJob issue_Drequest failure, rc = 15033
11/08/2013 12:27:04  S    Job Run at request of maui at master
11/08/2013 12:27:06  S    Not sending email: User does not want mail of this type.
11/08/2013 12:27:06  A    user=sw2526 group=sw2526 jobname=test1 
queue=production ctime=1383931089 qtime=1383931089 etime=1383931089 
start=1383931626 owner=sw2526 at brewer.ldeo.columbia.edu
 
exec_host=node33/0+node33/1+node33/2+node33/3+node33/4+node33/5+node33/6+node33/7+node33/8+node33/9+node33/10+node33/11+node33/12+node33/13+node33/14+node33/15+node33/16+node33/17+node33/18+node33/19+node33/20+node33/21+node33/22+node33/23+node33/24+node33/25+node33/26+node33/27+node33/28+node33/29+node33/30+node33/31+node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+node32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/14+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+node32/22+node32/23+node32/24+node32/25+node32/26+node32/27+node32/28+node32/29+node32/30+node32/31
                           Resource_List.neednodes=2:ppn=32 
Resource_List.nodect=2 Resource_List.nodes=2:ppn=32 
Resource_List.walltime=12:00:00
11/08/2013 12:32:03  S    Exit_status=0 resources_used.cput=00:00:00 
resources_used.mem=0kb resources_used.vmem=0kb 
resources_used.walltime=00:00:00
11/08/2013 12:32:03  S    Not sending email: User does not want mail of this type.
11/08/2013 12:32:03  S    on_job_exit valid pjob: 575.master (substate=50)
11/08/2013 12:32:03  A    user=sw2526 group=sw2526 jobname=test1 
queue=production ctime=1383931089 qtime=1383931089 etime=1383931089 
start=1383931626 owner=sw2526 at brewer.ldeo.columbia.edu
 
exec_host=node33/0+node33/1+node33/2+node33/3+node33/4+node33/5+node33/6+node33/7+node33/8+node33/9+node33/10+node33/11+node33/12+node33/13+node33/14+node33/15+node33/16+node33/17+node33/18+node33/19+node33/20+node33/21+node33/22+node33/23+node33/24+node33/25+node33/26+node33/27+node33/28+node33/29+node33/30+node33/31+node32/0+node32/1+node32/2+node32/3+node32/4+node32/5+node32/6+node32/7+node32/8+node32/9+node32/10+node32/11+node32/12+node32/13+node32/14+node32/15+node32/16+node32/17+node32/18+node32/19+node32/20+node32/21+node32/22+node32/23+node32/24+node32/25+node32/26+node32/27+node32/28+node32/29+node32/30+node32/31
                           Resource_List.neednodes=2:ppn=32 
Resource_List.nodect=2 Resource_List.nodes=2:ppn=32 
Resource_List.walltime=12:00:00 session=0 end=1383931923 Exit_status=0 
resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
                           resources_used.walltime=00:00:00

