[torqueusers] problem Server could not connect to MOM

Joshua Bernstein jbernstein at penguincomputing.com
Fri Aug 8 12:21:26 MDT 2008


What happens if you submit a non-interactive job? Do you have pbs_sched 
running?

-Josh

Daniel wrote:
> Hi Josh,
> 
> It says that all the nodes are free. I forgot to add that I don't think 
> it's an issue with a firewall.
> 
> -bash-3.1# pbsnodes -a
> boba1.sinrg.local
>      state = free
>      np = 2
>      ntype = cluster
>      status = opsys=linux,uname=Linux boba1 2.6.18-53.el5 #1 SMP Mon Nov 12 02:22:48 EST 2007 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=11289,totmem=4172296kb,availmem=4117160kb,physmem=2075152kb,ncpus=2,loadave=0.00,netload=4499631,state=free,jobs=,varattr=,rectime=1217378257
> 
> 
> boba2.sinrg.local
>      state = free
>      np = 2
>      ntype = cluster
>      status = opsys=linux,uname=Linux boba2 2.6.18-53.el5 #1 SMP Mon Nov 12 02:22:48 EST 2007 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=392,totmem=4172296kb,availmem=4089808kb,physmem=2075152kb,ncpus=2,loadave=0.00,netload=130531473,state=free,jobs=,varattr=,rectime=1217378292
> 
> 
> boba3.sinrg.local
>      state = free
>      np = 2
>      ntype = cluster
>      status = opsys=linux,uname=Linux boba3 2.6.18-53.el5 #1 SMP Mon Nov 12 02:22:48 EST 2007 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=11840,totmem=4172296kb,availmem=4116928kb,physmem=2075152kb,ncpus=2,loadave=0.01,netload=1773484,state=free,jobs=,varattr=,rectime=1217378248
> 
> 
> boba4.sinrg.local
>      state = free
>      np = 2
>      ntype = cluster
>      status = opsys=linux,uname=Linux boba4 2.6.18-53.el5 #1 SMP Mon Nov 12 02:22:48 EST 2007 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=11514,totmem=4172296kb,availmem=4110436kb,physmem=2075152kb,ncpus=2,loadave=0.00,netload=1488085,state=free,jobs=,varattr=,rectime=1217378257
> 
> 
> 
> Daniel
> -- 
> Joshua Bernstein wrote:
>> Hi Daniel,
>>
>>     If you submit a non-interactive job, does that work? What does 
>> pbsnodes -a say?
>>
>> -Josh
>>
>> Daniel Andrzejewski wrote:
>>> Hi,
>>>
>>> I have 1 head node and 4 compute nodes, Torque 2.3.1 and CentOS 5.1.
>>>
>>> When I submit an interactive job it hangs.
>>>
>>> How can I trace the problem?
>>>
>>>
>>>
>>> andrzeje:boba-head ~> strace qsub -I -l nodes=2:ppn=2
>>>
>>>
>>> execve("/usr/local/bin/qsub", ["qsub", "-I", "-l", "nodes=2:ppn=2"], [/* 35 vars */]) = 0
>>> .
>>> .
>>> .
>>> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7e8a000
>>> write(1, "qsub: waiting for job 78.boba-he"..., 56qsub: waiting for job 78.boba-head.sinrg.local to start
>>> ) = 56
>>> select(1024, [3], NULL, NULL, {30, 0}
>>>
>>>
>>>
>>> -bash-3.1# tail -f /sw/var/torque/server_logs/20080725
>>> 07/25/2008 14:45:56;0040;PBS_Server;Svr;boba-head.sinrg.local;Scheduler sent command new
>>> 07/25/2008 14:45:57;0008;PBS_Server;Job;78.boba-head.sinrg.local;Job Modified at request of root@boba-head.sinrg.local
>>> 07/25/2008 14:45:57;0001;PBS_Server;Req;;Server could not connect to MOM
>>> 07/25/2008 14:45:57;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from root@boba-head.sinrg.local
>>> 07/25/2008 14:46:28;0008;PBS_Server;Job;78.boba-head.sinrg.local;Job Modified at request of root@boba-head.sinrg.local
>>> 07/25/2008 14:46:28;0001;PBS_Server;Req;;Server could not connect to MOM
>>> 07/25/2008 14:46:28;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from root@boba-head.sinrg.local
>>> 07/25/2008 14:46:59;0008;PBS_Server;Job;78.boba-head.sinrg.local;Job Modified at request of root@boba-head.sinrg.local
>>> 07/25/2008 14:46:59;0001;PBS_Server;Req;;Server could not connect to MOM
>>> 07/25/2008 14:46:59;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from root@boba-head.sinrg.local
>>>
>>>
>>>
>>>
>>> -bash-3.1# showq
>>> ACTIVE JOBS--------------------
>>> JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
>>>
>>>
>>>      0 Active Jobs       0 of    8 Processors Active (0.00%)
>>>                          0 of    4 Nodes Active      (0.00%)
>>>
>>> IDLE JOBS----------------------
>>> JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
>>>
>>> 76                 andrzeje       Idle     4     4:00:00  Fri Jul 25 14:38:39
>>>
>>> 1 Idle Job
>>>
>>> BLOCKED JOBS----------------
>>> JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
>>>
>>>
>>> Total Jobs: 1   Active Jobs: 0   Idle Jobs: 1   Blocked Jobs: 0
>>>
>>>
>>> -bash-3.1# dsh -g boba ps -eaf | grep pbs
>>> boba1: root      9491     1  0 14:03 ?        00:00:00 /usr/local/sbin/pbs_mom
>>> boba2: root      6733     1  0 14:03 ?        00:00:00 /usr/local/sbin/pbs_mom
>>> boba3: root      6941     1  0 14:03 ?        00:00:00 /usr/local/sbin/pbs_mom
>>> boba4: root      4040     1  0 14:17 ?        00:00:00 /usr/local/sbin/pbs_mom
>>>
>>>
>>> -bash-3.1# ps -eaf | grep pbs
>>> root     31789     1  0 14:03 ?        00:00:00 /usr/local/sbin/pbs_server
>>> root     31987 31211  0 14:41 pts/2    00:00:00 grep pbs
>>>
>>>
>>> -bash-3.1# ps -eaf | grep maui
>>> root     31792     1  0 14:03 ?        00:00:00 /usr/local/sbin/maui
>>> root     31989 31211  0 14:41 pts/2    00:00:00 grep maui
>>>
>>>
>>> Thanks,
>>>
>>> Daniel
> 
> 
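
The server log quoted in the original message repeats the same reject
every 30 seconds or so. When tracing a problem like this, it can help to
tally the req_reject records instead of reading the tail by eye. The
following is only a rough sketch: it assumes each record is six
semicolon-separated fields, as the quoted lines suggest, and the field
names (timestamp, event, daemon, obj_type, obj_name, message) are labels
invented here for illustration, not names taken from Torque. The sample
lines are from the quoted log, rejoined onto single lines.

```python
from collections import Counter

# Sample records from the server log quoted in this thread,
# rejoined onto single lines.
SAMPLE_LOG = """\
07/25/2008 14:45:56;0040;PBS_Server;Svr;boba-head.sinrg.local;Scheduler sent command new
07/25/2008 14:45:57;0008;PBS_Server;Job;78.boba-head.sinrg.local;Job Modified at request of root@boba-head.sinrg.local
07/25/2008 14:45:57;0001;PBS_Server;Req;;Server could not connect to MOM
07/25/2008 14:45:57;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from root@boba-head.sinrg.local
"""

# Hypothetical field labels; only the six-field layout comes from the log.
FIELDS = ("timestamp", "event", "daemon", "obj_type", "obj_name", "message")


def parse_line(line):
    """Split one log record into a dict, or return None if it is incomplete."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != len(FIELDS):
        return None  # blank line or a fragment of a wrapped record
    return dict(zip(FIELDS, parts))


def summarize_rejects(lines):
    """Count req_reject records, keyed by their message text."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is not None and rec["obj_name"] == "req_reject":
            counts[rec["message"]] += 1
    return counts


if __name__ == "__main__":
    for message, n in summarize_rejects(SAMPLE_LOG.splitlines()).most_common():
        print(n, message)
```

To run it against a real log, pass an open file object (for example the
file under server_logs/ shown in the original message) to
summarize_rejects() in place of the sample lines.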

