[torqueusers] problem Server could not connect to MOM

Joshua Bernstein jbernstein at penguincomputing.com
Mon Jul 28 14:27:10 MDT 2008


Hi Daniel,

	Does a non-interactive job run if you submit one? What does
pbsnodes -a say?
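
A rough sketch of those two checks follows. The helper name check_torque and the "sleep 30" job body are placeholders for illustration, not anything from your setup; the resource request simply mirrors the one in your interactive qsub.

```shell
#!/bin/sh
# Hypothetical helper: show node state, then submit a trivial batch job.
check_torque() {
    # Bail out early if the Torque client tools are not in PATH.
    if ! command -v pbsnodes >/dev/null 2>&1 ||
       ! command -v qsub >/dev/null 2>&1; then
        echo "Torque client tools not found in PATH" >&2
        return 1
    fi

    # List every node together with its reported state.
    pbsnodes -a

    # Submit a non-interactive job; qsub reads the job script from stdin
    # when no script file is given. "sleep 30" is only a placeholder.
    echo "sleep 30" | qsub -l nodes=2:ppn=2
}
```

After calling check_torque, qstat should show whether the batch job ever leaves the queued state. If pbsnodes lists the compute nodes as down, that would suggest the server cannot reach the MOMs at all, rather than a problem specific to interactive jobs.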

-Josh

Daniel Andrzejewski wrote:
> Hi,
> 
> I have 1 head node and 4 compute nodes, Torque 2.3.1 and CentOS 5.1.
> 
> When I submit an interactive job it hangs.
> 
> How can I trace the problem?
> 
> 
> 
> andrzeje:boba-head ~> strace qsub -I -l nodes=2:ppn=2
> 
> 
> execve("/usr/local/bin/qsub", ["qsub", "-I", "-l", "nodes=2:ppn=2"], [/* 35 vars */]) = 0
> .
> .
> .
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7e8a000
> write(1, "qsub: waiting for job 78.boba-he"..., 56qsub: waiting for job
> 78.boba-head.sinrg.local to start
> ) = 56
> select(1024, [3], NULL, NULL, {30, 0}
> 
> 
> 
> -bash-3.1# tail -f /sw/var/torque/server_logs/20080725
> 07/25/2008 14:45:56;0040;PBS_Server;Svr;boba-head.sinrg.local;Scheduler sent command new
> 07/25/2008 14:45:57;0008;PBS_Server;Job;78.boba-head.sinrg.local;Job Modified at request of root at boba-head.sinrg.local
> 07/25/2008 14:45:57;0001;PBS_Server;Req;;Server could not connect to MOM
> 07/25/2008 14:45:57;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from root at boba-head.sinrg.local
> 07/25/2008 14:46:28;0008;PBS_Server;Job;78.boba-head.sinrg.local;Job Modified at request of root at boba-head.sinrg.local
> 07/25/2008 14:46:28;0001;PBS_Server;Req;;Server could not connect to MOM
> 07/25/2008 14:46:28;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from root at boba-head.sinrg.local
> 07/25/2008 14:46:59;0008;PBS_Server;Job;78.boba-head.sinrg.local;Job Modified at request of root at boba-head.sinrg.local
> 07/25/2008 14:46:59;0001;PBS_Server;Req;;Server could not connect to MOM
> 07/25/2008 14:46:59;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from root at boba-head.sinrg.local
> 
> 
> 
> 
> -bash-3.1# showq
> ACTIVE JOBS--------------------
> JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
> 
> 
>      0 Active Jobs       0 of    8 Processors Active (0.00%)
>                          0 of    4 Nodes Active      (0.00%)
> 
> IDLE JOBS----------------------
> JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
> 
> 76                 andrzeje       Idle     4     4:00:00  Fri Jul 25 14:38:39
> 
> 1 Idle Job
> 
> BLOCKED JOBS----------------
> JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
> 
> 
> Total Jobs: 1   Active Jobs: 0   Idle Jobs: 1   Blocked Jobs: 0
> 
> 
> -bash-3.1# dsh -g boba ps -eaf | grep pbs
> boba1: root      9491     1  0 14:03 ?        00:00:00 /usr/local/sbin/pbs_mom
> boba2: root      6733     1  0 14:03 ?        00:00:00 /usr/local/sbin/pbs_mom
> boba3: root      6941     1  0 14:03 ?        00:00:00 /usr/local/sbin/pbs_mom
> boba4: root      4040     1  0 14:17 ?        00:00:00 /usr/local/sbin/pbs_mom
> 
> 
> -bash-3.1# ps -eaf | grep pbs
> root     31789     1  0 14:03 ?        00:00:00 /usr/local/sbin/pbs_server
> root     31987 31211  0 14:41 pts/2    00:00:00 grep pbs
> 
> 
> -bash-3.1# ps -eaf | grep maui
> root     31792     1  0 14:03 ?        00:00:00 /usr/local/sbin/maui
> root     31989 31211  0 14:41 pts/2    00:00:00 grep maui
> 
> 
> Thanks,
> 
> Daniel

