[torqueusers] problem Server could not connect to MOM
Daniel
andrzeje at cs.utk.edu
Tue Jul 29 18:48:10 MDT 2008
Hi Josh,
pbsnodes -a reports all four nodes as free (output below). I forgot to mention that I don't think a firewall is the cause.
-bash-3.1# pbsnodes -a
boba1.sinrg.local
state = free
np = 2
ntype = cluster
status = opsys=linux,uname=Linux boba1 2.6.18-53.el5 #1 SMP Mon Nov 12 02:22:48 EST 2007 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=11289,totmem=4172296kb,availmem=4117160kb,physmem=2075152kb,ncpus=2,loadave=0.00,netload=4499631,state=free,jobs=,varattr=,rectime=1217378257
boba2.sinrg.local
state = free
np = 2
ntype = cluster
status = opsys=linux,uname=Linux boba2 2.6.18-53.el5 #1 SMP Mon Nov 12 02:22:48 EST 2007 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=392,totmem=4172296kb,availmem=4089808kb,physmem=2075152kb,ncpus=2,loadave=0.00,netload=130531473,state=free,jobs=,varattr=,rectime=1217378292
boba3.sinrg.local
state = free
np = 2
ntype = cluster
status = opsys=linux,uname=Linux boba3 2.6.18-53.el5 #1 SMP Mon Nov 12 02:22:48 EST 2007 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=11840,totmem=4172296kb,availmem=4116928kb,physmem=2075152kb,ncpus=2,loadave=0.01,netload=1773484,state=free,jobs=,varattr=,rectime=1217378248
boba4.sinrg.local
state = free
np = 2
ntype = cluster
status = opsys=linux,uname=Linux boba4 2.6.18-53.el5 #1 SMP Mon Nov 12 02:22:48 EST 2007 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=11514,totmem=4172296kb,availmem=4110436kb,physmem=2075152kb,ncpus=2,loadave=0.00,netload=1488085,state=free,jobs=,varattr=,rectime=1217378257
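
To test the firewall assumption rather than rely on it, something like the following could be run from the head node. It is only a sketch: the function names port_open and check_moms are made up for illustration, and the ports 15002 and 15003 are assumed to be the stock pbs_mom defaults, which a site may have changed at build time or in /etc/services. It also assumes bash with /dev/tcp support and the coreutils timeout command.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a TCP connection to host:port
# can be opened within 3 seconds (uses bash's /dev/tcp redirection).
port_open() {
    local host=$1 port=$2
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Hypothetical helper: probe the assumed default pbs_mom ports
# (15002 and 15003) on every host given as an argument.
check_moms() {
    local host port
    for host in "$@"; do
        for port in 15002 15003; do
            if port_open "$host" "$port"; then
                echo "$host:$port open"
            else
                echo "$host:$port unreachable"
            fi
        done
    done
}
```

On this cluster it would be called as: check_moms boba1 boba2 boba3 boba4. An "unreachable" line would point at a firewall, a name-resolution problem, or a MOM listening on a different port. All "open" would suggest looking elsewhere.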
Daniel
--
Joshua Bernstein wrote:
> Hi Daniel,
>
> If you submit a non-interactive job, does that work? What does
> pbsnodes -a say?
>
> -Josh
>
> Daniel Andrzejewski wrote:
>> Hi,
>>
>> I have 1 head node and 4 compute nodes, Torque 2.3.1 and CentOS 5.1.
>>
>> When I submit an interactive job it hangs.
>>
>> How can I trace the problem?
>>
>>
>>
>> andrzeje:boba-head ~> strace qsub -I -l nodes=2:ppn=2
>>
>>
>> execve("/usr/local/bin/qsub", ["qsub", "-I", "-l", "nodes=2:ppn=2"], [/* 35 vars */]) = 0
>> .
>> .
>> .
>> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7e8a000
>> write(1, "qsub: waiting for job 78.boba-he"..., 56qsub: waiting for job 78.boba-head.sinrg.local to start
>> ) = 56
>> select(1024, [3], NULL, NULL, {30, 0}
>>
>>
>>
>> -bash-3.1# tail -f /sw/var/torque/server_logs/20080725
>> 07/25/2008 14:45:56;0040;PBS_Server;Svr;boba-head.sinrg.local;Scheduler sent command new
>> 07/25/2008 14:45:57;0008;PBS_Server;Job;78.boba-head.sinrg.local;Job Modified at request of root@boba-head.sinrg.local
>> 07/25/2008 14:45:57;0001;PBS_Server;Req;;Server could not connect to MOM
>> 07/25/2008 14:45:57;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from root@boba-head.sinrg.local
>> 07/25/2008 14:46:28;0008;PBS_Server;Job;78.boba-head.sinrg.local;Job Modified at request of root@boba-head.sinrg.local
>> 07/25/2008 14:46:28;0001;PBS_Server;Req;;Server could not connect to MOM
>> 07/25/2008 14:46:28;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from root@boba-head.sinrg.local
>> 07/25/2008 14:46:59;0008;PBS_Server;Job;78.boba-head.sinrg.local;Job Modified at request of root@boba-head.sinrg.local
>> 07/25/2008 14:46:59;0001;PBS_Server;Req;;Server could not connect to MOM
>> 07/25/2008 14:46:59;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=ModifyJob, from root@boba-head.sinrg.local
>>
>>
>>
>>
>> -bash-3.1# showq
>> ACTIVE JOBS--------------------
>> JOBNAME USERNAME STATE PROC REMAINING STARTTIME
>>
>> 0 Active Jobs 0 of 8 Processors Active (0.00%)
>> 0 of 4 Nodes Active (0.00%)
>>
>> IDLE JOBS----------------------
>> JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
>>
>> 76 andrzeje Idle 4 4:00:00 Fri Jul 25 14:38:39
>>
>> 1 Idle Job
>>
>> BLOCKED JOBS----------------
>> JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
>>
>> Total Jobs: 1 Active Jobs: 0 Idle Jobs: 1 Blocked Jobs: 0
>>
>>
>> -bash-3.1# dsh -g boba ps -eaf | grep pbs
>> boba1: root 9491 1 0 14:03 ? 00:00:00 /usr/local/sbin/pbs_mom
>> boba2: root 6733 1 0 14:03 ? 00:00:00 /usr/local/sbin/pbs_mom
>> boba3: root 6941 1 0 14:03 ? 00:00:00 /usr/local/sbin/pbs_mom
>> boba4: root 4040 1 0 14:17 ? 00:00:00 /usr/local/sbin/pbs_mom
>>
>>
>> -bash-3.1# ps -eaf | grep pbs
>> root 31789 1 0 14:03 ? 00:00:00 /usr/local/sbin/pbs_server
>> root 31987 31211 0 14:41 pts/2 00:00:00 grep pbs
>>
>>
>> -bash-3.1# ps -eaf | grep maui
>> root 31792 1 0 14:03 ? 00:00:00 /usr/local/sbin/maui
>> root 31989 31211 0 14:41 pts/2 00:00:00 grep maui
>>
>>
>> Thanks,
>>
>> Daniel
--
Daniel Andrzejewski
student IT Administrator
Elec Engr & Comp Science
University of Tennessee
(865) 974 - 4388 (work)
"Investment in knowledge always pays the best interest" Benjamin Franklin