[torqueusers] pbs_mom LogErr
Zhoulufeng
zhoulufeng at aol.com
Thu Sep 13 22:50:44 MDT 2012
Hi,
Could someone give me a hand? Thanks.
II. Question
torque-4.1.2.tar.gz and maui-3.3.1.tar.gz were installed on Server-Node. From Server-Node I then logged in to Compute-Node over ssh and installed Torque there with the shell scripts generated by the command [make packages]; this succeeded. I can also log in to Server-Node from Compute-Node.
All installs are complete. I run the [pbs_mom, maui, pbs_server, pbs_sched] services on Server-Node and [pbs_mom] on Compute-Node. I then used the command [ps -A | grep XXX] to look for these services, and they were all running. When I type the command [qnodes], both nodes, server-node and compute-node, are in state free, but after a while the state of compute-node changes to down. I compared the mom_logs on compute-node and server-node: the compute-node mom_logs contain many errors. I tried to solve this, but failed.
I can ping both Compute-Node and Server-Node.
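ping only shows that the hosts are reachable over ICMP; it does not show whether a TCP port accepts connections. A minimal sketch of a TCP-level check, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are available. The function name tcp_port_open is hypothetical, and the address and port in the usage comment are the ones that appear in the mom_logs errors.

```shell
# Return 0 if a TCP connection to host $1, port $2 can be opened
# within 3 seconds, non-zero otherwise (refused, unreachable, or timed out).
# Assumption: bash and coreutils "timeout" are installed.
tcp_port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage (address and port taken from the log errors):
#   tcp_port_open 192.168.40.31 15001 && echo open || echo "closed or unreachable"
```

A non-zero result only says that nothing accepted the connection at that address and port; it does not say why.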
III. mom_logs
09/14/2012 12:47:26;0002; pbs_mom.13033;Svr;Log;Log opened
09/14/2012 12:47:26;0002; pbs_mom.13033;Svr;pbs_mom;Torque Mom Version = 4.1.2, loglevel = 0
09/14/2012 12:47:26;0002; pbs_mom.13034;Svr;setpbsserver;Compute-Node
09/14/2012 12:47:26;0002; pbs_mom.13034;Svr;mom_server_add;server Compute-Node added
09/14/2012 12:47:26;0002; pbs_mom.13034;Svr;usecp;Compute-Node:/home /home
09/14/2012 12:47:26;0002; pbs_mom.13034;n/a;initialize;independent
09/14/2012 12:47:26;0080; pbs_mom.13034;Svr;pbs_mom;before init_abort_jobs
09/14/2012 12:47:26;0002; pbs_mom.13034;Svr;pbs_mom;Is up
09/14/2012 12:47:26;0002; pbs_mom.13034;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/torque-4.1.2/sbin/pbs_mom 1347352632
09/14/2012 12:47:26;0002; pbs_mom.13034;Svr;pbs_mom;Torque Mom Version = 4.1.2, loglevel = 0
09/14/2012 12:47:26;0001; pbs_mom.13034;Svr;pbs_mom;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 192.168.40.31:15001]
09/14/2012 12:47:26;0001; pbs_mom.13034;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Cannot get a valid stream to send update to server 'Compute-Node'
09/14/2012 12:47:30;0001; pbs_mom.13034;Svr;pbs_mom;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 192.168.40.31:15001]
09/14/2012 12:47:30;0001; pbs_mom.13034;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Cannot get a valid stream to send update to server 'Compute-Node'
09/14/2012 12:47:34;0001; pbs_mom.13034;Svr;pbs_mom;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 192.168.40.31:15001]
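Because the same error repeats every few seconds, it may help to summarize which addresses the failed connections were aimed at. A minimal sketch, assuming the log lines keep the "connect() failed ... [addr = IP:PORT]" layout shown above; the function name count_connect_failures is hypothetical.

```shell
# Count "connect() failed" entries per target address in a pbs_mom log.
# Reads the files given as arguments, or stdin when none are given.
# Prints one "ADDRESS COUNT" line per distinct address.
count_connect_failures() {
    sed -n '/connect() failed/s/.*\[addr = \([^]]*\)].*/\1/p' "$@" |
        sort | uniq -c | awk '{print $2, $1}'
}
```

Run on the excerpt above, this would print 192.168.40.31:15001 3, since all three failures target the same address and port.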
I. Environment
1. Two PCs
hostname: Server-Node
IP: 192.168.40.34
OS: CentOS 6.3 x86_64
hostname: Compute-Node
IP: 192.168.40.31
OS: CentOS 6.3 x86_64
2. Server-Node
/etc/hosts
-----------------------
127.0.0.1 localhost localhost localhost4 localhost6
::1 localhost localhost localhost4 localhost6
192.168.40.34 Server-Node Server-Node
192.168.40.31 Compute-Node Compute-Node
-----------------------
/etc/resolv.conf
-----------------------
nameserver 192.168.40.1
-----------------------
Note: my network uses DHCP, but both PCs are set to static IPs. When I type [ipconfig /all] on another PC on the same network (Windows XP, with its IP configured by DHCP), the DNS server shown is 192.168.40.1, so I set the DNS server of both PCs to 192.168.40.1.
3. Compute-Node
/etc/hosts
-----------------------
127.0.0.1 localhost localhost localhost4 localhost6
::1 localhost localhost localhost4 localhost6
192.168.40.34 Server-Node Server-Node
192.168.40.31 Compute-Node Compute-Node
-----------------------
/etc/resolv.conf
-----------------------
nameserver 192.168.40.1
-----------------------
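To double-check which address a given name maps to in the hosts files listed above, a lookup along these lines could be run on each node. This is only a sketch: the function name lookup_in_hosts is hypothetical, and it reads the hosts file directly rather than going through the system resolver, so it ignores DNS.

```shell
# Print the address that a hosts-format file maps to a host name.
# $1 = host name (compared case-insensitively)
# $2 = hosts file to read (defaults to /etc/hosts)
# Prints nothing if the name is not listed; only the first match is shown.
lookup_in_hosts() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                     # strip comments
        { for (i = 2; i <= NF; i++)            # field 1 is the address
              if (tolower($i) == tolower(name)) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}

# Usage:
#   lookup_in_hosts Server-Node
#   lookup_in_hosts Compute-Node
```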
Thanks!
Yours
Zhou
--------------------------------------------
zhoulufeng at aol.com
--------------------------------------------