[torqueusers] pbs_mom LogErr

Zhoulufeng zhoulufeng at aol.com
Thu Sep 13 22:50:44 MDT 2012



Hi,

Could someone give me a hand with this? Thanks.



II.Question
   I installed torque-4.1.2.tar.gz and maui-3.3.1.tar.gz on Server-Node. From Server-Node I then logged in to Compute-Node over ssh and installed torque there using the shell scripts generated by [make packages]; this succeeded. I can also log in to Server-Node from Compute-Node.
   With all installs complete, I started the [pbs_mom, maui, pbs_server, pbs_sched] services on Server-Node and [pbs_mom] on Compute-Node. I then used [ps -A | grep XXX] to look for these services, and they were all running. When I run [qnodes], both nodes (Server-Node and Compute-Node) are initially in state free, but after a while the state of Compute-Node changes to down. I compared the mom_logs on Compute-Node and Server-Node: the Compute-Node mom_logs contain many errors. I tried to solve this but failed.
   I can ping both Compute-Node and Server-Node.
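
A successful ping only shows that the hosts can reach each other at the IP level; it does not show that a daemon is accepting TCP connections on a given port. The following is a minimal sketch of a port check, assuming Python is available on the nodes. The function name tcp_port_open is made up for illustration and is not part of TORQUE.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be established.

    Any socket-level failure (connection refused, timeout, unreachable
    host, name resolution error) is reported as False.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling tcp_port_open("192.168.40.34", 15001) on Compute-Node would indicate whether anything on Server-Node accepts connections on port 15001, the port that appears in the mom_logs.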
III.mom_logs  

09/14/2012 12:47:26;0002;   pbs_mom.13033;Svr;Log;Log opened
09/14/2012 12:47:26;0002;   pbs_mom.13033;Svr;pbs_mom;Torque Mom Version = 4.1.2, loglevel = 0
09/14/2012 12:47:26;0002;   pbs_mom.13034;Svr;setpbsserver;Compute-Node
09/14/2012 12:47:26;0002;   pbs_mom.13034;Svr;mom_server_add;server Compute-Node added
09/14/2012 12:47:26;0002;   pbs_mom.13034;Svr;usecp;Compute-Node:/home /home
09/14/2012 12:47:26;0002;   pbs_mom.13034;n/a;initialize;independent
09/14/2012 12:47:26;0080;   pbs_mom.13034;Svr;pbs_mom;before init_abort_jobs
09/14/2012 12:47:26;0002;   pbs_mom.13034;Svr;pbs_mom;Is up
09/14/2012 12:47:26;0002;   pbs_mom.13034;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/torque-4.1.2/sbin/pbs_mom 1347352632
09/14/2012 12:47:26;0002;   pbs_mom.13034;Svr;pbs_mom;Torque Mom Version = 4.1.2, loglevel = 0
09/14/2012 12:47:26;0001;   pbs_mom.13034;Svr;pbs_mom;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 192.168.40.31:15001]
09/14/2012 12:47:26;0001;   pbs_mom.13034;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Cannot get a valid stream to send update to server 'Compute-Node'
09/14/2012 12:47:30;0001;   pbs_mom.13034;Svr;pbs_mom;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 192.168.40.31:15001]
09/14/2012 12:47:30;0001;   pbs_mom.13034;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Cannot get a valid stream to send update to server 'Compute-Node'
09/14/2012 12:47:34;0001;   pbs_mom.13034;Svr;pbs_mom;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 192.168.40.31:15001]
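
To see at a glance which address the mom keeps failing to reach, the log lines above can be summarized with a short script. This is only a sketch: the field names (timestamp, code, daemon, type, source, message) are labels chosen here for the six semicolon-separated fields visible in the log, not official TORQUE terminology, and the regular expression assumes the "[addr = IP:PORT]" suffix shown in the connect() failure lines.

```python
import re
from collections import Counter

# Matches the "[addr = IP:PORT]" suffix seen in the connect() failure lines.
ADDR_RE = re.compile(r"\[addr = (\d{1,3}(?:\.\d{1,3}){3}):(\d+)\]")


def parse_mom_log_line(line):
    """Split one mom_logs line into its six semicolon-separated fields.

    Returns None for lines that do not have six fields.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    timestamp, code, daemon, obj_type, source, message = parts
    return {
        "timestamp": timestamp,
        "code": code,
        "daemon": daemon.strip(),  # the log pads this field with spaces
        "type": obj_type,
        "source": source,
        "message": message,
    }


def refused_targets(lines):
    """Count the (ip, port) targets of 'Connection refused' errors."""
    counts = Counter()
    for line in lines:
        rec = parse_mom_log_line(line)
        if rec is None or "Connection refused" not in rec["message"]:
            continue
        match = ADDR_RE.search(rec["message"])
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

Applied to the log above, every refused connection targets 192.168.40.31:15001. According to the environment section, 192.168.40.31 is the address of Compute-Node itself, not of Server-Node.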



I.Environment
1,two PCs
    hostname: Server-Node
    IP: 192.168.40.34
    OS: CentOS 6.3 x86_64

    hostname: Compute-Node
    IP: 192.168.40.31
    OS: CentOS 6.3 x86_64


2,Server-Node
    /etc/hosts
-----------------------
127.0.0.1 localhost localhost localhost4 localhost6
::1 localhost localhost localhost4 localhost6
192.168.40.34 Server-Node Server-Node
192.168.40.31 Compute-Node Compute-Node
-----------------------
  /etc/resolv.conf
-----------------------
nameserver 192.168.40.1
-----------------------
Note: my network uses DHCP, but both PCs have static IP addresses. Running [ipconfig /all] on another PC on the same network (Windows XP, configured by DHCP) shows the DNS server as 192.168.40.1, so I set the DNS server on both PCs to 192.168.40.1.



3,Compute-Node
    /etc/hosts
-----------------------
127.0.0.1 localhost localhost localhost4 localhost6
::1 localhost localhost localhost4 localhost6
192.168.40.34 Server-Node Server-Node
192.168.40.31 Compute-Node Compute-Node
-----------------------
  /etc/resolv.conf
-----------------------
nameserver 192.168.40.1
-----------------------
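
Since both nodes rely on /etc/hosts for each other's names, it can help to confirm that each hostname maps to exactly the static address listed in the environment section. The sketch below parses hosts-style text into a name-to-addresses table; parse_hosts is a hypothetical helper written for this post, not a system or TORQUE tool.

```python
def parse_hosts(text):
    """Map each hostname in /etc/hosts-style text to the IPs that name it."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            ips = table.setdefault(name, [])
            if ip not in ips:  # a name repeated on one line counts once
                ips.append(ip)
    return table


if __name__ == "__main__":
    with open("/etc/hosts") as handle:
        hosts = parse_hosts(handle.read())
    for name in ("Server-Node", "Compute-Node"):
        print(name, hosts.get(name, []))
```

A hostname that maps to a loopback address, or to more than one address, would be worth a closer look; with the files shown above each node name maps to a single 192.168.40.x address.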










Thanks!
Yours
Zhou
--------------------------------------------
zhoulufeng at aol.com

--------------------------------------------


