[torqueusers] Jobs not terminating

Tom Combs combs at magnet.fsu.edu
Mon Apr 3 15:21:35 MDT 2006


Garrick,  When I run pbs_mom -S 15001 on the nodes, the problem goes
away.  I really don't know why the mom was trying to use 15000, because I
certainly didn't tell it to.  Thanks for your help.  --Tom Combs


Garrick Staples wrote:

>On Fri, Mar 31, 2006 at 09:50:42AM -0500, Tom Combs alleged:
>  
>
>>Garrick, Thanks for your reply. I'm going to start over with my current
>>status, thus no included responses....
>>
>>Here's the problem:  I can submit jobs, and they run with all the
>>user-specified tasks being performed, but then hang at job termination.
>>The out and in files remain in the spool directory and qstat shows the
>>job as running. The owner of the job has no processes running on the mom
>>node. The repeated error in the mom_log is:
>>
>>pbs_mom;Req;jobobit;No contact with server at hostaddr c000000a,
>>  port 15000,jobid 0.cmt errno 111
>>    
>>
>
>Translating c000000a to a dotted-quad, that's 192.0.0.10.  So this isn't
>the "localhost" problem described earlier.
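The hostaddr translation above can be reproduced with a short Python sketch. The function name hostaddr_to_dotted_quad is illustrative and not part of TORQUE; it assumes the mom_log prints the IPv4 address as eight hex digits in network byte order, which matches the c000000a reading given above.

```python
import ipaddress


def hostaddr_to_dotted_quad(hex_addr: str) -> str:
    """Convert a hex hostaddr such as 'c000000a' to dotted-quad notation.

    Assumption: the value is a 32-bit IPv4 address in network byte order.
    """
    return str(ipaddress.IPv4Address(int(hex_addr, 16)))


print(hostaddr_to_dotted_quad("c000000a"))  # prints 192.0.0.10
```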
>
>errno 111 is "connection refused".  Can the nodes reach port 15000 on
>192.0.0.10?  Try running 'qstat' or 'qmgr' from the node.  FC4 has port
>filtering rules by default.  'service iptables stop' on 192.0.0.10 might
>fix this.
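To check from a node whether a given port on the server answers, a minimal Python sketch along these lines could stand in for running qstat. probe_tcp_port is a hypothetical helper, and the address and ports in the usage comment are simply the values from this thread.

```python
import socket


def probe_tcp_port(host, port, timeout=3.0):
    """Try one TCP connect and return 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # ECONNREFUSED; on Linux this is the errno 111 seen in the mom_log.
        return "refused"
    except socket.timeout:
        # No answer at all, which is typical of a packet filter dropping traffic.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Example (values from this thread, run from a compute node):
#   probe_tcp_port("192.0.0.10", 15000)
#   probe_tcp_port("192.0.0.10", 15001)
```

A "refused" result means the host is reachable but nothing is listening on that port, whereas "timeout" points more toward filtering.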
>
> 
>However, on closer inspection, port 15000 looks wrong.  That should be
>15001.  Is pbs_mom started with -S?  Run 'getent services pbs/tcp' on a
>node; it should print nothing or 15001.
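The services lookup suggested above can also be done from Python, whose socket.getservbyname consults the same system services database. This is only a sketch: lookup_service_port and pbs_entry_is_sane are hypothetical helper names, and the "nothing or 15001" rule is taken directly from the reply rather than from TORQUE documentation.

```python
import socket

EXPECTED_PBS_PORT = 15001  # the port the reply above says pbs should map to


def lookup_service_port(name, proto="tcp"):
    """Return the port registered for a service name, or None if there is no entry."""
    try:
        return socket.getservbyname(name, proto)
    except OSError:
        # Raised when the service/protocol pair is not in the services database.
        return None


def pbs_entry_is_sane(port):
    """True when there is no services entry at all, or it maps to 15001."""
    return port is None or port == EXPECTED_PBS_PORT


# Example, run on a compute node:
#   pbs_entry_is_sane(lookup_service_port("pbs"))
```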
>
>
>  
>
>>There are no unusual messages in the server_log. Following is other info;
>>let me know if there is anything else that is needed. This 65-node cluster
>>of P4 computers has been running fine for 3 years. I had issues on the
>>master node and ended up doing a fresh install on it, going from RH9 to
>>Fedora Core 4. Now I have my problem.  Thanks for the help.   -Tom Combs
>>
>>*** Version & build:
>>torque-2.0.0p8
>>configure --prefix=/usr/local/torque --set-server-home=/opt/torque --with-scp
>>   --set-default-server=cmt --enable-docs
>>Fedora Core 4 on master, Red Hat 9 on compute nodes
>>    
>>
>
>This was built separately on both OSes, right?  You did an FC4 build and
>a RH9 build?  I don't know if that is a factor, but it seems like the
>right thing to do.
>
>Please verify that /opt/torque/server_name contains "cmt" on the server
>and all nodes.  Please verify the nodes don't have $clienthost,
>$pbsserverhost, $restricted, etc. in /opt/torque/mom_priv/config.
>
>Does /opt/torque/server_priv/torque.cfg have anything on the server
>host?
>
> 
>  
>
>>*** iptables and selinux off on both the master and mom node
>>*** hostbased auth working for users both ways between the master and mom node.
>>    
>>
>
>iptables is off.  Ok.  The problem is the port number.
>
> 
>  
>
>>*** momctl -d 3    from the node (node-2):
>>[root@node-2 sbin]# ./momctl -d 3
>>Host: node-2/node-2   Version: 2.0.0p8
>>Server[0]: cmt (connection is active)
>> Init Msgs Received:     1 hellos/1 cluster-addrs
>> Init Msgs Sent:         42 hellos
>> Last Msg From Server:   8 seconds (StatusJob)
>> Last Msg To Server:     40 seconds
>>PID:                    14719
>>HomeDirectory:          /opt/torque/mom_priv
>>MOM active:             59013 seconds
>>Server Update Interval: 45 seconds
>>LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
>>    
>>
>
>Might want to bump the loglevel, you might see something more.
>
>
>  
>
>>Communication Model:    RPP
>>TCP Timeout:            20 seconds
>>NOTE:  no prolog configured
>>Alarm Time:             0 of 10 seconds
>>Trusted Client List:    192.0.0.10,192.0.0.12,127.0.0.1
>>    
>>
>
>Your client-list is still empty.  Did 'pbsnodes -r node-2' or restarting
>pbs_server fix that?
>
>
>  
>
>>****  netstat on master (abbreviated):
>>[root@cmt init.d]# netstat -ntlp
>>Active Internet connections (only servers)
>>Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
>>tcp        0      0 0.0.0.0:15001               0.0.0.0:*                   LISTEN      16205/pbs_server
>>tcp        0      0 146.201.237.110:15004       0.0.0.0:*                   LISTEN      16201/pbs_sched
>>    
>>
>
>pbs_server is on the right port.  Odd that MOMs are trying to use the
>wrong port.
>


-- 
Tom Combs                                  E-mail: combs at magnet.fsu.edu
National High Magnetic Field Laboratory    Phone: (850) 644-1657
1800 E. Paul Dirac Drive                   Tallahassee, FL 32310
