[torqueusers] Jobs not terminating
Tom Combs
combs at magnet.fsu.edu
Mon Apr 3 15:21:35 MDT 2006
Garrick,

When I run pbs_mom -S 15001 on the nodes, the problem goes away. I really
don't know why the mom was trying to use 15000, because I certainly didn't
tell it to. Thanks for your help.

--Tom Combs
Garrick Staples wrote:
>On Fri, Mar 31, 2006 at 09:50:42AM -0500, Tom Combs alleged:
>
>
>>Garrick, Thanks for your reply. I'm going to start over with my current
>>status, thus no included responses....
>>
>>Here's the problem: I can submit jobs, and they run with all the
>>user-specified tasks being performed, but then hang at job termination.
>>The out and in files remain in the spool directory and qstat shows the
>>job as running. The owner of the job has no processes running on the mom
>>node. The repeated error in the mom_log is:
>>
>>pbs_mom;Req;jobobit;No contact with server at hostaddr c000000a,
>> port 15000,jobid 0.cmt errno 111
>>
>>
>
>Translating c000000a to a dotted-quad, that's 192.0.0.10. So this isn't
>the "localhost" problem described earlier.
>
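For anyone decoding a similar hostaddr value from a mom_log line: the hex
string is read here as the IPv4 address in network byte order, one byte per
pair of hex digits. A minimal Python sketch of the conversion follows (the
function name hex_to_dotted_quad is made up for illustration and is not part
of TORQUE):

```python
import socket


def hex_to_dotted_quad(hexaddr: str) -> str:
    """Convert an 8-digit hex IPv4 address (network byte order) to dotted-quad.

    hex_to_dotted_quad is a hypothetical helper name, not a TORQUE function.
    """
    # bytes.fromhex turns "c000000a" into b"\xc0\x00\x00\x0a";
    # inet_ntoa formats those four bytes as "a.b.c.d".
    return socket.inet_ntoa(bytes.fromhex(hexaddr))


print(hex_to_dotted_quad("c000000a"))  # prints 192.0.0.10
```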
>errno 111 is "connection refused". Can the nodes reach port 15000 on
>192.0.0.10? Try running 'qstat' or 'qmgr' from the node. FC4 has port
>filtering rules by default. 'service iptables stop' on 192.0.0.10 might
>fix this.
>
>
>However, on closer inspection, port 15000 looks wrong. That should be
>15001. Is pbs_mom started with -S? Run 'getent services pbs/tcp' on a
>node; it should print nothing or 15001.
>
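A rough way to repeat both checks from a node without the TORQUE client
tools is a services lookup plus a plain TCP connect. The sketch below uses
only the Python standard library; the helper names lookup_service_port and
can_connect are hypothetical, and a successful connect only shows that
something accepts connections on that port, not that it is pbs_server.

```python
import socket


def lookup_service_port(name="pbs", proto="tcp"):
    """Return the port registered for a service name, or None if unregistered.

    Roughly the same question as 'getent services pbs/tcp'.
    """
    try:
        return socket.getservbyname(name, proto)
    except OSError:
        # No matching entry in the services database.
        return None


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" (errno 111 on Linux), timeouts,
        # and unreachable hosts.
        return False


def report(host, ports=(15000, 15001)):
    """Print the registered pbs port and probe each candidate port."""
    print("services entry for pbs/tcp:", lookup_service_port())
    for port in ports:
        state = "open" if can_connect(host, port) else "not reachable"
        print(f"{host}:{port} {state}")
```

Calling report("192.0.0.10") on a node would probe the two ports discussed
in this thread; the address and port numbers are the ones quoted above.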
>
>
>
>>There are no unusual messages in the server_log. Following is other info;
>>let me know if there is anything else that is needed. This 65-node cluster
>>of P4 computers has been running fine for 3 years. I had issues on the
>>master node and ended up doing a fresh install on it, going from RH9 to
>>Fedora Core 4. Now I have my problem. Thanks for the help. -Tom Combs
>>
>>*** Version & build:
>>torque-2.0.0p8
>>configure --prefix=/usr/local/torque --set-server-home=/opt/torque --with-scp --set-default-server=cmt --enable-docs
>>Fedora Core 4 on master, Red Hat 9 on compute nodes
>>
>>
>
>This was built separately on both OSes, right? You did an FC4 build and
>a RH9 build? I don't know if that is a factor, but it seems like the
>right thing to do.
>
>Please verify that /opt/torque/server_name contains "cmt" on the server
>and all nodes. Please verify the nodes don't have $clienthost,
>$pbsserverhost, $restricted, etc. in /opt/torque/mom_priv/config.
>
>Does /opt/torque/server_priv/torque.cfg have anything on the server
>host?
>
>
>
>
>>*** iptables and selinux off on both the master and mom node
>>*** hostbased auth working for users both ways between the master and mom node.
>>
>>
>
>iptables is off. Ok. The problem is the port number.
>
>
>
>
>>*** momctl -d 3 from the node (node-2):
>>[root@node-2 sbin]# ./momctl -d 3
>>Host: node-2/node-2 Version: 2.0.0p8
>>Server[0]: cmt (connection is active)
>> Init Msgs Received: 1 hellos/1 cluster-addrs
>> Init Msgs Sent: 42 hellos
>> Last Msg From Server: 8 seconds (StatusJob)
>> Last Msg To Server: 40 seconds
>>PID: 14719
>>HomeDirectory: /opt/torque/mom_priv
>>MOM active: 59013 seconds
>>Server Update Interval: 45 seconds
>>LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
>>
>>
>
>Might want to bump the loglevel, you might see something more.
>
>
>
>
>>Communication Model: RPP
>>TCP Timeout: 20 seconds
>>NOTE: no prolog configured
>>Alarm Time: 0 of 10 seconds
>>Trusted Client List: 192.0.0.10,192.0.0.12,127.0.0.1
>>
>>
>
>Your client-list is still empty. Did 'pbsnodes -r node-2' or restarting
>pbs_server fix that?
>
>
>
>
>>**** netstat on master (abbreviated):
>>[root@cmt init.d]# netstat -ntlp
>>Active Internet connections (only servers)
>>Proto Recv-Q Send-Q Local Address          Foreign Address  State   PID/Program name
>>tcp   0      0      0.0.0.0:15001          0.0.0.0:*        LISTEN  16205/pbs_server
>>tcp   0      0      146.201.237.110:15004  0.0.0.0:*        LISTEN  16201/pbs_sched
>>
>>
>
>pbs_server is on the right port. Odd that MOMs are trying to use the
>wrong port.
>
>
>
--
Tom Combs E-mail: combs at magnet.fsu.edu
National High Magnetic Field Laboratory Phone: (850) 644-1657
1800 E. Paul Dirac Drive Tallahassee, FL 32310