[torqueusers] Jobs not terminating
Tom Combs
combs at magnet.fsu.edu
Fri Mar 31 07:50:42 MST 2006
Garrick, thanks for your reply. I'm going to start over with my current
status, so no earlier responses are included.

Here's the problem: I can submit jobs, and they run with all the
user-specified tasks being performed, but then hang at job termination.
The out and in files remain in the spool directory and qstat shows the job
as running. The owner of the job has no processes running on the mom node.
The repeated error in the mom_log is:
pbs_mom;Req;jobobit;No contact with server at hostaddr c000000a, port 15000,jobid 0.cmt errno 111
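
For readers decoding that log line, here is a minimal sketch in Python. It
assumes the hostaddr field is an IPv4 address printed as a hex integer with
the most significant byte first, and that the errno is a Linux errno value.
The helper names decode_hostaddr and describe_errno are made up for this
illustration and are not part of TORQUE.

```python
import errno
import ipaddress
import os


def decode_hostaddr(hex_addr: str) -> str:
    """Interpret a hex hostaddr (assumed most-significant byte first) as dotted-quad IPv4."""
    return str(ipaddress.IPv4Address(int(hex_addr, 16)))


def describe_errno(code: int) -> str:
    """Return the symbolic name and message for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# Values taken from the mom_log line above.
print(decode_hostaddr("c000000a"))
# errno numbering is platform specific; on Linux, 111 is ECONNREFUSED.
print(describe_errno(111))
```

Under those assumptions, c000000a decodes to 192.0.0.10, which also appears
in the mom's Trusted Client List in the momctl output. On Linux, errno 111 is
ECONNREFUSED, meaning the TCP connection attempt to that address and port was
actively refused.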
There are no unusual messages in the server_log. Other info follows; let
me know if anything else is needed. This 65-node cluster of P4 computers
has been running fine for 3 years. I had issues on the master node and
ended up doing a fresh install on it, going from RH9 to Fedora Core 4. Now I
have my problem. Thanks for the help. -Tom Combs
*** Version & build:
torque-2.0.0p8
configure --prefix=/usr/local/torque --set-server-home=/opt/torque --with-scp --set-default-server=cmt --enable-docs
Fedora Core 4 on master, Red Hat 9 on compute nodes
*** iptables and SELinux are off on both the master and the mom node
*** host-based auth is working for users both ways between the master and the mom node
*** momctl -d 3 from the node (node-2):
[root@node-2 sbin]# ./momctl -d 3
Host: node-2/node-2 Version: 2.0.0p8
Server[0]: cmt (connection is active)
Init Msgs Received: 1 hellos/1 cluster-addrs
Init Msgs Sent: 42 hellos
Last Msg From Server: 8 seconds (StatusJob)
Last Msg To Server: 40 seconds
PID: 14719
HomeDirectory: /opt/torque/mom_priv
MOM active: 59013 seconds
Server Update Interval: 45 seconds
LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
TCP Timeout: 20 seconds
NOTE: no prolog configured
Alarm Time: 0 of 10 seconds
Trusted Client List: 192.0.0.10,192.0.0.12,127.0.0.1
Job[0.cmt] State=EXITING
Assigned CPU Count: 1
diagnostics complete
*** momctl -h node-2 -d 3 from the master node (cmt):
[root@cmt init.d]# momctl -h node-2 -d 3
Host: node-2/node-2 Version: 2.0.0p8
Server[0]: cmt (connection is active)
Init Msgs Received: 1 hellos/1 cluster-addrs
Init Msgs Sent: 42 hellos
Last Msg From Server: 30 seconds (StatusJob)
Last Msg To Server: 10 seconds
PID: 14719
HomeDirectory: /opt/torque/mom_priv
MOM active: 59305 seconds
Server Update Interval: 45 seconds
LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
TCP Timeout: 20 seconds
NOTE: no prolog configured
Alarm Time: 0 of 10 seconds
Trusted Client List: 192.0.0.10,192.0.0.12,127.0.0.1
Job[0.cmt] State=EXITING
Assigned CPU Count: 1
diagnostics complete
*** netstat on master (abbreviated):
[root@cmt init.d]# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address          Foreign Address  State   PID/Program name
tcp        0      0 0.0.0.0:15001          0.0.0.0:*        LISTEN  16205/pbs_server
tcp        0      0 146.201.237.110:15004  0.0.0.0:*        LISTEN  16201/pbs_sched
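
The mom_log error refers to port 15000, while the abbreviated listing above
shows pbs_server listening on 15001. Since the listing is abbreviated, that
alone is not conclusive. One way to check from the mom node whether a given
port on the master accepts TCP connections is a small probe like the sketch
below. It uses only the Python standard library; probe_tcp and report are
hypothetical helper names, and the default timeout is an arbitrary choice.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> int:
    """Try a TCP connect; return 0 on success, otherwise the errno of the failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns an error code instead of raising for connect failures.
        return sock.connect_ex((host, port))


def report(host: str, port: int) -> str:
    """Summarize the probe result as one line of text."""
    try:
        code = probe_tcp(host, port)
    except OSError as exc:
        # Name resolution problems and similar errors are raised, not returned.
        return f"{host}:{port} could not be probed: {exc}"
    if code == 0:
        return f"{host}:{port} accepted the connection"
    return f"{host}:{port} failed: {errno.errorcode.get(code, code)}"
```

Run on node-2, print(report("cmt", 15000)) and print(report("cmt", 15001))
would show whether either port accepts a connection. A result of
ECONNREFUSED on 15000 would match the errno 111 in the mom_log.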
--
Tom Combs E-mail: combs at magnet.fsu.edu
National High Magnetic Field Laboratory Phone: (850) 644-1657
1800 E. Paul Dirac Drive Tallahassee, FL 32310