[torqueusers] Jobs not terminating

Tom Combs combs at magnet.fsu.edu
Fri Mar 31 07:50:42 MST 2006


Garrick, thanks for your reply. I'm going to start over with my current
status, so no earlier responses are quoted here.

Here's the problem: I can submit jobs, and they run with all the
user-specified tasks being performed, but then hang at job termination.
The out and in files remain in the spool directory and qstat shows the job
as running. The owner of the job has no processes running on the mom node.
The repeated error in the mom_log is:

 pbs_mom;Req;jobobit;No contact with server at hostaddr c000000a,
   port 15000,jobid 0.cmt errno 111
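
As a reading aid for the error above: the hostaddr field appears to be an
IPv4 address printed in hex, and on Linux errno 111 is ECONNREFUSED
(connection refused). A minimal Python sketch, assuming the hex value is in
network byte order; the helper name decode_hostaddr is only illustrative and
is not part of TORQUE:

```python
import errno
import socket


def decode_hostaddr(hex_addr: str) -> str:
    """Convert an 8-digit hex IPv4 address (assumed network byte order)
    into dotted-quad notation."""
    return socket.inet_ntoa(bytes.fromhex(hex_addr))


# Address taken from the mom_log line above.
print(decode_hostaddr("c000000a"))

# The symbolic name for an errno value is platform dependent;
# on Linux, 111 maps to ECONNREFUSED.
print(errno.errorcode.get(111, "unknown on this platform"))
```

Decoded this way, c000000a is 192.0.0.10, which also appears in the Trusted
Client List in the momctl output below.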

There are no unusual messages in the server_log. Other info follows; let
me know if anything else is needed. This 65-node cluster of P4 computers
has been running fine for 3 years. I had issues on the master node and
ended up doing a fresh install on it, going from RH9 to Fedora Core 4.
Since then I have had this problem.  Thanks for the help.   -Tom Combs

*** Version & build:
 torque-2.0.0p8
 configure --prefix=/usr/local/torque --set-server-home=/opt/torque --with-scp
     --set-default-server=cmt --enable-docs
 Fedora Core 4 on master, Red Hat 9 on compute nodes

*** iptables and selinux off on both the master and mom node
*** hostbased auth working for users both ways between the master and mom node.

*** momctl -d 3 from the node (node-2):
[root@node-2 sbin]# ./momctl -d 3
Host: node-2/node-2   Version: 2.0.0p8
Server[0]: cmt (connection is active)
  Init Msgs Received:     1 hellos/1 cluster-addrs
  Init Msgs Sent:         42 hellos
  Last Msg From Server:   8 seconds (StatusJob)
  Last Msg To Server:     40 seconds
PID:                    14719
HomeDirectory:          /opt/torque/mom_priv
MOM active:             59013 seconds
Server Update Interval: 45 seconds
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
TCP Timeout:            20 seconds
NOTE:  no prolog configured
Alarm Time:             0 of 10 seconds
Trusted Client List:    192.0.0.10,192.0.0.12,127.0.0.1
Job[0.cmt]  State=EXITING
Assigned CPU Count:     1
diagnostics complete


*** momctl -h node-2 -d 3 from the master node (cmt):
[root@cmt init.d]# momctl -h node-2 -d 3
Host: node-2/node-2   Version: 2.0.0p8
Server[0]: cmt (connection is active)
  Init Msgs Received:     1 hellos/1 cluster-addrs
  Init Msgs Sent:         42 hellos
  Last Msg From Server:   30 seconds (StatusJob)
  Last Msg To Server:     10 seconds
PID:                    14719
HomeDirectory:          /opt/torque/mom_priv
MOM active:             59305 seconds
Server Update Interval: 45 seconds
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
TCP Timeout:            20 seconds
NOTE:  no prolog configured
Alarm Time:             0 of 10 seconds
Trusted Client List:    192.0.0.10,192.0.0.12,127.0.0.1
Job[0.cmt]  State=EXITING
Assigned CPU Count:     1
diagnostics complete
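
For anyone who wants to check many nodes for jobs stuck like this, the
Job[...] State=... lines in the momctl output can be pulled out with a few
lines of Python. This is only a sketch: it assumes the line format shown
above, and the function name job_states is made up for illustration.

```python
import re

# Matches lines such as "Job[0.cmt]  State=EXITING" (format as seen above).
JOB_LINE = re.compile(r"^Job\[(?P<jobid>[^\]]+)\]\s+State=(?P<state>\w+)")


def job_states(momctl_output: str) -> list[tuple[str, str]]:
    """Return (jobid, state) pairs found in momctl diagnostic text."""
    pairs = []
    for line in momctl_output.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:
            pairs.append((match.group("jobid"), match.group("state")))
    return pairs


# A few lines copied from the momctl output above.
sample = """\
Trusted Client List:    192.0.0.10,192.0.0.12,127.0.0.1
Job[0.cmt]  State=EXITING
Assigned CPU Count:     1
"""
print(job_states(sample))
```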

*** netstat on master (abbreviated):
[root@cmt init.d]# netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
tcp        0      0 0.0.0.0:15001               0.0.0.0:*                   LISTEN      16205/pbs_server
tcp        0      0 146.201.237.110:15004       0.0.0.0:*                   LISTEN      16201/pbs_sched
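
The mom_log error names port 15000, while the listing above shows
pbs_server on 15001, so a connection test from the mom node against both
ports may help narrow things down. A minimal sketch: the helper name
accepts_tcp is hypothetical, and the host name cmt and the two port numbers
are simply the ones that appear in the output above.

```python
import socket


def accepts_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    Any failure (refused, timed out, name not resolved) returns False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example, run from the mom node (results not shown here):
#   accepts_tcp("cmt", 15000)
#   accepts_tcp("cmt", 15001)
```

A False result for a port is consistent with a "connection refused" error,
but it does not by itself say why nothing is listening there.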


-- 
Tom Combs                                  E-mail: combs at magnet.fsu.edu
National High Magnetic Field Laboratory    Phone: (850) 644-1657
1800 E. Paul Dirac Drive                   Tallahassee, FL 32310


