[torqueusers] Jobs not terminating

Tom Combs combs at magnet.fsu.edu
Thu Mar 30 06:57:31 MST 2006


Garrick Staples wrote:

>On Wed, Mar 29, 2006 at 07:50:56PM +0300, Hristo Iliev alleged:
>
>>On Wed, 2006-03-29 at 11:05 -0500, Tom Combs wrote:
>>
>>>Hi,  I just upgraded to torque-2.0.0.p8 and now jobs do not terminate nor
>>>can they be qdel'd.  In the mom_logs on the nodes, I have the following:
>>>
>>> pbs_mom;Req;jobobit;No contact with server at hostaddr c000000a, port 15000
>>>
>>>I have hostbased authentication working for all users between the master
>>>node and compute nodes, in both directions, but that doesn't appear to be
>>>the issue. Jobs go into execution and seem to run just fine; it's just
>>>that the PBS job never terminates.
>>>
>>>Does anyone know what my problem could be?
>>>
>>>TIA,  Tom Combs
>>>
>>Hi.
>>
>>Recently we experienced the same problem after moving to 2.0.0p8, and the
>>reason turned out to be a poorly set up /etc/hosts file. On each node, the
>>node's hostname first appeared on the line where localhost (127.0.0.1)
>>was. Strangely enough, this setup worked quite well with Torque
>>1.2.0p6.
>
>Interesting problem.  That would cause pbs_server to advertise itself as
>localhost.
>
>pbs_server tells pbs_mom "Hi, here's a job from localhost, let me know
>when it is done."
>pbs_mom dutifully runs the job, sending status updates to all of its
>configured servers.
>When the job exits, pbs_mom attempts to send the jobobit to localhost.
>

My hosts file looks correct, so this does not appear to be the issue. Here is a sample:
# Do not remove the following line, or various programs
# that require network functionality will fail.

127.0.0.1       localhost
192.0.0.10      cmt node-0       # server_name is cmt - this comment is not part of hosts file
192.0.0.11      node-1
192.0.0.12      node-2
192.0.0.13      node-3
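
For anyone who wants to test for the setup Hristo describes (a real hostname sitting on the 127.0.0.1 line), here is a minimal sketch. The function name loopback_names and the BAD sample are made up for illustration; only SAMPLE is taken from the hosts file above.

```python
import ipaddress

def loopback_names(hosts_text):
    """Return names mapped to a loopback address, ignoring localhost aliases."""
    names = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and padding
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue                           # not an address line, skip it
        if addr.is_loopback:
            names.extend(n for n in fields[1:] if not n.startswith("localhost"))
    return names

# Taken from the hosts file shown above.
SAMPLE = """\
127.0.0.1       localhost
192.0.0.10      cmt node-0
192.0.0.11      node-1
"""

# Hypothetical broken layout: the node's own name on the loopback line.
BAD = """\
127.0.0.1       node-1 localhost
192.0.0.11      node-1
"""

print(loopback_names(SAMPLE))  # []
print(loopback_names(BAD))     # ['node-1']
```

An empty list for the real file is consistent with the hosts file not being the culprit here, although it says nothing about DNS or other name services.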

Here is a momctl from one of the nodes:
[root@node-55 sbin]# ./momctl -d 3

Host: node-55/node-55   Version: 2.0.0p8
Server[0]: cmt (connection is active)
  WARNING:  no hello/cluster-addrs messages received from server
  Init Msgs Sent:         7053 hellos
  Last Msg From Server:   69831 seconds (StatusJob)
  Last Msg To Server:     7 seconds
PID:                    4228
HomeDirectory:          /opt/torque/mom_priv
MOM active:             70632 seconds
Server Update Interval: 45 seconds
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
TCP Timeout:            20 seconds
NOTE:  no prolog configured
Alarm Time:             0 of 10 seconds
Trusted Client List:    192.0.0.10,192.0.0.65,127.0.0.1
Job[57.cmt]  State=EXITING
Assigned CPU Count:     1

diagnostics complete
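
As a cross-check, the hostaddr in the mom_logs line quoted earlier (c000000a) can be decoded, assuming it is an IPv4 address printed as hex in network byte order. That format is my assumption, not something I have confirmed in the Torque source; decode_hostaddr is just a helper name for this sketch.

```python
import ipaddress

def decode_hostaddr(hex_addr):
    """Interpret an 8-digit hex string as a big-endian IPv4 address."""
    return str(ipaddress.IPv4Address(int(hex_addr, 16)))

print(decode_hostaddr("c000000a"))  # 192.0.0.10
print(decode_hostaddr("7f000001"))  # 127.0.0.1
```

If that reading is right, pbs_mom is trying to reach 192.0.0.10 (cmt) rather than 127.0.0.1, which would match the hosts file and the Trusted Client List above.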


 Thanks for the help, you people are great.     --Tom Combs





-- 
Tom Combs                                  E-mail: combs at magnet.fsu.edu
National High Magnetic Field Laboratory    Phone: (850) 644-1657
1800 E. Paul Dirac Drive                   Tallahassee, FL 32310
