[torqueusers] Jobs not terminating
Tom Combs
combs at magnet.fsu.edu
Thu Mar 30 06:57:31 MST 2006
Garrick Staples wrote:
>On Wed, Mar 29, 2006 at 07:50:56PM +0300, Hristo Iliev alleged:
>
>
>>On Wed, 2006-03-29 at 11:05 -0500, Tom Combs wrote:
>>
>>
>>>Hi, I just upgraded to torque-2.0.0.p8 and now jobs do not terminate nor
>>>can they be qdel'd. In the mom_logs on the nodes, I have the following:
>>>
>>> pbs_mom;Req;jobobit;No contact with server at hostaddr c000000a, port 15000
>>>
>>>I have hostbased authentication working for all users between the master
>>>node and compute nodes - in both directions but that doesn't appear to be
>>>the issue. Jobs go into execution and seem to run just fine, it's just
>>>the pbs job never terminates.
>>>
>>>Does anyone know what my problem could be?
>>>
>>>TIA, Tom Combs
>>>
>>>
>>>
>>Hi.
>>
>>Recently we experienced the same problem after moving to 2.0.0p8, and the
>>reason turned out to be a poorly set up /etc/hosts file. On each node, the
>>node's hostname first appeared on the line where localhost (127.0.0.1)
>>was. Strangely enough, this setup worked quite well with Torque
>>1.2.0p6.
>>
>>
>
>Interesting problem. That would cause pbs_server to advertise itself as
>localhost.
>
>pbs_server tells pbs_mom "Hi, here's a job from localhost, let me know
>when it is done."
>pbs_mom dutifully runs the job, sending status updates to all of its
>configured servers.
>When the job exits, pbs_mom attempts to send the jobobit to localhost.
>
>
>
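The failure mode described above can be checked mechanically. The following is a minimal sketch, not part of TORQUE: it scans hosts-file text and reports any name, other than the usual localhost aliases, that is mapped to a loopback address. The function name `loopback_aliases` and the set of ignored aliases are assumptions made for illustration.

```python
import ipaddress

# Names that are normally expected on a loopback line (an assumption;
# adjust this set for the local distribution).
EXPECTED = {
    "localhost", "localhost.localdomain",
    "localhost4", "localhost6",
    "ip6-localhost", "ip6-loopback",
}

def loopback_aliases(hosts_text):
    """Return names mapped to a loopback address, excluding EXPECTED ones."""
    suspicious = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # skip lines that do not start with an IP address
        if addr.is_loopback:
            suspicious.extend(n for n in fields[1:] if n not in EXPECTED)
    return suspicious
```

Calling `loopback_aliases(open("/etc/hosts").read())` on the server and on each node should return an empty list if no real hostname shares a line with localhost.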
My hosts file looks correct, so this does not appear to be the issue. Here
is a sample:
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost
192.0.0.10 cmt node-0 # server_name is cmt (this comment is not part of the hosts file)
192.0.0.11 node-1
192.0.0.12 node-2
192.0.0.13 node-3
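The hostaddr value in the mom_log line quoted earlier is written in hexadecimal. A minimal sketch for decoding it follows, assuming the value is a 32-bit IPv4 address whose most significant byte is the first octet; `decode_hostaddr` is an illustrative name, not a TORQUE function.

```python
import ipaddress

def decode_hostaddr(hex_addr):
    """Convert a hex string such as 'c000000a' to dotted-quad notation.

    Assumes the most significant byte is the first octet.
    """
    return str(ipaddress.IPv4Address(int(hex_addr, 16)))

print(decode_hostaddr("c000000a"))  # 192.0.0.10
```

Under that assumption, c000000a decodes to 192.0.0.10, the address listed for cmt in the hosts file above, which would suggest pbs_mom is trying to reach the server's real address rather than 127.0.0.1.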
Here is a momctl from one of the nodes:
[root@node-55 sbin]# ./momctl -d 3
Host: node-55/node-55 Version: 2.0.0p8
Server[0]: cmt (connection is active)
WARNING: no hello/cluster-addrs messages received from server
Init Msgs Sent: 7053 hellos
Last Msg From Server: 69831 seconds (StatusJob)
Last Msg To Server: 7 seconds
PID: 4228
HomeDirectory: /opt/torque/mom_priv
MOM active: 70632 seconds
Server Update Interval: 45 seconds
LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
TCP Timeout: 20 seconds
NOTE: no prolog configured
Alarm Time: 0 of 10 seconds
Trusted Client List: 192.0.0.10,192.0.0.65,127.0.0.1
Job[57.cmt] State=EXITING
Assigned CPU Count: 1
diagnostics complete
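When comparing many nodes, the diagnostics above can be scanned for the lines that matter here. This is a minimal sketch that assumes the output format shown in this message (other TORQUE versions may print it differently); `summarize_momctl` is a hypothetical helper, not part of momctl.

```python
import re

def summarize_momctl(output):
    """Collect WARNING lines and job states from `momctl -d 3` text."""
    warnings = []
    jobs = {}
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("WARNING:"):
            warnings.append(line[len("WARNING:"):].strip())
        # Matches lines such as "Job[57.cmt] State=EXITING"
        m = re.match(r"Job\[([^\]]+)\]\s+State=(\S+)", line)
        if m:
            jobs[m.group(1)] = m.group(2)
    return warnings, jobs

sample = """Server[0]: cmt (connection is active)
WARNING: no hello/cluster-addrs messages received from server
Job[57.cmt] State=EXITING"""
print(summarize_momctl(sample))
```

A node that reports both a missing-hello warning and a job stuck in EXITING matches the symptom described in this thread.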
Thanks for the help, you people are great. --Tom Combs
--
Tom Combs E-mail: combs at magnet.fsu.edu
National High Magnetic Field Laboratory Phone: (850) 644-1657
1800 E. Paul Dirac Drive Tallahassee, FL 32310