[torqueusers] Jobs not terminating
Garrick Staples
garrick at usc.edu
Wed Mar 29 17:52:35 MST 2006
On Wed, Mar 29, 2006 at 07:50:56PM +0300, Hristo Iliev alleged:
> On Wed, 2006-03-29 at 11:05 -0500, Tom Combs wrote:
> > Hi, I just upgraded to torque-2.0.0.p8 and now jobs do not terminate nor
> > can they be qdel'd. In the mom_logs on the nodes, I have the following:
> >
> > pbs_mom;Req;jobobit;No contact with server at hostaddr c000000a, port 15000
> >
> > I have hostbased authentication working for all users between the master
> > node and compute nodes - in both directions but that doesn't appear to be
> > the issue. Jobs go into execution and seem to run just fine, it's just the
> > pbs job never terminates.
> >
> > Does anyone know what my problem could be?
> >
> > TIA, Tom Combs
> >
>
> Hi.
>
> Recently we experienced the same problem after moving to 2.0.0p8 and the
> reason turned out to be a poorly set up /etc/hosts file. On each node the
> node's hostname first appeared on the line where localhost (127.0.0.1)
> was. Strangely enough, this setup worked quite well with Torque
> 1.2.0p6.
Interesting problem. That would cause pbs_server to advertise itself as
localhost.
pbs_server tells pbs_mom "Hi, here's a job from localhost, let me know
when it is done."
pbs_mom dutifully runs the job, sending status updates to all of its
configured servers.
When the job exits, pbs_mom attempts to send the jobobit to localhost.
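
For anyone who wants to check their nodes for this, here is a rough sketch in Python (not part of Torque) that looks for the /etc/hosts layout described above, where the machine's own hostname sits on a loopback line. The function names loopback_names and hostname_on_loopback are made up for illustration, and the sample host names and addresses are assumptions, not taken from the original report.

```python
import ipaddress


def loopback_names(hosts_text):
    """Return every name that the given hosts-file text maps to a loopback address."""
    names = set()
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            # Not a parseable address; skip the line.
            continue
        if addr.is_loopback:
            names.update(fields[1:])
    return names


def hostname_on_loopback(hosts_text, hostname):
    """True if hostname (or its short form) is listed on a loopback line."""
    short = hostname.split(".", 1)[0]
    names = loopback_names(hosts_text)
    return hostname in names or short in names


# Hypothetical hosts files: the first shows the problematic layout.
bad_hosts = "127.0.0.1 node01 localhost.localdomain localhost\n10.0.0.11 node01\n"
good_hosts = "127.0.0.1 localhost.localdomain localhost\n10.0.0.11 node01\n"

print(hostname_on_loopback(bad_hosts, "node01"))
print(hostname_on_loopback(good_hosts, "node01"))
```

To try it on a real machine, pass in the contents of /etc/hosts and the result of socket.gethostname(). A True result only means the hostname appears on a loopback line; whether that is what breaks a given Torque install is something to confirm against the mom and server logs.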
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California