[torqueusers] Jobs not terminating

Garrick Staples garrick at usc.edu
Wed Mar 29 17:52:35 MST 2006


On Wed, Mar 29, 2006 at 07:50:56PM +0300, Hristo Iliev alleged:
> On Wed, 2006-03-29 at 11:05 -0500, Tom Combs wrote:
> > Hi,  I just upgraded to torque-2.0.0.p8 and now jobs do not terminate nor
> > can they be qdel'd.  In the mom_logs on the nodes, I have the following:
> > 
> >  pbs_mom;Req;jobobit;No contact with server at hostaddr c000000a, port 15000
> > 
> > I have hostbased authentication working for all users between the master
> > node and compute nodes - in both directions but that doesn't appear to be
> > the issue. Jobs go into execution and seem to run just fine, it's just the
> > pbs job never terminates.
> > 
> > Does anyone know what my problem could be?
> > 
> > TIA,  Tom Combs
> > 
> 
> Hi.
> 
> Recently we experienced the same problem after moving to 2.0.0p8 and the
> reason turned out to be a poorly set up /etc/hosts file. On each node the
> node's hostname first appeared on the line where localhost (127.0.0.1)
> was. Strangely enough, this setup worked quite well with Torque
> 1.2.0p6.

Interesting problem.  That would cause pbs_server to advertise itself as
localhost.

pbs_server tells pbs_mom "Hi, here's a job from localhost, let me know
when it is done."
pbs_mom dutifully runs the job, sending status updates to all of its
configured servers.
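When the job exits, pbs_mom attempts to send the jobobit to localhost.
On a compute node, localhost is the node itself, so the obit presumably
never reaches pbs_server and the job never terminates.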
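
A quick way to look for this kind of misconfiguration is to check whether
a machine's own hostname sits on a loopback line in its hosts file. The
Python below is only a rough sketch of such a check and is not part of
TORQUE. The function names, the sample host names (node01, master) and the
10.0.0.x addresses are made up for illustration.

```python
import ipaddress


def loopback_names(hosts_text):
    """Return every name that hosts-format text maps to a loopback address."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address; skip the line
        if addr.is_loopback:
            names.update(fields[1:])
    return names


def hostname_on_loopback(hosts_text, hostname):
    """True if hostname (or its short form) is listed on a loopback line."""
    short = hostname.split(".", 1)[0]
    names = loopback_names(hosts_text)
    return hostname in names or short in names


# Hypothetical hosts files: BAD puts the node's own name on the 127.0.0.1
# line, GOOD gives it a separate line with a routable address.
BAD = """\
127.0.0.1   node01 localhost.localdomain localhost
10.0.0.10   master
"""

GOOD = """\
127.0.0.1   localhost.localdomain localhost
10.0.0.1    node01
10.0.0.10   master
"""

print(hostname_on_loopback(BAD, "node01"))
print(hostname_on_loopback(GOOD, "node01"))
```

To run it against a real machine, pass the contents of /etc/hosts and the
value of socket.gethostname() instead of the sample strings. A True result
on the server or on a node would be worth a closer look.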

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California