[torqueusers] trouble getting started -- jobs stuck in queue

Garrick Staples garrick at usc.edu
Tue Feb 9 13:46:31 MST 2010


On Tue, Feb 09, 2010 at 11:45:33AM -0800, Jeff Anderson-Lee alleged:
> Garrick Staples wrote:
> > On Tue, Feb 09, 2010 at 10:24:12AM -0800, Jeff Anderson-Lee alleged:
> >   
> >> I'm sure it's something simple.  For instance, on which nodes am I 
> >> supposed to run pbs_mom and pbs_sched?  The head node?  The compute 
> >> nodes?  Both??
> >>     
> >
> > The first question I was going to ask is if you started the scheduler.
> >
> >
> > The headnode has pbs_server and a scheduler. Torque comes with a sample fifo
> > scheduler called pbs_sched that you can run. Most sites eventually move to the
> > Maui (free) or Moab (non-free) schedulers.
> >
> > The compute nodes run pbs_mom.
> >   
> Thanks.  It was not clear from the documentation and examples I'd found 
> so far what ran where.

This is covered in the first two paragraphs of the installation docs:
http://www.clusterresources.com/products/torque/docs/1.1installation.shtml
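
To make the layout concrete, here is a minimal shell sketch of the split
described above: pbs_server and pbs_sched on the head node, pbs_mom on the
compute nodes. The function names and the role labels "head" and "compute"
are made up for this example, and it assumes pgrep is available. It only
reports what is missing; it starts nothing.

```shell
#!/bin/sh
# Print the Torque daemons expected on a node of the given role.
# The role names ("head", "compute") are labels invented for this sketch.
daemons_for_role() {
    case "$1" in
        head)    echo "pbs_server pbs_sched" ;;
        compute) echo "pbs_mom" ;;
        *)       echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Print each expected daemon that is NOT currently running on this machine.
# Assumes pgrep exists; -x matches the process name exactly.
missing_daemons() {
    for d in $(daemons_for_role "$1"); do
        pgrep -x "$d" >/dev/null 2>&1 || echo "$d"
    done
}
```

For example, running "missing_daemons head" on the head node would print
pbs_sched if the scheduler was never started, which is the first thing to
check when jobs sit in the queue.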

 
> It seems as if at least part of the problem I'm having has to do with 
> ssh permissions.  Some time after I started the scheduler I got an 
> e-mail with an error message:
> 
>  > PBS Job Id: 0.FOO.berkeley.edu
>  > Job Name:   STDIN
>  > Exec host:  s105/0
>  > An error has occurred processing your job, see below.
>  > Post job file processing error; job 0.FOO.berkeley.edu on host s105/0
>  >
>  > Unable to copy file /var/spool/torque/spool/0.FOO.berkeley.edu.OU to jonah@FOO.berkeley.edu:/home/cs/jonah/STDIN.o0
>  > *** error from copy
>  > Host key verification failed.
>  > lost connection
>  > *** end error output
>  > Output retained on that host in: /var/spool/torque/undelivered/0.FOO.berkeley.edu.OU
>  >
>  > Unable to copy file /var/spool/torque/spool/0.FOO.berkeley.edu.ER to jonah@FOO.berkeley.edu:/home/cs/jonah/STDIN.e0
>  > *** error from copy
>  > Host key verification failed.
>  > lost connection
>  > *** end error output
>  > Output retained on that host in: /var/spool/torque/undelivered/0.FOO.berkeley.edu.ER
> 
> It seems I need to force the host keys into the known_hosts file, but 
> which known_hosts file?  The user's? Root's? The queue manager's?

As you can see in the error message, pbs_mom was running scp as user jonah,
so it is that user's known_hosts on the compute node that matters. Root never
needs to scp.

http://www.clusterresources.com/products/torque/docs/6.1scpsetup.shtml

Personally, though, I put the host keys in the system-wide
/etc/ssh/ssh_known_hosts so that users don't need to deal with it.
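
One way to populate that file is with ssh-keyscan. What follows is only a
sketch, not something taken from the Torque docs: add_host_keys is a made-up
helper name, and KEYSCAN is a hypothetical override variable added so the
function can be tried without touching the network. Keys fetched this way are
trusted blindly, so check the fingerprints before relying on them.

```shell
#!/bin/sh
# Append host keys for the given hosts to a known_hosts file, skipping
# key lines that are already present in it.
# Usage: add_host_keys FILE HOST...
# KEYSCAN is a hypothetical hook for this sketch; by default the real
# ssh-keyscan is run.
add_host_keys() {
    file=$1; shift
    touch "$file" || return 1
    "${KEYSCAN:-ssh-keyscan}" "$@" 2>/dev/null |
    while IFS= read -r line; do
        # -x: whole-line match, -F: fixed string, -q: quiet
        grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
    done
}
```

Going by the error above, you would run something like
"add_host_keys /etc/ssh/ssh_known_hosts FOO.berkeley.edu" as root on each
compute node (s105 here), since that is where scp runs. Then, as the job
owner on the compute node, "ssh -o BatchMode=yes FOO.berkeley.edu true" should
succeed without prompting. That also requires passwordless authentication to
be set up, which the scp setup page above covers.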

-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Life is Good!