[torqueusers] trouble getting started -- jobs stuck in queue
Garrick Staples
garrick at usc.edu
Tue Feb 9 13:46:31 MST 2010
On Tue, Feb 09, 2010 at 11:45:33AM -0800, Jeff Anderson-Lee alleged:
> Garrick Staples wrote:
> > On Tue, Feb 09, 2010 at 10:24:12AM -0800, Jeff Anderson-Lee alleged:
> >
> >> I'm sure it's something simple. For instance, on which nodes am I
> >> supposed to run pbs_mom and pbs_sched? The head node? The compute
> >> nodes? Both??
> >>
> >
> > The first question I was going to ask is if you started the scheduler.
> >
> >
> > The headnode has pbs_server and a scheduler. Torque comes with a sample fifo
> > scheduler called pbs_sched that you can run. Most sites eventually move to the
> > Maui (free) or Moab (non-free) schedulers.
> >
> > The compute nodes run pbs_mom.
> >
> Thanks. It was not clear from the documentation and examples I'd found
> so far what ran where.
See the first two paragraphs of the installation docs:
http://www.clusterresources.com/products/torque/docs/1.1installation.shtml
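
To make the split concrete, here is a minimal Python sketch of the layout described above: pbs_server plus a scheduler on the head node, pbs_mom on each compute node. The role names, the `missing_daemons` helper, and the use of `pgrep -x` to look for processes are assumptions made for illustration; none of this ships with Torque.

```python
import subprocess

# Which daemons belong where, per the layout described in this thread.
# pbs_sched is the sample FIFO scheduler; a site running Maui or Moab
# would list that scheduler's daemon here instead.
DAEMONS_BY_ROLE = {
    "head": ["pbs_server", "pbs_sched"],
    "compute": ["pbs_mom"],
}


def is_running(name):
    """Return True if a process with exactly this name exists (uses pgrep -x)."""
    result = subprocess.run(
        ["pgrep", "-x", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def missing_daemons(role, check=is_running):
    """List the daemons expected for `role` that `check` reports as not running."""
    return [name for name in DAEMONS_BY_ROLE[role] if not check(name)]
```

Calling `missing_daemons("head")` on the head node and `missing_daemons("compute")` on a compute node should both return an empty list on a working setup. A head node where the scheduler was never started would report `pbs_sched`, which fits the "jobs stuck in queue" symptom.
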
> It seems as if at least part of the problem I'm having has to do with
> ssh permissions. Some time later after I started the scheduler I got an
> e-mail with an error message:
>
> > PBS Job Id: 0.FOO.berkeley.edu
> > Job Name: STDIN
> > Exec host: s105/0
> > An error has occurred processing your job, see below.
> > Post job file processing error; job 0.FOO.berkeley.edu on host s105/0
> >
> > Unable to copy file /var/spool/torque/spool/0.FOO.berkeley.edu.OU to jonah@FOO.berkeley.edu:/home/cs/jonah/STDIN.o0
> > *** error from copy
> > Host key verification failed.
> > lost connection
> > *** end error output
> > Output retained on that host in: /var/spool/torque/undelivered/0.FOO.berkeley.edu.OU
> >
> > Unable to copy file /var/spool/torque/spool/0.FOO.berkeley.edu.ER to jonah@FOO.berkeley.edu:/home/cs/jonah/STDIN.e0
> > *** error from copy
> > Host key verification failed.
> > lost connection
> > *** end error output
> > Output retained on that host in: /var/spool/torque/undelivered/0.FOO.berkeley.edu.ER
>
> It seems I need to force the host keys into the known_hosts file, but
> which known_hosts file? The user's? root's? The queue manager's?
As you can see in the error message, pbs_mom was running scp as user jonah,
so it is that user's known_hosts that gets consulted. Root never needs to scp.
http://www.clusterresources.com/products/torque/docs/6.1scpsetup.shtml
Though, I use /etc/ssh/ssh_known_hosts so that users don't need to deal with
it.
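
As a rough illustration of the system-wide approach, the Python sketch below collects host keys with `ssh-keyscan` and merges them into the text of a known_hosts file without duplicating entries. The helper names `scan_host_keys` and `merge_known_hosts` are hypothetical, the choice of `rsa` keys is an assumption, and keys gathered this way should be verified out of band before they are trusted.

```python
import subprocess


def scan_host_keys(hosts, key_type="rsa"):
    """Run ssh-keyscan for the given hosts and return its output as text."""
    result = subprocess.run(
        ["ssh-keyscan", "-t", key_type] + list(hosts),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
        check=True,
    )
    return result.stdout


def merge_known_hosts(existing, scanned):
    """Append scanned entries that are not already present.

    Blank lines and comment lines in the scanned text are skipped;
    non-blank lines of the existing text are kept in order.
    """
    lines = [line for line in existing.splitlines() if line.strip()]
    seen = set(lines)
    for line in scanned.splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#") or entry in seen:
            continue
        lines.append(entry)
        seen.add(entry)
    return "\n".join(lines) + "\n" if lines else ""
```

In the failure quoted above, the scp ran on compute node s105 and targeted FOO.berkeley.edu, so the file that matters is the one on the compute nodes, and it needs the key for the host that receives the output. One way to use the sketch would be to run it as root on each compute node: read /etc/ssh/ssh_known_hosts, merge in `scan_host_keys(["FOO.berkeley.edu"])`, and write the result back.
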
--
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California
Life is Good!