[torqueusers] trouble getting started -- jobs stuck in queue
Jeff Anderson-Lee
jonah at eecs.berkeley.edu
Tue Feb 9 12:45:33 MST 2010
Garrick Staples wrote:
> On Tue, Feb 09, 2010 at 10:24:12AM -0800, Jeff Anderson-Lee alleged:
>
>> I'm sure it's something simple. For instance, on which nodes am I
>> supposed to run pbs_mom and pbs_sched? The head node? The compute
>> nodes? Both??
>>
>
> The first question I was going to ask is if you started the scheduler.
>
>
> The headnode has pbs_server and a scheduler. Torque comes with a sample fifo
> scheduler called pbs_sched that you can run. Most sites eventually move to the
> Maui (free) or Moab (non-free) schedulers.
>
> The compute nodes run pbs_mom.
>
Thanks. It was not clear from the documentation and examples I'd found
so far what ran where.
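
For reference, the layout Garrick describes could be sketched as a small
shell helper. This is only an illustration: the role names "head" and
"compute" and the function name daemons_for_role are made up here, and
only the daemon names (pbs_server, pbs_sched, pbs_mom) come from the
thread.

```shell
# Print the TORQUE daemons that belong on a node of the given role.
# The role names "head" and "compute" are labels invented for this sketch.
daemons_for_role() {
    case $1 in
        head)    echo "pbs_server pbs_sched" ;;  # server plus the sample FIFO scheduler
        compute) echo "pbs_mom" ;;               # execution daemon on each compute node
        *)       echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Example use (as root, on the machine in question):
#   for d in $(daemons_for_role head); do "$d"; done
```

A site that moves to Maui or Moab would presumably swap pbs_sched for
that scheduler's daemon in the "head" branch.
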
It seems that at least part of my trouble has to do with ssh
permissions. Some time after I started the scheduler, I got an
e-mail with this error message:
> PBS Job Id: 0.FOO.berkeley.edu
> Job Name: STDIN
> Exec host: s105/0
> An error has occurred processing your job, see below.
> Post job file processing error; job 0.FOO.berkeley.edu on host s105/0
>
> Unable to copy file /var/spool/torque/spool/0.FOO.berkeley.edu.OU to
> jonah at FOO.berkeley.edu:/home/cs/jonah/STDIN.o0
> *** error from copy
> Host key verification failed.
> lost connection
> *** end error output
> Output retained on that host in:
> /var/spool/torque/undelivered/0.FOO.berkeley.edu.OU
>
> Unable to copy file /var/spool/torque/spool/0.FOO.berkeley.edu.ER to
> jonah at FOO.berkeley.edu:/home/cs/jonah/STDIN.e0
> *** error from copy
> Host key verification failed.
> lost connection
> *** end error output
> Output retained on that host in:
> /var/spool/torque/undelivered/0.FOO.berkeley.edu.ER
It seems I need to add the host keys to a known_hosts file, but
which one? The user's? root's? The queue manager's?
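
One possible approach, sketched under an assumption the thread does not
confirm: that the output copy runs as the job owner on the compute node
(s105 here) and targets the submit host. In that case the key would
need to be in that user's ~/.ssh/known_hosts on the compute node, or in
a system-wide file such as /etc/ssh/ssh_known_hosts there. The helper
name merge_known_hosts is hypothetical; ssh-keyscan is the standard
OpenSSH tool for fetching a host's public key.

```shell
# merge_known_hosts FILE
# Append host-key lines read from stdin to FILE, skipping blank lines,
# comment lines, and any line that is already present verbatim.
merge_known_hosts() {
    file=$1
    touch "$file" || return 1
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;   # ignore blanks and comments
        esac
        # -x: whole-line match, -F: fixed string, -q: quiet
        grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
    done
}

# Example use on a compute node (host name taken from the error above;
# the target file is an assumption, pick the one that fits your site):
#   ssh-keyscan -t rsa FOO.berkeley.edu | merge_known_hosts /etc/ssh/ssh_known_hosts
```

Keys fetched this way are not authenticated, so it is worth comparing
the fingerprint against the submit host's own key before trusting it.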