[torqueusers] trouble getting started -- jobs stuck in queue

Jeff Anderson-Lee jonah at eecs.berkeley.edu
Tue Feb 9 12:45:33 MST 2010


Garrick Staples wrote:
> On Tue, Feb 09, 2010 at 10:24:12AM -0800, Jeff Anderson-Lee alleged:
>   
>> I'm sure it's something simple.  For instance, on which nodes am I 
>> supposed to run pbs_mom and pbs_sched?  The head node?  The compute 
>> nodes?  Both??
>>     
>
> The first question I was going to ask is if you started the scheduler.
>
>
> The headnode has pbs_server and a scheduler. Torque comes with a sample fifo
> scheduler called pbs_sched that you can run. Most sites eventually move to the
> Maui (free) or Moab (non-free) schedulers.
>
> The compute nodes run pbs_mom.
>   
Thanks.  It was not clear from the documentation and examples I'd found 
so far what ran where.
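
For reference, here is a minimal sketch of that layout as described above. The role labels "head" and "compute" and the helper name missing_daemons are made up for illustration; only the daemon names come from Garrick's reply. A site running Maui or Moab would list that scheduler in place of pbs_sched.

```python
# Which Torque daemons belong on which kind of node, per the reply above.
# The role labels "head" and "compute" are illustrative, not Torque terms.
EXPECTED = {
    "head": ("pbs_server", "pbs_sched"),
    "compute": ("pbs_mom",),
}


def missing_daemons(role, running):
    """Return the daemons expected for this role that are not in `running`."""
    return [name for name in EXPECTED[role] if name not in running]


if __name__ == "__main__":
    # A head node where only pbs_server was started (the situation above).
    print(missing_daemons("head", {"pbs_server"}))
```

With only pbs_server running on the head node, the helper reports pbs_sched as missing, which matches the "did you start the scheduler" question.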

It seems that at least part of the problem I'm having has to do with 
ssh permissions.  Some time after I started the scheduler, I got an 
e-mail with an error message:

 > PBS Job Id: 0.FOO.berkeley.edu
 > Job Name:   STDIN
 > Exec host:  s105/0
 > An error has occurred processing your job, see below.
 > Post job file processing error; job 0.FOO.berkeley.edu on host s105/0
 >
 > Unable to copy file /var/spool/torque/spool/0.FOO.berkeley.edu.OU to jonah@FOO.berkeley.edu:/home/cs/jonah/STDIN.o0
 > *** error from copy
 > Host key verification failed.
 > lost connection
 > *** end error output
 > Output retained on that host in: /var/spool/torque/undelivered/0.FOO.berkeley.edu.OU
 >
 > Unable to copy file /var/spool/torque/spool/0.FOO.berkeley.edu.ER to jonah@FOO.berkeley.edu:/home/cs/jonah/STDIN.e0
 > *** error from copy
 > Host key verification failed.
 > lost connection
 > *** end error output
 > Output retained on that host in: /var/spool/torque/undelivered/0.FOO.berkeley.edu.ER
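
The useful part of that message is where the output ended up. A rough sketch for pulling those paths out of a saved copy of the notification follows; the function name retained_outputs is hypothetical, and the matching assumes the wording "Output retained on that host in:" exactly as it appears above. It tolerates ">" quote markers and line wrapping.

```python
import re


def retained_outputs(mail_text):
    """Return the paths that follow 'Output retained on that host in:'."""
    # Drop a leading '>' quote marker from each line, if present.
    lines = [re.sub(r"^\s*>\s?", "", line) for line in mail_text.splitlines()]
    # Rejoin wrapped lines and collapse runs of whitespace to single spaces.
    flat = " ".join(" ".join(lines).split())
    return re.findall(r"Output retained on that host in: (\S+)", flat)


# The two relevant lines from the notification, in their wrapped form.
SAMPLE = """\
 > Output retained on that host in:
/var/spool/torque/undelivered/0.FOO.berkeley.edu.OU
 > Output retained on that host in:
/var/spool/torque/undelivered/0.FOO.berkeley.edu.ER
"""

if __name__ == "__main__":
    for path in retained_outputs(SAMPLE):
        print(path)
```

The paths it prints are on the execution host named in the message (s105 here), not on the machine that received the mail.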

It seems I need to force the host keys into the known_hosts file, but 
which known_hosts file?  The user's?  root's?  The queue manager's?
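
One way to narrow this down is to check which candidate files already list the destination host. The sketch below is only that, a sketch: the two paths are the standard OpenSSH per-user and system-wide locations, the function names are made up, and it only understands plain entries (hashed "|1|..." entries, wildcard patterns, and "@" marker lines are ignored). It does not say which account performs the copy; it would need to be run on the host that attempts the copy (s105 in the message above) under each account in question.

```python
import os

# Standard OpenSSH locations: per-user file and system-wide file.
CANDIDATES = ["~/.ssh/known_hosts", "/etc/ssh/ssh_known_hosts"]


def lists_host(known_hosts_text, host):
    """True if a plain (unhashed) known_hosts entry names `host`."""
    for line in known_hosts_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and marker lines such as @revoked.
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        # The first field is a comma-separated list of host names/addresses.
        names = line.split()[0].split(",")
        if host in names:
            return True
    return False


def files_listing(host, paths=CANDIDATES):
    """Return the readable candidate files that list `host`."""
    found = []
    for path in paths:
        full = os.path.expanduser(path)
        try:
            with open(full) as handle:
                if lists_host(handle.read(), host):
                    found.append(full)
        except OSError:
            # Missing or unreadable file: nothing to report for it.
            continue
    return found


if __name__ == "__main__":
    print(files_listing("FOO.berkeley.edu"))
```

An empty list for a given account on that host would be consistent with the "Host key verification failed" error, though hashed entries would also produce an empty list here.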



