[torqueusers] Jobs stay at status "E"

Burkhard Bunk bunk at physik.hu-berlin.de
Tue Apr 16 06:41:13 MDT 2013


Hi,

yet another reason for scp failures, for the record: sshd on the master
node may be unable to handle enough connections (e.g. if many jobs
terminate almost simultaneously).
MaxStartups limits the number of concurrent unauthenticated connections
sshd accepts, so you may have to increase it (on the master, in
/etc/ssh/sshd_config) and reload sshd. [Default is 10, I had to set it
to 200 on a cluster with 176 compute nodes.]
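
Before changing anything it can help to see whether MaxStartups is already
set explicitly. The following is only a minimal sketch: show_max_startups
is a hypothetical helper name, not part of OpenSSH, and it simply reports
the first uncommented MaxStartups line in the given file.

```shell
# show_max_startups is a hypothetical helper, not part of OpenSSH.
# It prints the value of the first uncommented MaxStartups line in an
# sshd_config-style file, or "unset" if there is none.
show_max_startups() {
    conf="${1:-/etc/ssh/sshd_config}"
    awk '
        # The keyword is case-insensitive; commented lines start with "#"
        # and therefore never match.
        tolower($1) == "maxstartups" { print $2; found = 1; exit }
        END { if (!found) print "unset" }
    ' "$conf"
}

# The line to add or change on the master would look like this
# (200 is the value from the post above, not a general recommendation):
#   MaxStartups 200
```

Running show_max_startups without an argument reads /etc/ssh/sshd_config.
After editing the file, "sshd -t" checks the configuration for syntax
errors; how sshd is then reloaded depends on the distribution.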

Regards,
Burkhard Bunk.
----------------------------------------------------------------------
  bunk at physik.hu-berlin.de      Physics Institute, Humboldt University
  fax:    ++49-30 2093 7628     Newtonstr. 15
  phone:  ++49-30 2093 7980     12489 Berlin, Germany
----------------------------------------------------------------------

On Tue, 16 Apr 2013, Clotho Tsang wrote:

> 
> Another common reason is that password-less ssh is set up for the short
> hostname but not for the full hostname. Whether this matters depends on
> your DNS.
> 
> Example from /var/log/messages:
> 
> Feb 9 16:32:32 abc pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file
> /var/spool/torque/spool/21.example.com to
> ada@myhost.example.com:/home/ada/pbsjob.o21
> 
> Feb 9 16:32:36 abc pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB
> /var/spool/torque/spool/21.example.com.ER
> ada@myhost.example.com:/home/ada/pbsjob.e21' failed with status=1, giving up
> after 4 attempts
> 
> 
> To troubleshoot, rerun the failing scp command by hand on the compute node
> (in the example above, "/usr/bin/scp -rpB
> /var/spool/torque/spool/21.example.com.ER
> ada@myhost.example.com:/home/ada/pbsjob.e21") and read the error it prints.
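
The short-name versus full-name case can also be checked directly, without
waiting for a job to fail. This is a rough sketch only: check_ssh_hosts is
a hypothetical helper, and the user and host names are the ones from the
example log above. It should be run on the compute node as the job owner.

```shell
# check_ssh_hosts is a hypothetical helper: it tries a non-interactive
# login as USER on every HOST given and reports which names work.
check_ssh_hosts() {
    user="$1"; shift
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password
        # or passphrase, similar to the -B flag in the scp command above.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" true \
            >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}

# Example call, using the names from the log excerpt:
#   check_ssh_hosts ada myhost myhost.example.com
```

A name reported as FAILED may be missing a key or a known_hosts entry; the
helper cannot tell which, so running ssh by hand for that name is the next
step.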
> 
> 
> 
> On 16 April 2013 10:48, Clotho Tsang <wytsang at clustertech.com> wrote:
> Sometimes I find that jobs stay at status "E".
> 
> After some investigation, I found that this happens because the compute
> nodes are unable to scp files back to the job submission node.
> 
> One possible cause is that password-less ssh is not set up.
> The detailed error message can be found in /var/log/messages
> on the compute node.
> 
> 
> 
>

