[torqueusers] Jobs stay at status "E"

Clotho Tsang wytsang at clustertech.com
Mon Apr 15 21:00:38 MDT 2013

Another common cause is that password-less ssh is set up for the short
hostname but not for the fully qualified hostname. Which name gets used
depends on your DNS setup.

Example from /var/log/messages:

Feb 9 16:32:32 abc pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file
/var/spool/torque/spool/21.example.com to
ada@myhost.example.com:/home

Feb 9 16:32:36 abc pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB
/var/spool/torque/spool/21.example.com.ER
ada@myhost.example.com:/home/ada/pbsjob.e21'
failed with status=1, giving up after 4 attempts
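
Entries like the two above can be pulled out of the log with a small
filter. This is only a minimal sketch: the function name find_copy_errors
is made up for illustration, and the default path /var/log/messages is an
assumption that varies by distribution.

```shell
# Hypothetical helper: print pbs_mom file-copy errors from a syslog file.
# The default log path is an assumption; pass another path as $1 if needed.
find_copy_errors() {
    logfile=${1:-/var/log/messages}
    # Match the two error sources seen in the example: req_cpyfile and sys_copy.
    grep -E 'pbs_mom: LOG_ERROR::(req_cpyfile|sys_copy)' "$logfile"
}

# Example, run on the computation node (usually needs root to read the log):
# find_copy_errors /var/log/messages
```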

To troubleshoot the problem, rerun the failing scp command by hand to see
the actual error. In the example above, that is:

/usr/bin/scp -rpB /var/spool/torque/spool/21.example.com.ER ada@myhost.example.com:/home/ada/pbsjob.e21
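
To check the short and the full hostname in one go, something like the
following may help. It is a sketch, not part of TORQUE: the function name
check_passwordless_ssh is hypothetical, and the user and host names in the
example call are the ones from the log above. BatchMode=yes makes ssh fail
instead of prompting for a password, which is close to what scp -B does.

```shell
# Hypothetical helper: report whether key-based (password-less) login works
# for the given user on each host name passed as an argument.
check_passwordless_ssh() {
    user=$1
    shift
    rc=0
    for host in "$@"; do
        # BatchMode=yes: never prompt, fail if a password would be required.
        # ConnectTimeout=5: do not hang on unreachable or unresolvable names.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" true 2>/dev/null; then
            echo "OK   $user@$host"
        else
            echo "FAIL $user@$host"
            rc=1
        fi
    done
    return $rc
}

# Example, run on the computation node as the job owner:
# check_passwordless_ssh ada myhost myhost.example.com
```

A FAIL for only one of the two names points at the short versus full
hostname mismatch described above.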

On 16 April 2013 10:48, Clotho Tsang <wytsang at clustertech.com> wrote:

> Sometimes I find that jobs stay at status "E".
> After some investigation, I found that this happens because the
> computation nodes are unable to scp files back to the job submission node.
> One possible cause is that password-less ssh is not set up.
> The detailed error message can be found in /var/log/messages
> on the computation node.
