[torqueusers] Unable to copy output and error files to the submission dir (scp works fine)

Shantanu Gadgil shantanugadgil at yahoo.com
Mon Apr 9 22:07:20 MDT 2012


Hi,

Let's assume the following:
You are submitting from 'submit_node' while logged in as user 'sergio'.
The job gets scheduled on 'client_node'. (I can't make out the client node's hostname from the logs below.)

The reason is that 'root at client_node' (pbs_mom is running as root) is not able to scp the file to 'sergio at submit_node'.

I would use the following steps to track this down:

Log in to 'client_node' as 'root'. (Repeat the following steps for each client node in the cluster.)
Try to ssh into 'sergio at submit_node'. (There could be more than one of these if you have allowed several machines to be submit nodes.)

Also, from 'root at client_node', ssh into 'submit_node' using its FQDN; the FQDN is usually what pbs_mom uses.

Passwordless ssh should work in both cases!
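
The check above could be scripted roughly as follows. This is only a sketch: the default targets 'sergio@submit_node' and 'sergio@submit_node.example.com' are placeholders (the second stands in for the FQDN), and the TARGETS and RUN_CHECKS variables, the function names and the file name check_ssh.sh are made up for illustration. Run it as root on each client node. BatchMode=yes makes ssh fail rather than prompt for a password, so a FAIL line means passwordless login is not working for that target.

```shell
#!/bin/sh
# Sketch: check passwordless ssh from a client node to the submit node(s).
# Targets are placeholders; list both the short name and the FQDN.
TARGETS="${TARGETS:-sergio@submit_node sergio@submit_node.example.com}"

# Try a non-interactive login. BatchMode=yes disables password prompts,
# so this fails instead of hanging when key-based login does not work.
probe() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Print OK or FAIL for a single user@host target.
report() {
    if probe "$1" >/dev/null 2>&1; then
        echo "OK $1"
    else
        echo "FAIL $1"
    fi
}

# Check every target in the list.
check_all() {
    for target in $TARGETS; do
        report "$target"
    done
}

# Only contact the hosts when asked to, e.g. RUN_CHECKS=1 sh check_ssh.sh
if [ "${RUN_CHECKS:-0}" = "1" ]; then
    check_all
fi
```

Every target should print OK. If any prints FAIL, fix the keys or known_hosts for that user/host pair before resubmitting the job.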

Regards,
Shantanu


--- On Mon, 4/9/12, Sergio Belkin <sebelk at gmail.com> wrote:

> From: Sergio Belkin <sebelk at gmail.com>
> Subject: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine)
> To: torqueusers at supercluster.org
> Date: Monday, April 9, 2012, 3:50 PM
> Hi,
> 
> I'm using torque-mom-3.0.3 on Fedora 16. I'm a complete
> newcomer to torque, and I'm testing a pbs_server on a
> virtual machine and a pbs_client on the host. pbs_mom
> complains as follows on the node (client machine):
> 
> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/lib/torque/spool/270.mpimaster.mycluster.OU sergio at mpimaster.mycluster:/home/sergio/STDIN.o270' failed with status=1, giving up after 4 attempts
> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.OU to sergio at mpimaster.mycluster:/home/sergio/STDIN.o270
> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/lib/torque/spool/270.mpimaster.mycluster.ER sergio at mpimaster.mycluster:/home/sergio/STDIN.e270' failed with status=1, giving up after 4 attempts
> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.ER to sergio at mpimaster.mycluster:/home/sergio/STDIN.e270
> pbs_mom: LOG_ERROR::req_cpyfile,
> 
> Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.OU to sergio at mpimaster.mycluster:/home/sergio/STDIN.o270
> *** error from copy
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
> lost connection
> *** end error output
> Output retained on that host in: /var/lib/torque/undelivered/270.mpimaster.mycluster.OU
> 
> I've read the documentation and googled this problem, and it
> doesn't seem to be an ssh/scp problem. So:
> 
> * I've tested /usr/bin/scp -rpB somefile
>   sergio at mpimaster.mycluster:/home/sergio and it
>   works with no problem
> * I've tested putting scp into crontab and it works fine too
> 
> Of course, the entry for mpimaster.mycluster in
> /home/sergio/.ssh/known_hosts on mpinode02 (the client
> machine with pbs_mom running) matches
> /etc/ssh/ssh_host_rsa_key.pub on mpimaster.mycluster ...
> 
> (I use keychain in both cases)
> 
> So, I don't know what I am doing wrong. Please could you
> help me to
> solve this problem?
> 
> Thanks in advance!
> -- 
> --
> Sergio Belkin  http://www.sergiobelkin.com
> Watch More TV http://sebelk.blogspot.com
> LPIC-2 Certified - http://www.lpi.org

