[torqueusers] Unable to copy output and error files to the submission dir (scp works fine)
Shantanu Gadgil
shantanugadgil at yahoo.com
Mon Apr 9 22:07:20 MDT 2012
Hi,
Let's assume the following:
You are submitting from 'submit_node' while logged in as user 'sergio'.
The job gets scheduled on 'client_node'. (I can't make out the client node's hostname from the logs below.)
The reason is that 'root@client_node' (pbs_mom is running as root) is not able to scp the file to 'sergio@submit_node'.
I would use the following steps to get over this:
Log in to the 'client_node' as 'root'. (Repeat the following steps for each client_node in the cluster.)
Try to ssh to 'sergio@submit_node' (there could be more than one if you have allowed several machines to be submit nodes).
Also, from root@client_node, ssh to the 'submit_node' using its FQDN; the FQDN is usually what pbs_mom uses.
Passwordless ssh should work in both cases!
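
The checks above can be sketched as a small script, assuming it is run as root on each client node. The function name check_ssh and the RUN variable are only illustrative; the user and FQDN are taken from the log quoted below, and the short host name 'mpimaster' is my assumption. BatchMode=yes makes ssh refuse to prompt, which is the same idea as the -B flag in the scp command that pbs_mom runs.

```shell
#!/bin/sh
# Sketch only: run as root on each client node (where pbs_mom runs).

# RUN defaults to "echo", so the commands are only printed.
# Set RUN to the empty string (RUN= sh script.sh) to really execute them.
RUN="${RUN-echo}"

# check_ssh USER HOST...
# Tries a non-interactive login to USER@HOST for every HOST given.
check_ssh() {
    user="$1"; shift
    for host in "$@"; do
        # BatchMode=yes makes ssh fail rather than prompt for a password
        # or passphrase, like the -B flag of the scp command in the log.
        $RUN ssh -o BatchMode=yes "$user@$host" true \
            || echo "FAILED: $user@$host" >&2
    done
}

# Placeholder values: user and FQDN come from the quoted log; the short
# host name "mpimaster" is an assumption.
check_ssh sergio mpimaster mpimaster.mycluster
```

If any of these logins prints FAILED when run for real, the copy back to the submission directory will most likely fail in the same way.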
Regards,
Shantanu
--- On Mon, 4/9/12, Sergio Belkin <sebelk at gmail.com> wrote:
> From: Sergio Belkin <sebelk at gmail.com>
> Subject: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine)
> To: torqueusers at supercluster.org
> Date: Monday, April 9, 2012, 3:50 PM
> Hi,
>
> I'm using torque-mom-3.0.3 on Fedora 16. I'm a complete newbie to
> torque and I'm testing a pbs_server on a virtual machine and a
> pbs_client on the host. pbs_mom complains as follows on the node
> (client machine):
>
> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/lib/torque/spool/270.mpimaster.mycluster.OU sergio@mpimaster.mycluster:/home/sergio/STDIN.o270' failed with status=1, giving up after 4 attempts
> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.OU to sergio@mpimaster.mycluster:/home/sergio/STDIN.o270
> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/lib/torque/spool/270.mpimaster.mycluster.ER sergio@mpimaster.mycluster:/home/sergio/STDIN.e270' failed with status=1, giving up after 4 attempts
> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.ER to sergio@mpimaster.mycluster:/home/sergio/STDIN.e270
> pbs_mom: LOG_ERROR::req_cpyfile,
>
> Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.OU
> to sergio@mpimaster.mycluster:/home/sergio/STDIN.o270
> *** error from copy
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
> lost connection
> *** end error output
> Output retained on that host in: /var/lib/torque/undelivered/270.mpimaster.mycluster.OU
>
> I've read the documentation and googled this problem, and it doesn't
> seem to be an ssh/scp problem. So:
>
> * I've tested /usr/bin/scp -rpB somefile sergio@mpimaster.mycluster:/home/sergio and it works with no problem
> * I've tested putting scp into crontab and it works fine too
>
> Of course, the entry for mpimaster.mycluster in
> /home/sergio./known_hosts on mpinode02 (client machine with pbs_mom
> running) matches /etc/ssh/ssh_host_rsa_key.pub on
> mpimaster.mycluster ...
>
> (I use keychain in both cases)
>
> So, I don't know what I am doing wrong. Please could you
> help me to
> solve this problem?
>
> Thanks in advance!
> --
> Sergio Belkin http://www.sergiobelkin.com
> Watch More TV http://sebelk.blogspot.com
> LPIC-2 Certified - http://www.lpi.org