[torqueusers] Unable to copy output and error files to the submission dir (scp works fine)

Sergio Belkin sebelk at gmail.com
Thu Apr 12 12:11:34 MDT 2012


2012/4/10 Shantanu Gadgil <shantanugadgil at yahoo.com>:
> Hi,
>
> Let's assume the following ...
> You are submitting from 'submit_node' while logged in as user 'sergio'.
> The job gets scheduled on 'client_node'. (I can't make out the client node's hostname from the logs below.)
>
> The reason is that 'root at client_node' (pbs_mom is running as root) is not able to scp the file to 'sergio at submit_node'.
>
> I would use the following steps to get past this:
>
> Log in to the 'client_node' as 'root'. (Repeat the following steps for each client_node in the cluster.)
> Try to ssh to 'sergio at submit_node' (there could be more than one if you have allowed many machines to be submit nodes).
>
> Also, from root at client_node, ssh into the 'submit_node' using the FQDN ... the FQDN is usually what pbs_mom uses.
>
> Passwordless ssh should work in both cases!
>
> Regards,
> Shantanu
>
>
> --- On Mon, 4/9/12, Sergio Belkin <sebelk at gmail.com> wrote:
>
>> From: Sergio Belkin <sebelk at gmail.com>
>> Subject: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine)
>> To: torqueusers at supercluster.org
>> Date: Monday, April 9, 2012, 3:50 PM
>> Hi,
>>
>> I'm using torque-mom-3.0.3 on Fedora 16. I'm a complete newbie to
>> torque and I'm testing a pbs_server on a virtual machine and a
>> pbs_client on the host. pbs_mom complains as follows on the node
>> (client machine):
>>
>> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/lib/torque/spool/270.mpimaster.mycluster.OU sergio at mpimaster.mycluster:/home/sergio/STDIN.o270' failed with status=1, giving up after 4 attempts
>> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.OU to sergio at mpimaster.mycluster:/home/sergio/STDIN.o270
>> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/lib/torque/spool/270.mpimaster.mycluster.ER sergio at mpimaster.mycluster:/home/sergio/STDIN.e270' failed with status=1, giving up after 4 attempts
>> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.ER to sergio at mpimaster.mycluster:/home/sergio/STDIN.e270
>> pbs_mom: LOG_ERROR::req_cpyfile,
>>
>> Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.OU to sergio at mpimaster.mycluster:/home/sergio/STDIN.o270
>> *** error from copy
>> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
>> lost connection
>> *** end error output
>> Output retained on that host in:
>> /var/lib/torque/undelivered/270.mpimaster.mycluster.OU
>>
>> I've read the documentation and googled this problem, and it doesn't
>> seem to be a problem with ssh/scp itself. So:
>>
>> * I've tested /usr/bin/scp -rpB somefile sergio at mpimaster.mycluster:/home/sergio and it works with no problem.
>> * I've tested putting scp into crontab and it works fine too.
>>
>> Of course mpimaster.mycluster is in /home/sergio./known_hosts on
>> mpinode02 (the client machine with pbs_mom running), and the entry
>> matches /etc/ssh/ssh_host_rsa_key.pub on mpimaster.mycluster ...
>>
>> (I use keychain in both cases)
>>
>> So, I don't know what I am doing wrong. Could you please
>> help me solve this problem?
>>
>> Thanks in advance!
>> --
>> --

Thanks Shantanu for your answer.

It is still failing:

mpinode02.mycluster is a computing node
mpimaster.mycluster is the server
sergio is the non-root user that submits jobs

I've tried:

Creating /root/.ssh/config

Host mpimaster.mycluster
    User sergio
    GSSAPIAuthentication no
    IdentityFile ~sergio/.ssh/id_rsa


And appending to /root/.bashrc the following:

/usr/bin/keychain --nogui ~sergio/.ssh/id_rsa
source ~sergio/.keychain/sebelk.argentina-sh
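
For reference, a small check of whether a given shell can actually reach
the agent that keychain started. This is only a sketch: the function name
agent_status is hypothetical, and it assumes that the daemon does not read
/root/.bashrc (bash normally reads it only for interactive shells), which
would leave SSH_AUTH_SOCK unset for anything pbs_mom starts.

```shell
#!/bin/sh
# agent_status is a hypothetical helper, not part of keychain or torque.
# It reports whether the current environment points at a usable ssh-agent socket.
agent_status() {
    if [ -z "${SSH_AUTH_SOCK:-}" ]; then
        # Typical for a process that never sourced the keychain file
        echo "no agent: SSH_AUTH_SOCK is not set"
    elif [ ! -S "$SSH_AUTH_SOCK" ]; then
        # Variable is set but the socket it names is gone
        echo "stale agent: $SSH_AUTH_SOCK is not a socket"
    else
        echo "agent socket: $SSH_AUTH_SOCK"
    fi
}

agent_status
```

Running it once in the interactive "su root" shell and once from a
non-interactive context (for example a cron entry) should show whether the
two environments differ.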

So I log in on mpinode02 as test (a non-root user), then I run
"su root" and I can ssh and scp to mpimaster with no problem, but
when I submit a job via torque, it fails again as in my first
post.
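
A possible way to narrow this down is to repeat the copy without the
interactive environment, since the manual test above inherits whatever
.bashrc and keychain set up. The sketch below is an approximation, not
what pbs_mom literally does: the function name batch_scp, the RUN switch
and the test file path are hypothetical, and the "@" in the destination is
restored from the "sergio at mpimaster.mycluster" shown in the logs. I am
not certain which account performs the copy, so it may be worth running
both as root and as sergio on mpinode02.

```shell
#!/bin/sh
# batch_scp is a hypothetical helper. It builds the scp call from the pbs_mom
# log, but starts it from an empty environment so that no SSH_AUTH_SOCK from
# keychain is inherited:
#   env -i   clear the environment (only HOME and PATH are passed on)
#   -B       batch mode, never prompt for a password or passphrase (as in the log)
#   -v       print which keys and authentication methods are tried
batch_scp() {
    # $1 = local file, $2 = user@host:path
    set -- env -i HOME="$HOME" PATH=/usr/bin:/bin /usr/bin/scp -v -rpB "$1" "$2"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"            # really run the copy
    else
        echo "$@"       # default: only show the command that would run
    fi
}

# Dry run by default; call with RUN=1 in the environment to execute.
batch_scp /tmp/scp-test.txt sergio@mpimaster.mycluster:/home/sergio/
```

If this fails with the same "Permission denied" while the interactive test
succeeds, the difference would be in the environment (agent, passphrase)
rather than in scp itself.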

I don't know what I am doing wrong :(

Could you please help me?

Thanks in advance!
-- 
--
Sergio Belkin  http://www.sergiobelkin.com
Watch More TV http://sebelk.blogspot.com
LPIC-2 Certified - http://www.lpi.org

