[torqueusers] Unable to copy output and error files to the submission dir (scp works fine)

Sergio Belkin sebelk at gmail.com
Mon Apr 16 04:25:53 MDT 2012


2012/4/12 Shantanu Gadgil <shantanugadgil at yahoo.com>:
> Hi Sergio,
>
> Please see my comments inline ... I have few queries and idea ... hopefully they'll help ... :)
>
> --- On Thu, 4/12/12, Sergio Belkin <sebelk at gmail.com> wrote:
>
>> From: Sergio Belkin <sebelk at gmail.com>
>> Subject: Re: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine)
>> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
>> Date: Thursday, April 12, 2012, 11:41 PM
>> 2012/4/10 Shantanu Gadgil <shantanugadgil at yahoo.com>:
>> > Hi,
>> >
>> > Let's assume the following:
>> > You are submitting from 'submit_node' while logged in as user 'sergio'.
>> > The job gets scheduled on 'client_node'. (I can't make out the client node's hostname from the logs below.)
>> >
>> > The reason is that 'root at client_node' (pbs_mom is running as root) is not able to scp the file to 'sergio at submit_node'.
>> >
>> > I would use the following steps to get over this:
>> >
>> > Log in to the 'client_node' as 'root'. (Repeat the following steps for each client_node in the cluster.)
>> > Try to ssh to 'sergio at submit_node' (there could be more than one if you have allowed many machines to be submit nodes).
>> >
>> > Also, from root at client_node, ssh to the 'submit_node' using the FQDN; the FQDN is usually what pbs_mom uses.
>> >
>> > Passwordless ssh should work in both cases!
>> >
>> > Regards,
>> > Shantanu
>> >
>> >
>> > --- On Mon, 4/9/12, Sergio Belkin <sebelk at gmail.com> wrote:
>> >
>> >> From: Sergio Belkin <sebelk at gmail.com>
>> >> Subject: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine)
>> >> To: torqueusers at supercluster.org
>> >> Date: Monday, April 9, 2012, 3:50 PM
>> >> Hi,
>> >>
>> >> I'm using torque-mom-3.0.3 on Fedora 16. I'm a complete newbie with torque, and I'm testing a pbs_server on a virtual machine and a pbs_client on the host. pbs_mom complains as follows on the node (client machine):
>> >>
>> >> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/lib/torque/spool/270.mpimaster.mycluster.OU sergio at mpimaster.mycluster:/home/sergio/STDIN.o270' failed with status=1, giving up after 4 attempts
>> >> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.OU to sergio at mpimaster.mycluster:/home/sergio/STDIN.o270
>> >> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/lib/torque/spool/270.mpimaster.mycluster.ER sergio at mpimaster.mycluster:/home/sergio/STDIN.e270' failed with status=1, giving up after 4 attempts
>> >> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.ER to sergio at mpimaster.mycluster:/home/sergio/STDIN.e270
>> >> pbs_mom: LOG_ERROR::req_cpyfile,
>> >>
>> >> Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.OU to sergio at mpimaster.mycluster:/home/sergio/STDIN.o270
>> >> *** error from copy
>> >> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
>> >> lost connection
>> >> *** end error output
>> >> Output retained on that host in: /var/lib/torque/undelivered/270.mpimaster.mycluster.OU
>> >>
>> >> I've read the documentation and googled this problem, and it doesn't seem to be a problem with ssh/scp itself:
>> >>
>> >> * I've tested /usr/bin/scp -rpB somefile sergio at mpimaster.mycluster:/home/sergio and it works with no problem
>> >> * I've tested running scp from crontab and it works fine too
>> >>
>> >> Of course, the entry for mpimaster.mycluster in /home/sergio/.ssh/known_hosts on mpinode02 (the client machine with pbs_mom running) matches /etc/ssh/ssh_host_rsa_key.pub on mpimaster.mycluster ...
>> >>
>> >> (I use keychain in both cases)
>> >>
>> >> So, I don't know what I am doing wrong. Could you please help me solve this problem?
>> >>
>> >> Thanks in advance!
>> >> --
>> >> --
>>
>> Thanks Shantanu for your answer.
>>
>> It's still failing:
>>
>> mpinode02.mycluster is a computing node
>> mpimaster.mycluster is the server
>> sergio is the non-root user that submits jobs
>>
>> I've tried:
>>
>> Creating /root/.ssh/config
>>
>> Host mpimaster.mycluster
>>     User sergio
>>     GSSAPIAuthentication no
>>     IdentityFile ~sergio/.ssh/id_rsa
>>
>>
>> And appending to /root/.bashrc the following:
>>
>> /usr/bin/keychain --nogui ~sergio/.ssh/id_rsa
>> source ~sergio/.keychain/sebelk.argentina-sh
>>
>> So I log in on mpinode02 as the test user (test is a non-root user), then I run "su root", and I can ssh and scp to mpimaster with no problem. But when I submit a job via torque, it fails again as in my first post.
>
> I presume the user 'sergio' has a shared home directory on mpimaster and mpinode2 ?

No, he doesn't. Sorry if this is a stupid question: is that a problem?

>
> Which root user? mpimaster or mpinode2 ?!?

mpinode02 (client) root user

>
> I am confused a little here ... so I'll ask again ...
> * you sshed from root at mpinode2 to sergio at mpimaster ?
> OR
> * you sshed from root at mpinode2 to root at mpimaster?

I ssh'ed from root at mpinode02 to sergio at mpimaster.

>
> Did you try the fqdn while doing ssh from root at mpinode2?

Yes I did

>
> Can you please post the actual commandlines that you used ?

Yes. Remember that test is a user on mpinode02. I switched users this
way to show that neither root nor sergio starts a login shell:


[sergio at sebelk ~]$ su - test
Password:
[test at sebelk ~]$ su
Password:

KeyChain 2.6.8; http://www.gentoo.org/proj/en/keychain/
Copyright 2002-2004 Gentoo Foundation; Distributed under the GPL

 * Found existing ssh-agent (2534)
 * Found existing gpg-agent (2560)
 * Adding 1 ssh key(s)...
Enter passphrase for /root/.ssh/id_rsa:
Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)

[root at sebelk test]# ssh sergio at mpimaster.mycluster
Last login: Mon Apr 16 06:49:39 2012 from mpinode02.mycluster

KeyChain 2.6.8; http://www.gentoo.org/proj/en/keychain/
Copyright 2002-2004 Gentoo Foundation; Distributed under the GPL

 * Found existing ssh-agent (1149)
 * Found existing gpg-agent (1175)
 * Known ssh key: /home/sergio/.ssh/id_rsa

[sergio at mpimaster ~]$


As you can see, ssh to mpimaster never prompts for a passphrase or a
password. Note that root does not start a login shell here, and even
so it can ssh as sergio to mpimaster.mycluster.

That's because I've appended the following to /root/.bashrc on mpinode02:


## START-Keychain ###
# Let  re-use ssh-agent and/or gpg-agent between logins
#/usr/bin/keychain --nogui ~sergio/.ssh/id_rsa
/usr/bin/keychain --nogui /root/.ssh/id_rsa
#source ~sergio/.keychain/sebelk.argentina-sh
source /root/.keychain/sebelk.argentina-sh
## End-Keychain ###



>
> Maybe pbs_mom doesn't source the startup script (.bashrc) and so never knows about the keychain ?!?

Yes, I think that is the case.
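
One way to check this hypothesis from a shell is to run the same scp command with an almost empty environment, so that nothing set up by .bashrc or keychain (in particular SSH_AUTH_SOCK) is visible to it. The following is only a sketch: run_like_daemon is a hypothetical helper name, and the assumption that pbs_mom's environment is similarly bare is mine, not something confirmed in this thread. Note that the archive shows addresses as "user at host"; real commands need user@host.

```shell
# Hypothetical helper: run a command with a nearly empty environment,
# roughly the way a daemon that never sourced ~/.bashrc would see it.
# Assumption: pbs_mom's environment is similarly bare (not verified here).
run_like_daemon() {
    env -i HOME="${HOME:-/}" PATH=/usr/bin:/bin "$@"
}

# Example invocation (not executed here; host and path are from this thread):
#   run_like_daemon /usr/bin/scp -rpB somefile sergio@mpimaster.mycluster:/home/sergio
# scp's -B (batch mode) makes it fail instead of prompting for a passphrase,
# so a "Permission denied" here would match what pbs_mom logs.
```

If the copy works from an interactive root shell but fails through this wrapper, the ssh-agent started by keychain is the likely difference.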

>
> Is it possible to try the above without keychain?

keychain is what allows the passwordless login.

>
> i.e. append pubkey of root at mpinode2 to the file ~sergio/.ssh/authorized_keys (and root at mpimaster)

The pubkey of root at mpinode2 is already in
~sergio/.ssh/authorized_keys and in that of root at mpimaster.
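
For reference, the step suggested above (adding a public key to an authorized_keys file) can be sketched as a small helper that does nothing if the key is already present. append_key_once is a hypothetical name, and the file paths in the usage comment are assumptions based on the accounts mentioned in this thread.

```shell
# Hypothetical helper: append a public key to an authorized_keys file
# only if that exact line is not already there.
# $1 = public key file (one line), $2 = authorized_keys file
append_key_once() {
    pub=$1
    auth=$2
    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    # sshd normally refuses key files that are writable by others.
    chmod 600 "$auth"
    grep -qxF -- "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
}

# Example (assumed paths, run on mpimaster after copying the key over):
#   append_key_once /tmp/root_mpinode02.pub /home/sergio/.ssh/authorized_keys
```

The directory holding authorized_keys should also not be group- or world-writable; the helper leaves directory permissions alone on purpose.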

>
> ... and see if password less ssh works that way?

No, in that case it asks for the passphrase.
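
A passphrase prompt means the private key itself is encrypted, so without an agent nothing non-interactive can use it. A quick way to check a key is sketched below; key_has_empty_passphrase is a hypothetical helper name. It relies on ssh-keygen -y (print the public key of a private key file) together with -P "" (supply an empty passphrase), which fails without prompting when the key is protected.

```shell
# Hypothetical helper: succeed only if the private key in $1 can be read
# with an empty passphrase, i.e. it is usable without an ssh-agent.
key_has_empty_passphrase() {
    ssh-keygen -y -P "" -f "$1" >/dev/null 2>&1
}

# Example (path from this thread, not executed here):
#   key_has_empty_passphrase /root/.ssh/id_rsa && echo "no passphrase" \
#       || echo "passphrase-protected (or unreadable)"
```

Whether a passphrase-less key dedicated to this copy is acceptable is a security trade-off for each site to decide; the check only reports the current state.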

>
> Also, on a side note ... do you think the pbs_mom's "usecp" directive could help you get around this rather than allow "ssh" at all ?!?
> Ref: http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/a.cmomconfig.php

I've tried to avoid NFS and NIS because they are ancient and insecure
services (perhaps I'm wrong), but maybe that is the better option.

What do you think?
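
For context, a $usecp entry tells pbs_mom to use a local copy instead of scp when the destination matches a host pattern and path prefix. The sketch below only writes such a line to a config file. add_usecp is a hypothetical helper, the config path /var/lib/torque/mom_priv/config is an assumption derived from the /var/lib/torque paths in the log, and the mapping only makes sense if /home from mpimaster is actually mounted at /home on the node, which is not the case in the setup described here.

```shell
# Hypothetical helper: add a $usecp mapping to a pbs_mom config file,
# skipping it if the identical line already exists.
# $1 = config file, $2 = host pattern, $3 = path prefix on that host,
# $4 = path where the same directory is visible on this node
add_usecp() {
    conf=$1
    host=$2
    remote=$3
    local_path=$4
    line="\$usecp ${host}:${remote} ${local_path}"
    grep -qxF -- "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
}

# Example (assumed config path; requires a shared /home):
#   add_usecp /var/lib/torque/mom_priv/config '*.mycluster' /home /home
```

pbs_mom would have to re-read its configuration afterwards before the mapping takes effect.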

>
> Regards,
> Shantanu



-- 
--
Sergio Belkin  http://www.sergiobelkin.com
Watch More TV http://sebelk.blogspot.com
LPIC-2 Certified - http://www.lpi.org

