[torqueusers] Unable to copy output and error files to the submission dir (scp works fine)

Shantanu Gadgil shantanugadgil at yahoo.com
Thu Apr 12 12:36:44 MDT 2012


Hi Sergio,

Please see my comments inline. I have a few queries and ideas; hopefully they'll help :)

--- On Thu, 4/12/12, Sergio Belkin <sebelk at gmail.com> wrote:

> From: Sergio Belkin <sebelk at gmail.com>
> Subject: Re: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine)
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Date: Thursday, April 12, 2012, 11:41 PM
> 2012/4/10 Shantanu Gadgil <shantanugadgil at yahoo.com>:
> > Hi,
> >
> > Let's assume the following:
> > You are submitting from 'submit_node' while logged in as user 'sergio'.
> > The job gets scheduled on 'client_node'. (I can't make out the client node's hostname from the logs below.)
> >
> > The reason is that 'root@client_node' (pbs_mom is running as root) is not able to scp the file to 'sergio@submit_node'.
> >
> > I would use the following steps to get over this:
> >
> > Log in to the 'client_node' as 'root'. (Repeat the following steps for each client_node in the cluster.)
> > Try to ssh into 'sergio@submit_node' (there could be more than one if you have allowed many machines to be submit nodes).
> >
> > Also, from root@client_node, ssh into the 'submit_node' using the FQDN; the FQDN is usually what pbs_mom uses.
> >
> > Passwordless ssh should work in both cases!
> >
> > Regards,
> > Shantanu
> >
> >
> > --- On Mon, 4/9/12, Sergio Belkin <sebelk at gmail.com>
> wrote:
> >
> >> From: Sergio Belkin <sebelk at gmail.com>
> >> Subject: [torqueusers] Unable to copy output and
> error files to the submission dir (scp works fine)
> >> To: torqueusers at supercluster.org
> >> Date: Monday, April 9, 2012, 3:50 PM
> >> Hi,
> >>
> >> I'm using torque-mom-3.0.3 on Fedora 16. I'm a complete newbie with
> >> torque and I'm testing a pbs_server on a virtual machine and a
> >> pbs_client on the host. pbs_mom complains as follows on the node
> >> (client machine):
> >>
> >> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/lib/torque/spool/270.mpimaster.mycluster.OU sergio@mpimaster.mycluster:/home/sergio/STDIN.o270' failed with status=1, giving up after 4 attempts
> >> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.OU to sergio@mpimaster.mycluster:/home/sergio/STDIN.o270
> >> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/lib/torque/spool/270.mpimaster.mycluster.ER sergio@mpimaster.mycluster:/home/sergio/STDIN.e270' failed with status=1, giving up after 4 attempts
> >> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.ER to sergio@mpimaster.mycluster:/home/sergio/STDIN.e270
> >> pbs_mom: LOG_ERROR::req_cpyfile,
> >>
> >> Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.OU
> >> to sergio@mpimaster.mycluster:/home/sergio/STDIN.o270
> >> *** error from copy
> >> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
> >> lost connection
> >> *** end error output
> >> Output retained on that host in:
> >> /var/lib/torque/undelivered/270.mpimaster.mycluster.OU
> >>
> >> I've read the documentation and googled this problem, and it doesn't
> >> seem to be a problem with ssh/scp. So:
> >>
> >> * I've tested /usr/bin/scp -rpB somefile sergio@mpimaster.mycluster:/home/sergio and it works with no problem
> >> * I've tested putting scp into crontab and it works fine too
> >>
> >> Of course the mpimaster.mycluster entry in /home/sergio/.ssh/known_hosts
> >> on mpinode02 (client machine with pbs_mom running) matches
> >> /etc/ssh/ssh_host_rsa_key.pub on mpimaster.mycluster.
> >>
> >> (I use keychain in both cases)
> >>
> >> So, I don't know what I am doing wrong. Could you please help me
> >> solve this problem?
> >>
> >> Thanks in advance!
> >> --
> >> --
> 
> Thanks Shantanu for your answer.
> 
> It is still failing:
> 
> mpinode02.mycluster is a computing node
> mpimaster.mycluster is the server
> sergio is the non-root user that submits jobs
> 
> I've tried:
> 
> Creating /root/.ssh/config
> 
> Host mpimaster.mycluster
>     User sergio
>     GSSAPIAuthentication no
>     IdentityFile ~sergio/.ssh/id_rsa
> 
> 
> And appending to /root/.bashrc the following:
> 
> /usr/bin/keychain --nogui ~sergio/.ssh/id_rsa
> source ~sergio/.keychain/sebelk.argentina-sh
> 
> So I log in on mpinode02 as the test user (test is a non-root user), then
> I run "su root" and I can ssh and scp to mpimaster with no
> problem, but when I submit a job via torque, it fails again as in my
> first post.

I presume the user 'sergio' has a shared home directory on mpimaster and mpinode02?

Which root user: the one on mpimaster or the one on mpinode02?

I am a little confused here, so I'll ask again:
* did you ssh from root@mpinode02 to sergio@mpimaster?
OR
* did you ssh from root@mpinode02 to root@mpimaster?

Did you try the FQDN when doing ssh from root@mpinode02?

Can you please post the actual command lines that you used?

Maybe pbs_mom doesn't source the startup script (.bashrc) and so never knows about the keychain?
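
A small sketch of why that would matter, under the assumption that pbs_mom is started by the init system with an environment that never went through /root/.bashrc. keychain hands the ssh-agent to processes through SSH_AUTH_SOCK, and a process started with a clean environment does not have that variable. The socket path below is a made-up placeholder, and `env -i` only stands in for "started without the login environment":

```shell
# Pretend keychain exported an agent socket in the interactive shell.
# The path is a hypothetical placeholder, not a real socket.
export SSH_AUTH_SOCK=/tmp/hypothetical-agent.sock

# What a child of the interactive shell sees:
interactive_view=$(/bin/sh -c 'echo "${SSH_AUTH_SOCK:-unset}"')

# What a process started with an empty environment sees (env -i drops every variable):
clean_view=$(env -i /bin/sh -c 'echo "${SSH_AUTH_SOCK:-unset}"')

echo "interactive shell: $interactive_view"
echo "clean environment: $clean_view"

# To try the real copy the same way, as root on mpinode02 (left commented out
# because it needs the actual hosts; 'somefile' is a placeholder):
# env -i /usr/bin/scp -rpB somefile sergio@mpimaster.mycluster:/home/sergio/
```

If the commented scp fails with "Permission denied" while the same command works from the "su root" shell, the agent is the likely difference. Note that scp -B is batch mode, so it cannot prompt for a key passphrase either.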

Is it possible to try the above without keychain?

i.e. append the public key of root@mpinode02 to the file ~sergio/.ssh/authorized_keys on mpimaster (and to that of root@mpimaster)

... and see if passwordless ssh works that way?
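
Roughly, that could look like the sketch below. The helper name add_key and the file /tmp/mpinode02_root.pub are made up for illustration, and it assumes root on mpinode02 has (or gets) its own key pair without a passphrase, since nothing will be there to type one when pbs_mom runs scp:

```shell
# add_key KEYFILE AUTHFILE
# Append the public key in KEYFILE to AUTHFILE unless that exact line is already there.
# (Hypothetical helper, meant to be run as the user who owns AUTHFILE.)
add_key() {
    keyfile=$1
    authfile=$2
    mkdir -p "$(dirname "$authfile")"
    touch "$authfile"
    # -q quiet, -x match whole lines, -F fixed strings, -f read patterns from KEYFILE
    if ! grep -qxF -f "$keyfile" "$authfile"; then
        cat "$keyfile" >> "$authfile"
    fi
    # sshd normally refuses key files that group or others can write to
    chmod 600 "$authfile"
}

# Intended use (commented out; needs the real hosts):
#   on mpinode02 as root:   copy /root/.ssh/id_rsa.pub to mpimaster as /tmp/mpinode02_root.pub
#   on mpimaster as sergio: add_key /tmp/mpinode02_root.pub ~/.ssh/authorized_keys
#   on mpinode02 as root:   ssh -o BatchMode=yes sergio@mpimaster.mycluster true
```

The BatchMode check at the end is the interesting part: it fails instead of prompting, which is close to what scp -B does inside pbs_mom.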

Also, on a side note: do you think pbs_mom's "usecp" directive could help you get around this, so that "ssh" is not needed at all?
Ref: http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/a.cmomconfig.php
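
For example, if /home really is the same shared filesystem on mpimaster and mpinode02 (that is an assumption, see my first question above), a line like this in the mom's config file on mpinode02 should make pbs_mom use a plain local copy for anything destined for /home on any host. The config file is mom_priv/config under the TORQUE home directory, which I am guessing is /var/lib/torque on your install, going by the spool path in your logs:

```
$usecp *:/home /home
```

pbs_mom would need to be restarted after the change so that it reads the new config.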

Regards,
Shantanu

