[torqueusers] scp unreliability

Darren Platt darren at 23andme.com
Fri Jun 6 09:54:33 MDT 2008


On Fri, Jun 6, 2008 at 4:37 AM, Chris Samuel <csamuel at vpac.org> wrote:

>
> ----- "Darren Platt" <darren at 23andme.com> wrote:
>
> > Just to elaborate on my earlier comments on the scp mechanism for file
> > transfer. Here's a simple test that breaks it on our (modestly small)
> > test cluster:
>
> Two questions:
>
> 1) Is this with 2.3 ?


Yes, 2.3.0.


>
> 2) Can you check your syslog and mom logs for things like:
>
> pbs_mom: No such file or directory (2) in open_std_file, cannot open/create
> stdout/stderr file '/usr/spool/PBS/spool/253428.tango-m.vpac.org.OU'


I didn't find this one. I did finally locate the failure notification in
the syslogs (is there any way of putting these in the mom logs instead?):

Jun  5 11:52:31 cs0301 pbs_mom: req_cpyfile, Unable to copy file /opt/torque-data/spool/5511-94.cs0300.corp.23andme.com.OU to bio@cs0300.corp.23andme.com:/home/bio/STDIN.o5511-94
Jun  5 11:52:35 cs0301 pbs_mom: sys_copy, command '/usr/bin/scp -o StrictHostKeyChecking=no /opt/torque-data/spool/5511-92.cs0300.corp.23andme.com.ER bio@cs0300.corp.23andme.com:/home/bio/STDIN.e5511-92' failed with status=1, giving up after 4 attempts

It looks like the test is simply overwhelming scp's capacity.
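
For anyone wanting to check their own nodes, here is a minimal sketch in
Python that tallies these failures per host from syslog lines. The function
name and the patterns are my own, modeled only on the two pbs_mom messages
above; other TORQUE versions or syslog formats may word them differently, so
treat it as a starting point rather than a tested tool.

```python
import re
from collections import Counter

# Syslog prefix as it appears in the messages above: "Jun  5 11:52:31 cs0301 ".
# Assumption: no "[pid]" after "pbs_mom", since the quoted lines have none.
PREFIX = r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+pbs_mom: "

# A copy that could not be completed.
UNABLE_RE = re.compile(PREFIX + r"req_cpyfile, Unable to copy file")

# The copy command failing for the last time before the mom gives up.
GAVE_UP_RE = re.compile(
    PREFIX + r"sys_copy, command .* failed with status=(?P<status>\d+), "
    r"giving up after (?P<attempts>\d+) attempts"
)


def count_copy_failures(lines):
    """Return (unable, gave_up): per-host Counters, one per message type."""
    unable, gave_up = Counter(), Counter()
    for line in lines:
        match = UNABLE_RE.match(line)
        if match:
            unable[match.group("host")] += 1
            continue
        match = GAVE_UP_RE.match(line)
        if match:
            gave_up[match.group("host")] += 1
    return unable, gave_up
```

To use it, pass an open syslog file (the path depends on your distribution)
to count_copy_failures() and print the two counters. Each log entry must be
on a single line, as it is in the log file itself.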

Darren



>
> as we're seeing this occasionally on some nodes, and some
> extra debugging I added implied that O_CREAT was disappearing.
>
> Have just recompiled my moms with extra code to print out
> where that might happen to see if it's deliberately getting
> dropped or not, but it may take a little time to work out
> what's going on.
>
> cheers,
> Chris
> --
> Christopher Samuel - (03) 9925 4751 - Systems Manager
>  The Victorian Partnership for Advanced Computing
>  P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>



-- 
Darren Platt
Senior Director, Research
23andMe, inc