[torqueusers] scp unreliability
darren at 23andme.com
Fri Jun 6 09:54:33 MDT 2008
On Fri, Jun 6, 2008 at 4:37 AM, Chris Samuel <csamuel at vpac.org> wrote:
> ----- "Darren Platt" <darren at 23andme.com> wrote:
> > Just to elaborate on my earlier comments on the scp mechanism for file
> > transfer. Here's a simple test that breaks it on our (modestly small)
> > test cluster:
> Two questions:
> 1) Is this with 2.3 ?
Yes, 2.3.0.
> 2) Can you check your syslog and mom logs for things like:
> pbs_mom: No such file or directory (2) in open_std_file, cannot open/create
> stdout/stderr file '/usr/spool/PBS/spool/253428.tango-m.vpac.org.OU'
I didn't find that one. I did finally locate the failure notification in
the syslogs (is there any way of putting these in the mom logs instead?):
Jun 5 11:52:31 cs0301 pbs_mom: req_cpyfile, Unable to copy file
com.OU to bio at cs0300.corp.23andme.com:/home/bio/STDIN.o5511-94
Jun 5 11:52:35 cs0301 pbs_mom: sys_copy, command '/usr/bin/scp -o
bio at cs0300.corp.23andme.com:/home/bio/STDIN.e5511-92'
failed with status=1, giving up after 4 attempts
It looks like the test is simply overwhelming scp's capacity.
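For anyone wanting to see how widespread these failures are, here is a rough Python sketch that tallies pbs_mom copy failures per node from syslog lines. It assumes the lines look like the two excerpts above (timestamp, host, "pbs_mom:", then req_cpyfile or sys_copy); the function name count_copy_failures is made up for illustration, and the pattern may need adjusting for other syslog formats.

```python
import re
from collections import Counter

# Matches syslog lines shaped like the excerpts above, e.g.
#   Jun 5 11:52:31 cs0301 pbs_mom: req_cpyfile, Unable to copy file ...
#   Jun 5 11:52:35 cs0301 pbs_mom: sys_copy, command '/usr/bin/scp -o ...
# The exact layout is an assumption based on those two samples.
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+pbs_mom:\s+"
    r"(?P<func>req_cpyfile|sys_copy),"
)


def count_copy_failures(lines):
    """Count pbs_mom copy-related messages per (host, function) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match:
            counts[(match.group("host"), match.group("func"))] += 1
    return counts
```

It can be fed any iterable of lines, for example an open syslog file (the path varies by distribution, so none is hard-coded here). A high req_cpyfile count concentrated on a few nodes during the test would fit the overload theory, though it would not prove it.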
> as we're seeing this occasionally on some nodes, and some
> extra debugging I added implied that O_CREAT was disappearing.
> Have just recompiled my moms with extra code to print out
> where that might happen to see if it's deliberately getting
> dropped or not, but it may take a little time to work out
> what's going on..
> Christopher Samuel - (03) 9925 4751 - Systems Manager
> The Victorian Partnership for Advanced Computing
> P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency
Senior Director, Research