[torqueusers] scp unreliability
Darren Platt
darren at 23andme.com
Thu Jun 5 14:01:13 MDT 2008
Just to elaborate on my earlier comments on the scp mechanism for file
transfer. Here's a simple test that breaks it on our (modestly small) test
cluster:
# Run a lot of small quickly exiting jobs on the cluster
echo "sleep 0.1; echo hello world" | qsub -t 1-100
Post completion:
ls -l STDIN.* | wc -l
118
Random stdout/stderr files are not returned, and there doesn't seem to be
a pattern. Whole ranges go missing, presumably when too many results were
coming back simultaneously. Only 118 files arrived where 200 were expected
(one stdout and one stderr file per job). Here's the full set; for example,
stdout files 21-36 are missing. The count varies from run to run. The
missing files all ended up in the undelivered directories across the
nodes. Is there any way to get torque to retry delivery a few times in the
event of failure?
$ ls STD*
STDIN.e5510-1 STDIN.e5510-18 STDIN.e5510-39 STDIN.e5510-51
STDIN.e5510-72 STDIN.e5510-84 STDIN.e5510-93 STDIN.o5510-15
STDIN.o5510-39 STDIN.o5510-50 STDIN.o5510-72 STDIN.o5510-83
STDIN.e5510-10 STDIN.e5510-19 STDIN.e5510-4 STDIN.e5510-52
STDIN.e5510-73 STDIN.e5510-85 STDIN.e5510-94 STDIN.o5510-16
STDIN.o5510-4 STDIN.o5510-51 STDIN.o5510-73 STDIN.o5510-84
STDIN.e5510-100 STDIN.e5510-2 STDIN.e5510-40 STDIN.e5510-54
STDIN.e5510-74 STDIN.e5510-86 STDIN.e5510-97 STDIN.o5510-17
STDIN.o5510-40 STDIN.o5510-53 STDIN.o5510-74 STDIN.o5510-85
STDIN.e5510-11 STDIN.e5510-20 STDIN.e5510-41 STDIN.e5510-55
STDIN.e5510-75 STDIN.e5510-87 STDIN.o5510-1 STDIN.o5510-18
STDIN.o5510-41 STDIN.o5510-56 STDIN.o5510-75 STDIN.o5510-87
STDIN.e5510-12 STDIN.e5510-22 STDIN.e5510-42 STDIN.e5510-6
STDIN.e5510-76 STDIN.e5510-88 STDIN.o5510-10 STDIN.o5510-19
STDIN.o5510-42 STDIN.o5510-6 STDIN.o5510-76 STDIN.o5510-88
STDIN.e5510-13 STDIN.e5510-26 STDIN.e5510-43 STDIN.e5510-63
STDIN.e5510-79 STDIN.e5510-89 STDIN.o5510-100 STDIN.o5510-2
STDIN.o5510-44 STDIN.o5510-63 STDIN.o5510-79 STDIN.o5510-9
STDIN.e5510-14 STDIN.e5510-3 STDIN.e5510-45 STDIN.e5510-65
STDIN.e5510-8 STDIN.e5510-9 STDIN.o5510-11 STDIN.o5510-20
STDIN.o5510-45 STDIN.o5510-65 STDIN.o5510-8 STDIN.o5510-91
STDIN.e5510-15 STDIN.e5510-30 STDIN.e5510-49 STDIN.e5510-7
STDIN.e5510-80 STDIN.e5510-90 STDIN.o5510-12 STDIN.o5510-3
STDIN.o5510-46 STDIN.o5510-7 STDIN.o5510-80 STDIN.o5510-97
STDIN.e5510-16 STDIN.e5510-34 STDIN.e5510-5 STDIN.e5510-70
STDIN.e5510-81 STDIN.e5510-91 STDIN.o5510-13 STDIN.o5510-37
STDIN.o5510-49 STDIN.o5510-70 STDIN.o5510-81
STDIN.e5510-17 STDIN.e5510-37 STDIN.e5510-50 STDIN.e5510-71
STDIN.e5510-83 STDIN.e5510-92 STDIN.o5510-14 STDIN.o5510-38
STDIN.o5510-5 STDIN.o5510-71 STDIN.o5510-82
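Picking the gaps out of a listing like this by eye is error-prone, so here
is a rough sketch of a helper that prints the expected output files that
did not arrive. It assumes the PREFIX.oJOBID-INDEX / PREFIX.eJOBID-INDEX
naming seen in the listing; the function name list_missing and its
arguments are made up for illustration and are not part of torque.

```shell
# list_missing PREFIX JOBID FIRST LAST   (hypothetical helper, not a torque tool)
# Prints the name of every expected stdout (.o) or stderr (.e) file that is
# absent from the current directory, for array indices FIRST..LAST.
list_missing() {
    prefix=$1
    jobid=$2
    first=$3
    last=$4
    i=$first
    while [ "$i" -le "$last" ]; do
        for kind in o e; do
            f="${prefix}.${kind}${jobid}-${i}"
            # Report the file only if it does not exist
            [ -e "$f" ] || echo "$f"
        done
        i=$((i + 1))
    done
}
```

For the run above it would be called from the submission directory as
"list_missing STDIN 5510 1 100", which makes it easy to compare the gaps
between runs.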
--
Darren Platt
Senior Director, Research
23andMe, inc