[torqueusers] scp unreliability

Darren Platt darren at 23andme.com
Thu Jun 5 14:01:13 MDT 2008

Just to elaborate on my earlier comments on the scp mechanism for file
transfer: here's a simple test that breaks it on our (modestly small) test
cluster.

# Run a lot of small quickly exiting jobs on the cluster
echo  "sleep 0.1; echo hello world"  | qsub -t 1-100

After all jobs complete, count the returned files (200 expected: one
stdout and one stderr file per array job):

ls -l STDIN.* | wc -l

Random stdout/stderr files are not returned, and there doesn't seem to be a
pattern.  Whole ranges go missing, presumably when too many files were
coming back simultaneously.  The full listing is below; for example, stdout
files 21-36 are missing.  The count varies from run to run.  The files that
were not returned all ended up in the directories across the nodes.  Is
there any way to get torque to retry delivery a few times in the event of
failure?

$ ls STD*
STDIN.e5510-1    STDIN.e5510-18  STDIN.e5510-39  STDIN.e5510-51
STDIN.e5510-72  STDIN.e5510-84  STDIN.e5510-93   STDIN.o5510-15
STDIN.o5510-39  STDIN.o5510-50  STDIN.o5510-72  STDIN.o5510-83
STDIN.e5510-10   STDIN.e5510-19  STDIN.e5510-4   STDIN.e5510-52
STDIN.e5510-73  STDIN.e5510-85  STDIN.e5510-94   STDIN.o5510-16
STDIN.o5510-4   STDIN.o5510-51  STDIN.o5510-73  STDIN.o5510-84
STDIN.e5510-100  STDIN.e5510-2   STDIN.e5510-40  STDIN.e5510-54
STDIN.e5510-74  STDIN.e5510-86  STDIN.e5510-97   STDIN.o5510-17
STDIN.o5510-40  STDIN.o5510-53  STDIN.o5510-74  STDIN.o5510-85
STDIN.e5510-11   STDIN.e5510-20  STDIN.e5510-41  STDIN.e5510-55
STDIN.e5510-75  STDIN.e5510-87  STDIN.o5510-1    STDIN.o5510-18
STDIN.o5510-41  STDIN.o5510-56  STDIN.o5510-75  STDIN.o5510-87
STDIN.e5510-12   STDIN.e5510-22  STDIN.e5510-42  STDIN.e5510-6
STDIN.e5510-76  STDIN.e5510-88  STDIN.o5510-10   STDIN.o5510-19
STDIN.o5510-42  STDIN.o5510-6   STDIN.o5510-76  STDIN.o5510-88
STDIN.e5510-13   STDIN.e5510-26  STDIN.e5510-43  STDIN.e5510-63
STDIN.e5510-79  STDIN.e5510-89  STDIN.o5510-100  STDIN.o5510-2
STDIN.o5510-44  STDIN.o5510-63  STDIN.o5510-79  STDIN.o5510-9
STDIN.e5510-14   STDIN.e5510-3   STDIN.e5510-45  STDIN.e5510-65
STDIN.e5510-8   STDIN.e5510-9   STDIN.o5510-11   STDIN.o5510-20
STDIN.o5510-45  STDIN.o5510-65  STDIN.o5510-8   STDIN.o5510-91
STDIN.e5510-15   STDIN.e5510-30  STDIN.e5510-49  STDIN.e5510-7
STDIN.e5510-80  STDIN.e5510-90  STDIN.o5510-12   STDIN.o5510-3
STDIN.o5510-46  STDIN.o5510-7   STDIN.o5510-80  STDIN.o5510-97
STDIN.e5510-16   STDIN.e5510-34  STDIN.e5510-5   STDIN.e5510-70
STDIN.e5510-81  STDIN.e5510-91  STDIN.o5510-13   STDIN.o5510-37
STDIN.o5510-49  STDIN.o5510-70  STDIN.o5510-81
STDIN.e5510-17   STDIN.e5510-37  STDIN.e5510-50  STDIN.e5510-71
STDIN.e5510-83  STDIN.e5510-92  STDIN.o5510-14   STDIN.o5510-38
STDIN.o5510-5   STDIN.o5510-71  STDIN.o5510-82
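
To see exactly which array indices went missing, rather than just counting
files, a small shell function along these lines might help.  This is only a
sketch: the name missing_outputs is made up for illustration, and it assumes
the PREFIX.oJOBID-INDEX / PREFIX.eJOBID-INDEX naming seen in the listing
above.

```shell
# Hypothetical helper (not part of torque): print the name of every
# expected stdout/stderr file that is absent from the current directory.
# Usage: missing_outputs PREFIX JOBID FIRST LAST
missing_outputs() {
    prefix=$1; jobid=$2; first=$3; last=$4
    i=$first
    while [ "$i" -le "$last" ]; do
        # "o" is the stdout file, "e" is the stderr file
        for kind in o e; do
            f="${prefix}.${kind}${jobid}-${i}"
            [ -e "$f" ] || echo "$f"
        done
        i=$((i + 1))
    done
}
```

For the run above it would be called as: missing_outputs STDIN 5510 1 100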

Darren Platt
Senior Director, Research
23andMe, Inc.