[torqueusers] undelivered output of jobs

Darren Platt darren at 23andme.com
Thu Jun 5 10:42:48 MDT 2008

I had exactly this problem, even though scp seemed to work fine.  I gave up
in the end and just turned off strict host key checking:

$rcpcmd /usr/bin/scp -o StrictHostKeyChecking=no

The problem went away.

I have been dealing with a much more evil problem while doing some
benchmarking, though.  I run 100 jobs on a cluster of 4-core machines (so
load could be up to 4 per node).  If I run jobs with a very short duration
(e.g. 0.1 seconds), the stdout and stderr are copied back for only some of
the jobs.  In some tests as few as 80 out of 100 arrived successfully.  The
missing files were left on the nodes themselves.  If I tried jobs with a
slightly longer duration (e.g. 5 seconds), it did better, but still only 95%
returned.  I assume the problem is some kind of denial of service due to
many jobs exiting simultaneously, but since I can't assume things will exit
nicely spaced out in production, this is a concern.  I also did some stress
testing of the cluster, killing and restarting the mom processes to
simulate node failure, and saw a low but significant rate of cases where the
same job ran twice on a node (once unsuccessfully and once successfully) and
the stdout was concatenated, giving me two copies of the desired output.
Also not good for a production system.  Any advice, or is using scp
fundamentally a bad idea?
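
For anyone wanting to reproduce the counts above, here is a minimal sketch
in Python of how the returned files could be tallied.  It is illustrative
only, not the script used for these tests.  It assumes all output lands in
one directory under the default names <jobname>.o<seq> and <jobname>.e<seq>;
the job name "bench" and both helper names are hypothetical.  The second
helper is just a heuristic for the doubled-stdout case and would misfire on
a job whose legitimate output happens to repeat itself.

```python
import os


def missing_outputs(output_dir, job_name, seq_numbers):
    """Return (seq, kind) pairs for expected output files that are absent.

    Assumes the default naming <job_name>.o<seq> for stdout and
    <job_name>.e<seq> for stderr (an assumption; adjust if -o/-e is used).
    """
    present = set(os.listdir(output_dir))
    missing = []
    for seq in sorted(seq_numbers):
        for kind in ("o", "e"):
            if f"{job_name}.{kind}{seq}" not in present:
                missing.append((seq, kind))
    return missing


def looks_doubled(text):
    """True if text is non-empty and is exactly two identical halves.

    Heuristic for spotting stdout that was concatenated from two runs.
    """
    half, remainder = divmod(len(text), 2)
    return bool(text) and remainder == 0 and text[:half] == text[half:]


if __name__ == "__main__":
    # Hypothetical run: 100 jobs named "bench" with sequence numbers 1..100,
    # output delivered to the current directory.
    lost = missing_outputs(".", "bench", range(1, 101))
    print(f"{len(lost)} of 200 expected files are missing")
```

Running this once after the queue drains, and again a minute later, would
also show whether files are merely late or never delivered.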


>> You might make sure you can ssh a few ways back and forth to set the ssh
>> keys. I noticed I would ssh to the short hostname all the time and would get
>> stung by this since I hadn't ssh'd to the machine with the full hostname
>> too. Just a thought.
>> -Steve
>> On Jun 5, 2008, at 12:03 PM, Adrian Sevcenco wrote:
> I continue to be a bit confused on this topic: is passwordless ssh
> required both ways for root, the user, or both?
> Tony Schreiner
> Boston College

Darren Platt
Senior Director, Research
23andMe, inc