[torqueusers] undelivered output of jobs
Darren Platt
darren at 23andme.com
Thu Jun 5 10:42:48 MDT 2008
I had exactly this problem, even though scp seemed to work fine. I gave up
in the end and just turned off strict host key checking:
$rcpcmd /usr/bin/scp -o StrictHostKeyChecking=no
After that the problem went away.
I have been dealing with a much more evil problem while doing some
benchmarking, though. I run 100 jobs on a cluster of 4-core machines (so load
could be up to 4 per node). If I run jobs with a very short duration, e.g.
0.1 seconds, only some of the stdout and stderr files are copied back. In
some tests as few as 80 out of 100 arrived successfully. The missing files
were left on the nodes themselves. When I tried jobs with a slightly longer
duration (e.g. 5 seconds), it did better, but still only 95% were returned.
I assume the problem is some kind of denial of service caused by many jobs
exiting at the same time, but since I can't assume jobs will exit nicely
spaced out in production, this seems to be a concern. I also did some stress
testing of the cluster, killing and restarting the mom processes to simulate
node failure, and saw a low but significant rate of cases where the same job
ran twice on a node (once unsuccessfully and once successfully) and the
stdout was concatenated, giving me two copies of the desired output. Also
not good for a production system. Any advice, or is using scp fundamentally
a bad idea?
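
For anyone who wants to reproduce the delivery counts, here is a minimal
Python sketch of how the returned files could be tallied. It assumes the
jobs use the default output naming (<jobname>.o<sequence number> and
<jobname>.e<sequence number>) and that all files come back to one
directory. The function names and the job name "bench" are made up for
illustration; this is not the script used in the tests above.

```python
import os


def missing_outputs(directory, job_name, seq_numbers):
    """Return the sequence numbers whose stdout or stderr file is absent.

    Assumes default naming: <job_name>.o<seq> and <job_name>.e<seq>.
    """
    present = set(os.listdir(directory))
    missing = []
    for seq in seq_numbers:
        out_file = f"{job_name}.o{seq}"
        err_file = f"{job_name}.e{seq}"
        # A job counts as delivered only if both files arrived.
        if out_file not in present or err_file not in present:
            missing.append(seq)
    return missing


def delivery_rate(total, missing):
    """Fraction of jobs whose output files were both delivered."""
    return (total - len(missing)) / total if total else 0.0
```

Calling missing_outputs(".", "bench", range(first, last + 1)) with the
sequence numbers of the submitted jobs gives the list of jobs to go and
look for on the nodes.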
Darren
>
> You might make sure you can ssh a few ways back and forth to set the ssh
>> keys. I noticed I would ssh to the short hostname all the time and would get
>> stung by this since I hadn't ssh'd to the machine with the full hostname
>> too. Just a thought.
>>
>> -Steve
>>
>> On Jun 5, 2008, at 12:03 PM, Adrian Sevcenco wrote:
>>
>>
> I continue to be a bit confused on this topic, is passwordless ssh
> required both ways for root, the user or, both?
>
> Tony Schreiner
> Boston College
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
--
Darren Platt
Senior Director, Research
23andMe, inc