[torqueusers] Serious torque failure problems
Stewart.Samuels at sanofi-aventis.com
Thu Aug 11 12:03:04 MDT 2005
By default, torque/pbs uses RPP (UDP). I have been thinking about
disabling RPP, which forces torque to use TCP instead of UDP, and I am
curious how many torque sites do this. Since UDP does NOT
guarantee packet delivery and TCP does, I would think TCP should be the
default. The obvious drawback is that because UDP avoids TCP's
overhead, it can achieve higher performance. So it seems to be a
trade-off between communication reliability and performance. On the other
hand, if packet communication is not robust, what good is performance?
How many sites disable RPP, and why have you elected to do so?
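For anyone wanting to test this, a minimal sketch of rebuilding torque with RPP disabled; the install prefix is illustrative and your site's configure options will differ:

```shell
# Rebuild torque so server<->MOM traffic uses TCP instead of RPP/UDP.
# --disable-rpp is passed at configure time; the prefix is an example.
./configure --disable-rpp --prefix=/usr/local/torque
make
make install
# Afterwards, restart pbs_server, the scheduler, and the MOMs on all
# nodes so every daemon is running the TCP-only build.
```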
On Thu, 2005-08-11 at 11:24, Marc Langlois wrote:
> On Thu, 2005-08-11 at 07:52, Chris Johnson wrote:
> > Hi,
> > Running torque-1.2.0p2 here on CentOS 4.X across a cluster composed
> > of AMDs, P4s, XEONs, and Opterons, several hundred total.
> > At the moment torque is becoming useless. We are using the
> > out-of-the-box C scheduler and it keeps dying completely. We are seeing
> > jobs running on nodes that no log officially lists them as having
> > been run on. The scheduler gets hung up on problem nodes and tries to
> > run many, many jobs on the bad node. The scheduler/server can't pick up
> > on the fact that a node is bad due to a situation in which mom seems to
> > respond but other linux services have failed. And my boss is about ready
> > to trash torque. Can't blame him, we're spending way too much time
> > maintaining this cluster. Researchers aren't any too thrilled either.
> > As far as I know torque is used in a lot of places. And I don't hear
> > about these problems other places. What the hell is going on? I REALLY
> > need to get this corrected. And I'll provide any information I can.
> > But my community is about ready to start telling people what not to use for
> > cluster operations.
> > Help would be GREATLY appreciated.
> Hi Chris,
> I had a similar experience with PBS a few years ago. Although it could
> be a bit dated, I found that the default C scheduler worked fine for
> testing, but as soon as I rolled it into production, it failed
> miserably. Switching to the Maui scheduler solved all my problems.
> Another thing that helped was using the "--disable-rpp" flag when
> running configure for torque. It seems that RPP was flooding the network
> with UDP traffic that hung the PBS server and scheduler (we were running
> on Solaris).
> As far as finding an alternative system, I recently gave SGE a try to see
> how it compared to torque. Could be that I'm used to the PBS way,
> but SGE had its own set of quirks that I couldn't get around, so I
> dropped it and came back to torque.
> Hope this helps. Good luck!