[torqueusers] Server not talking to MOMs at all
jacksond at clusterresources.com
Fri Sep 9 14:27:11 MDT 2005
We are unaware of other crashes anywhere on any system. I would be
very interested in determining what is unique about your policies,
environment, workload, or resources. If you are running anything more
recent than patch 5, please capture these failures with a gdb backtrace
('where') or under valgrind. We will get them fixed immediately.
On Thu, 2005-09-08 at 10:52 +0200, Lennart Karlsson wrote:
> Troy, you wrote:
> > BTW, the $headnode terminology IMHO doesn't make sense in an environment
> > where the system running pbs_server (and likely pbs_sched/maui/moab as
> > well) is *NOT* a login node, although I'm not sure if that describes
> > anyone's system structure other than ours. In any case, I would argue
> > that $pbsserver (or alternately $serverhost) is much more descriptive.
> We also have such environments, with login nodes as separate nodes from
> the Maui/PBS servers, and I support your naming alternatives.
> > I don't see how you can make multi-server support work
> > without a shared filesystem (with working locks, i.e. not NFS) for
> > $PBS_HOME and the addition of a whole bunch of cluster membership and
> > voting code to pbs_server. It seems like overkill if what we're really
> > looking for is simply a high-availability pbs_server.
> > What exactly are people looking for WRT multi-server support in TORQUE,
> > anyway?
> The state of Torque today, in our environments, is that it behaves
> badly or crashes (e.g. the current favourite: it eats all available RAM
> and swap) in one way or another very frequently. Without having measured
> it, I estimate that this happens more than 100 times as often as with the
> underlying Linux plus hardware, so we see no reason to set up two PBS
> servers for one cluster. Torque is developing into more robust software,
> but there is still a long way to go.
> I believe that more "bells and whistles" on Torque might make it less
> available. We need to change Torque, but we have to make the right
> choices and we have to be careful not to put "everything" into it.
> So I would like Torque to be less complicated, not more. I vote NO to
> multi-server support.
> Best wishes,
> -- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
> National Supercomputer Centre in Linköping, Sweden
> torqueusers mailing list
> torqueusers at supercluster.org