[torqueusers] recovery behavior question
jwang at dataseekonline.com
Thu Feb 14 16:07:00 MST 2008
It's a brand new cluster, with a fresh CentOS 4.5 distro without the
distro's PBS, OFED, or MPI stuff. The torque 2.2.1 source was downloaded
from Cluster Resources, and the OFED stack (and hence OpenMPI) from
OpenFabrics, compiled with gcc, though we did try out the OFED with both
gcc and PGI. How can there be old binaries? It's clearly the executable
that was compiled that's in the path.
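For what it's worth, a quick way to rule out stale binaries is to check what is actually on PATH and what build it reports (a sketch; the search locations below are just guesses at where a distro package might have left copies):

```shell
# Confirm which Torque binaries are actually resolved on PATH.
which pbs_server pbs_mom qstat

# Torque daemons report their version and configured install paths:
pbs_server --about
pbs_mom --about

# Hunt for stray copies a distro package might have installed elsewhere
# (hypothetical locations, adjust to taste):
find /usr /opt -name 'pbs_mom' -o -name 'pbs_server' 2>/dev/null
```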
On 2/14/08 1:51 PM, "Garrick Staples" <garrick at usc.edu> wrote:
> On Thu, Feb 14, 2008 at 12:12:46PM -0600, John Wang alleged:
>> Hello Tim
>> So you're stopping the pbs_mom daemon on the compute nodes to prevent jobs
>> from running on them?
> That's not what he said at all.
>> That had been the practice here as well. It just seems to me that we
>> shouldn't have to use such workarounds.
> Something is weird with your install, probably old binaries hanging around.
> Please don't project this on to everyone else's systems.
>> On 2/13/08 8:48 PM, "Tim Freeman" <tfreeman at mcs.anl.gov> wrote:
>>> If I submit a job with all moms in the pool in the 'down' state, the job
>>> sits in the queue as expected. Then I bring up a node, but the job in the
>>> queue does not run until I submit another job (and then they both run).
>>> Is this expected? Is there a setting I am missing to get around this?
>>> (Torque 2.2.1, Maui 3.2.6p19 but I saw this with pbs_sched too)
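One workaround that avoids submitting a dummy job (a sketch, not verified against 2.2.1): toggling the server's scheduling attribute asks pbs_server to kick off a scheduling cycle immediately, rather than waiting for the next periodic iteration (the server's scheduler_iteration attribute, typically 600 seconds by default):

```shell
# Force an immediate scheduling cycle after the node comes back up:
qmgr -c 'set server scheduling = true'

# Verify the mom is reported free rather than down, and that the
# queued job gets picked up:
pbsnodes -a | grep state
qstat -a
```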
>>> torqueusers mailing list
>>> torqueusers at supercluster.org