[torqueusers] Long hostnames on large clusters causing problems
Roy.Dragseth at cc.uit.no
Tue Apr 8 02:29:05 MDT 2008
On Sunday 17 February 2008, Roy Dragseth wrote:
> On Wednesday 13 February 2008, Garrick Staples wrote:
> > On Wed, Dec 05, 2007 at 09:56:21PM +0100, Roy Dragseth alleged:
> > > The default Rocks cluster setup uses names like compute-x-y.local for
> > > the compute nodes. This seems to cause problems in torque when one
> > > wants to run a large job. The queueing system becomes unusable when a
> > > user submits a large job (I have tried 4096 cpus) with this naming
> > > convention.
> > >
> > > If I submit a 4096 cpu job then this is what qstat shows:
> > >
> > > # qstat
> > > qstat: End of File
> > >
> > > Of course the quick fix is to shorten the hostnames, and fortunately
> > > Rocks has shortname aliases of the form cx-y. Using this convention
> > > in the nodes file makes the 4096 cpu job run fine, but with the current
> > > growth in cluster sizes it will not take long before even
> > > short-named clusters run into the same problem.
> > I just noticed this email. Sorry I missed it the first time around.
> > Do you know roughly where the breakdown occurs? I assume the job is
> > submitted correctly.
> I did not trace it down to any exact number of nodes, and with the fix
> above the system is now in production so I cannot check this out right now.
> Yes, the job submission was correct.
> > Is the scheduler able to run it?
> > Does the job actually start?
> > Is this only a problem with qstat?
> I do not remember if all commands failed or if qdel etc. worked, sorry.
> My plan is to start testing torque 2.2.X soon (this is on 2.1.8); I might
> be able to set up a parallel installation to test if this is still present.
> But it might take some time before I can start working on this.
I've just set up a test installation with torque-2.3.0 and the problem seems
to have been solved. I can now run a job on 5520 cores/690 nodes with
compute node names like compute-X-Y.local without torque bailing out on me.
(Maui crashes though, but I'll take that to the maui list; shorter host names
of the form cX-Y work fine.)
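
For anyone who wants to compare the two naming schemes, here is a rough
Python sketch. It maps the compute-X-Y.local names to the cX-Y aliases and
estimates how long a per-core host list becomes in each case. The
"host/core" entries joined by "+" are an assumed format used only for
illustration, the rack layout (23 racks of 30 nodes) is hypothetical, and
short_name and host_list_length are made-up helper names, not part of torque.

```python
import re

# Pattern for the long Rocks-style names discussed above (compute-X-Y.local).
LONG_NAME = re.compile(r"^compute-(\d+)-(\d+)\.local$")


def short_name(host: str) -> str:
    """Map compute-X-Y.local to the cX-Y alias; leave other names unchanged."""
    match = LONG_NAME.match(host)
    if match is None:
        return host
    rack, rank = match.groups()
    return f"c{rack}-{rank}"


def host_list_length(hosts, cores_per_node):
    """Length of a '+'-joined host/core list (assumed format, illustration only)."""
    entries = [f"{h}/{i}" for h in hosts for i in range(cores_per_node)]
    return len("+".join(entries))


if __name__ == "__main__":
    # Hypothetical layout: 690 nodes as 23 racks of 30 nodes, 8 cores per node
    # (5520 cores / 690 nodes = 8).
    long_hosts = [f"compute-{r}-{n}.local" for r in range(23) for n in range(30)]
    short_hosts = [short_name(h) for h in long_hosts]
    print("long names: ", host_list_length(long_hosts, 8), "characters")
    print("short names:", host_list_length(short_hosts, 8), "characters")
```

This only shows how the size of such a list scales with the length of the
host names; it does not say where the limit in torque 2.1.8 actually was.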
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: royd at cc.uit.no