[torqueusers] Long hostnames on large clusters causing problems in torque.

Roy Dragseth Roy.Dragseth at cc.uit.no
Tue Apr 8 02:29:05 MDT 2008


On Sunday 17 February 2008, Roy Dragseth wrote:
> On Wednesday 13 February 2008, Garrick Staples wrote:
> > On Wed, Dec 05, 2007 at 09:56:21PM +0100, Roy Dragseth alleged:
> > > The default Rocks cluster setup uses names like compute-x-y.local for
> > > the compute nodes.  This seems to cause problems in torque when one
> > > wants to run a large job.  The queuing system becomes unusable when a
> > > user submits a large job (I have tried 4096 cpus) with this naming
> > > convention.
> > >
> > > If I submit a 4096 cpu job then this is what qstat shows:
> > >
> > > # qstat
> > > qstat: End of File
> > >
> > > Of course the quick fix is to shorten the hostnames, and fortunately
> > > Rocks has short-name aliases of the form cx-y.  Using this convention
> > > in the nodes file makes the 4096 cpu job run fine, but with the current
> > > growth in cluster sizes it will not take long before even
> > > short-named clusters run into the same problem.
> >
> > I just noticed this email.  Sorry I missed it the first time around.
> >
> > Do you know roughly where the breakdown occurs?  I assume the job is
> > submitted correctly.
>
> I did not trace it down to any exact number of nodes, and with the fix
> above the system is now in production so I cannot check this out right now.
> Yes, the job submission was correct.
>
> > Is the scheduler able to run it?
>
> No.
>
> > Does the job  actually start?
>
> No.
>
> > Is this only a problem with qstat?
>
> I do not remember if all commands failed or if qdel etc worked, sorry.
>
> My plan is to start testing torque 2.2.X soon (this is on 2.1.8).  I might
> be able to set up a parallel installation to test whether this is still
> present, but it might take some time before I can start working on this.
>

I've just enabled a test installation with torque-2.3.0 and the problem seems
to have been solved.  I can now run a job on 5520 cores/690 nodes with
compute node names like compute-X-Y.local without torque bailing out on me.

(Maui crashes, though, but I'll take that to the Maui list; shorter host names
of the form cX-Y work fine.)
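
For anyone who wants to apply the short-name workaround to an existing nodes
file, here is a minimal sketch in Python.  It assumes the long names always
look like compute-X-Y.local with numeric X and Y, and that each nodes-file
line is a hostname followed by attributes such as np=8.  The function names
short_name and shorten_nodes_file are made up for this example; they are not
part of torque or Rocks.

```python
import re

# Assumed Rocks long-name pattern: compute-<X>-<Y>, optionally ending in .local
_LONG_NAME = re.compile(r"^compute-(\d+)-(\d+)(?:\.local)?$")


def short_name(host):
    """Return the cX-Y alias for a compute-X-Y.local name.

    Names that do not match the assumed pattern are returned unchanged.
    """
    m = _LONG_NAME.match(host)
    return f"c{m.group(1)}-{m.group(2)}" if m else host


def shorten_nodes_file(text):
    """Rewrite the hostname (first field) of every nodes-file line.

    Blank lines and lines starting with '#' are passed through untouched;
    attributes after the hostname (e.g. np=8) are kept as they are.
    """
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            out.append(line)
            continue
        host, *attrs = stripped.split()
        out.append(" ".join([short_name(host), *attrs]))
    return "\n".join(out)


if __name__ == "__main__":
    sample = "compute-0-0.local np=8\ncompute-12-3.local np=8"
    print(shorten_nodes_file(sample))
```

The short aliases must of course resolve on the server and on all compute
nodes before the rewritten file is put into use.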

r.

-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
	      phone:+47 77 64 41 07, fax:+47 77 64 41 00
        Roy Dragseth, Team Leader, High Performance Computing
	 Direct call: +47 77 64 62 56. email: royd at cc.uit.no

