[torqueusers] Long hostnames on large clusters causing problems in torque.

Roy Dragseth roy.dragseth at cc.uit.no
Sat Feb 16 17:27:22 MST 2008


On Wednesday 13 February 2008, Garrick Staples wrote:
> On Wed, Dec 05, 2007 at 09:56:21PM +0100, Roy Dragseth alleged:
> > The default Rocks cluster setup uses names like compute-x-y.local for the
> > compute nodes.  This seems to cause problems in torque when one wants to
> > run a large job.  The queueing system becomes unusable when a user submits
> > a large job (I have tried 4096 cpus) with this naming convention.
> >
> > If I submit a 4096 cpu job then this is what qstat shows:
> >
> > # qstat
> > qstat: End of File
> >
> > Of course the quick fix is to shorten the hostnames, and fortunately Rocks
> > has short-name aliases of the form cx-y.  Using this convention in the
> > nodes file makes the 4096 cpu job run fine, but with the current growth
> > in cluster sizes it will not take long before even short-named
> > clusters run into the same problem.
>
> I just noticed this email.  Sorry I missed it the first time around.
>
> Do you know roughly where the breakdown occurs?  I assume the job is
> submitted correctly.

I did not trace it down to any exact number of nodes, and with the fix above 
the system is now in production so I cannot check this out right now.
Yes, the job submission was correct.
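
For a rough sense of scale only, here is a small sketch that compares how
large a "+"-joined list of host/cpu entries becomes under the two naming
schemes.  The node layout (16 racks of 32 nodes, 8 cpus per node) and the
helper name host_list are assumptions made for illustration, not our actual
configuration, and I have not confirmed that the length of such a list is
what triggers the failure.

```python
# Rough size estimate of a "+"-joined host/cpu list for two naming schemes.
# Assumptions (not from the original report): 512 nodes with 8 cpus each,
# laid out as 16 racks of 32 nodes, and entries of the form "host/cpu".

def host_list(name_format, racks=16, nodes_per_rack=32, cpus_per_node=8):
    """Build one 'host/cpu' entry per cpu and join them with '+'."""
    entries = []
    for rack in range(racks):
        for rank in range(nodes_per_rack):
            host = name_format.format(rack=rack, rank=rank)
            for cpu in range(cpus_per_node):
                entries.append(f"{host}/{cpu}")
    return "+".join(entries)

long_list = host_list("compute-{rack}-{rank}.local")   # default Rocks names
short_list = host_list("c{rack}-{rank}")               # short aliases

print(len(long_list.split("+")), "entries")
print("long names: ", len(long_list), "characters")
print("short names:", len(short_list), "characters")
```

With this layout each long name costs 13 characters more per entry than its
short alias, so the gap grows linearly with the number of cpus in the job.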


> Is the scheduler able to run it?  

No.

> Does the job actually start?

No.

> Is this only a problem with qstat? 

I do not remember whether all commands failed or whether qdel etc. worked,
sorry.

My plan is to start testing torque 2.2.X soon (this is on 2.1.8), and I might
be able to set up a parallel installation to test whether the problem is still
present.  But it might take some time before I can start working on this.


Regards,
r.

-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
              phone:+47 77 64 41 07, fax:+47 77 64 41 00
     Roy Dragseth, Team Leader, High Performance Computing
         Direct call: +47 77 64 62 56. email: royd at cc.uit.no

