[torqueusers] Long hostnames on large clusters causing problems
roy.dragseth at cc.uit.no
Sat Feb 16 17:27:22 MST 2008
On Wednesday 13 February 2008, Garrick Staples wrote:
> On Wed, Dec 05, 2007 at 09:56:21PM +0100, Roy Dragseth alleged:
> > The default Rocks cluster setup uses names like compute-x-y.local for the
> > compute nodes. This seems to cause problems in torque when one wants to
> > run a large job. The queuing system becomes unusable when a user submits a
> > large job with this naming convention; I have tried 4096 cpus.
> > If I submit a 4096 cpu job then this is what qstat shows:
> > # qstat
> > qstat: End of File
> > Of course the quick fix is to shorten the hostnames, and fortunately Rocks
> > has shortname aliases of the form cx-y. Using this convention in the
> > nodes file makes the 4096 cpu job run fine, but with the current growth
> > of the cluster sizes it will not take long before even short-named
> > clusters run into the same problem.
> I just noticed this email. Sorry I missed it the first time around.
> Do you know roughly where the breakdown occurs? I assume the job is
> submitted correctly.
I did not trace it down to any exact number of nodes, and with the fix above
the system is now in production so I cannot check this out right now.
Yes, the job submission was correct.
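For a rough feel of the sizes involved, here is a back-of-the-envelope sketch
in Python. It only compares how long a per-CPU host list gets with the two
naming conventions. The "host/index" entries joined by "+", the 8 cpus per
node, and the 16 x 32 node layout are all assumptions made for illustration.
Whether such a list is what actually triggers the failure has not been
established.

```python
# Rough size estimate only; not taken from the torque sources.
# Assumptions: one "host/cpu_index" entry per cpu, entries joined by "+",
# 8 cpus per node, and 16 racks of 32 nodes (512 nodes, 4096 cpus).

CPUS_PER_NODE = 8
RACKS, NODES_PER_RACK = 16, 32


def node_names(pattern, racks, nodes_per_rack):
    """Build node names such as compute-0-0.local or c0-0."""
    return [pattern.format(r, n)
            for r in range(racks)
            for n in range(nodes_per_rack)]


def host_list(names, cpus_per_node):
    """Join one host/index entry per cpu into a single string."""
    return "+".join("{}/{}".format(name, cpu)
                    for name in names
                    for cpu in range(cpus_per_node))


long_names = node_names("compute-{}-{}.local", RACKS, NODES_PER_RACK)
short_names = node_names("c{}-{}", RACKS, NODES_PER_RACK)

long_len = len(host_list(long_names, CPUS_PER_NODE))
short_len = len(host_list(short_names, CPUS_PER_NODE))

print("long names :", long_len, "characters")
print("short names:", short_len, "characters")
print("difference :", long_len - short_len, "characters")
```

Under these assumptions each long name costs 13 characters more than its
short alias ("compute-" instead of "c", plus ".local"), so over 4096 entries
the long form is 53248 characters longer.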
> Is the scheduler able to run it?
> Does the job actually start?
> Is this only a problem with qstat?
I do not remember if all commands failed or if qdel etc. worked, sorry.
My plan is to start testing torque 2.2.X soon (this is on 2.1.8). I might be
able to set up a parallel installation to test if this is still present. But
it might take some time before I can start working on this.
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: royd at cc.uit.no