[torqueusers] Long hostnames on large clusters causing problems
garrick at usc.edu
Wed Feb 13 14:48:29 MST 2008
On Wed, Dec 05, 2007 at 09:56:21PM +0100, Roy Dragseth alleged:
> The default Rocks cluster setup uses names like compute-x-y.local for the
> compute nodes. This seems to cause problems in torque when one wants to run
> a large job. The queuing system becomes unusable when a user submits a large
> job (I have tried 4096 cpus) with this naming convention.
> If I submit a 4096 cpu job then this is what qstat shows:
> # qstat
> qstat: End of File
> Of course the quick fix is to shorten the hostnames, and fortunately Rocks has
> shortname aliases of the form cx-y. Using this convention in the nodes file
> makes the 4096 cpu job run fine, but with the current growth in cluster
> sizes it will not take long before even short-named clusters run into the
> same problem.
I just noticed this email. Sorry I missed it the first time around.
Do you know roughly where the breakdown occurs? I assume the job is submitted
correctly. Is the scheduler able to run it? Does the job actually start? Is
this only a problem with qstat?
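
For reference, the short-name workaround described in the quoted message
could be sketched roughly as below. This is only an illustration, not
anything taken from torque or Rocks: the helper names (shorten,
shorten_nodes_line, host_list_length) are made up, the regular expression
assumes long names look exactly like compute-<x>-<y>.local, and the nodes
file is assumed to have the host name as the first field on each line. The
last helper assumes, without confirmation, that the failure is related to
the total length of a per-job host list of the form host/slot+host/slot+...,
and simply compares that length for long and short names.

```python
import re

# Assumed Rocks-style long name: compute-<x>-<y>, optionally ending in .local
LONG_NAME = re.compile(r"^compute-(\d+)-(\d+)(?:\.local)?$")


def shorten(hostname):
    """Map compute-x-y.local to cx-y; return any other name unchanged."""
    m = LONG_NAME.match(hostname)
    if m is None:
        return hostname
    return "c{}-{}".format(m.group(1), m.group(2))


def shorten_nodes_line(line):
    """Rewrite the first field (the host name) of one nodes-file line."""
    stripped = line.strip()
    # Leave blank lines and comment lines alone.
    if not stripped or stripped.startswith("#"):
        return stripped
    fields = stripped.split()
    fields[0] = shorten(fields[0])
    return " ".join(fields)


def host_list_length(hostnames, slots_per_node):
    """Length of a 'host/slot+host/slot+...' string (assumed format)."""
    entries = [
        "{}/{}".format(host, slot)
        for host in hostnames
        for slot in range(slots_per_node)
    ]
    return len("+".join(entries))


if __name__ == "__main__":
    # Hypothetical layout: 16 racks x 32 nodes x 8 slots = 4096 cpus.
    long_names = [
        "compute-{}-{}.local".format(rack, node)
        for rack in range(16)
        for node in range(32)
    ]
    short_names = [shorten(name) for name in long_names]

    print(shorten_nodes_line("compute-0-12.local np=8"))
    print("long names: ", host_list_length(long_names, 8), "characters")
    print("short names:", host_list_length(short_names, 8), "characters")
```

Running shorten_nodes_line over every line of a copy of the nodes file
would give the short-name version. The length comparison only shows how
much the host list shrinks; it does not show where the limit actually is.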