[torqueusers] Long hostnames on large clusters causing problems in torque.

Garrick Staples garrick at usc.edu
Wed Feb 13 14:48:29 MST 2008


On Wed, Dec 05, 2007 at 09:56:21PM +0100, Roy Dragseth alleged:
> The default Rocks cluster setup uses names like compute-x-y.local for the
> compute nodes.  This seems to cause problems in torque when one wants to run
> a large job.  The queueing system becomes unusable when a user submits a large
> job (I have tried 4096 cpus) with this naming convention.
> 
> If I submit a 4096 cpu job then this is what qstat shows:
> 
> # qstat
> qstat: End of File
> 
> Of course the quick fix is to shorten the hostnames, and fortunately Rocks has
> shortname aliases of the form cx-y.  Using this convention in the nodes file
> makes the 4096 cpu job run fine, but with the current growth in cluster
> sizes it will not take long before even short-named clusters run into the
> same problem.

I just noticed this email.  Sorry I missed it the first time around.

Do you know roughly where the breakdown occurs?  I assume the job is submitted
correctly.  Is the scheduler able to run it?  Does the job actually start?  Is
this only a problem with qstat?
