[torqueusers] Problem with nodes allocation
Roger Williams
R.Williams at gns.cri.nz
Tue Jul 1 17:30:33 MDT 2008
In recent months on this list there have been a couple of reports of
people getting only one node allocated regardless of the number of nodes
requested (with #PBS -l nodes=x). There was a report from Michael Marti in
March and a tantalising mail from Nicola Guida in June.
Can any of the Torque developers say whether the cause of this problem was
definitively identified and, if so, whether it has been fixed?
I see the same thing. I have a new(ish) Altix XE cluster (SLES 10) running
Torque 2.2.0. The cluster has 1 head node and 20 compute nodes, with
pbs_server and pbs_sched on the head node and pbs_mom on the compute nodes
(and not on the head node). The nodes have dual internal gigabit Ethernet
(the first for NFS, the second for Torque and MPI). The cluster interconnect
(passwordless ssh and rsh) works in all combinations.
Simple scripts with, e.g.,
#PBS -l nodes=5
produce a $PBS_NODEFILE with just the first of the 20 nodes.
However, if I specify nodes explicitly by name, e.g.,
#PBS -l nodes=cl1n010-gige+cl1n011-gige
then the nodes list is correct.
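For anyone wanting to reproduce this, here is a minimal sketch of a job
script that reports how many distinct hosts end up in $PBS_NODEFILE. The
helper name count_nodes is made up for illustration and is not part of
Torque; the sketch assumes only a POSIX shell with sort and grep.

```shell
#!/bin/sh
#PBS -l nodes=5

# count_nodes FILE: print the number of distinct, non-empty hostnames in FILE.
# (Hypothetical helper, not a Torque command.)
count_nodes() {
    sort -u "$1" | grep -c .
}

# Run the check only inside a job, where Torque sets PBS_NODEFILE.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "distinct nodes allocated: $(count_nodes "$PBS_NODEFILE")"
    cat "$PBS_NODEFILE"
fi
```

Duplicates are collapsed with sort -u because the node file repeats a host
once per processor when ppn is used. With nodes=5 I would expect the script
to report 5 distinct nodes; with the problem described here it reports 1.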
In the first example, I also see the same thing that Michael reported with
the output of 'qstat -n'. That is, the NDS column has the correct count
(matching nodes=) but only the first node is listed. I also concur with
his observation that "The ppn part of the node request however seems to work ok".
Can anyone say what is going on?
--
Roger Williams, GNS Science, New Zealand : www.gns.cri.nz : xyzzy