[torquedev] First Torque impressions on Altix

Michel Béland michel.beland at rqchp.qc.ca
Wed Dec 10 11:11:24 MST 2008


Hello,

We installed at our site version 2.3.5 of Torque, compiled with
--enable-cpuset on an Altix 3300 with four processors and 8 GB. The
machine has two nodes with two processors each. The memory is also split
in two between these nodes. This test installation is to see if Torque
could replace PBS Pro on our production Altix machines. Here are my
comments on the installation process and basic usage.

- Firstly, the server would crash with a segmentation fault. Running
pbs_server through Totalview, the debugger in use here, we found out
that it crashed at line 780 of file node_func.c, in function
initialize_pbsnode:

  for (i = 0;pul[i];i++)

The problem is that pul here is a NULL pointer, so this cannot work.
This variable should point to the IP address (stored as unsigned long)
of the node under scrutiny. In our case, there was only one compute node
and it was the same machine running pbs_server. It took us a while to
figure out that the machine had an extra 127.0.0.2 address in file
/etc/hosts. This entry seems to have been added by SUSE. Commenting it
out made pbs_server happy.

My complaint about this is that pbs_server should not crash with a
segmentation fault in a case like this. It should at least give an
informative error message or run anyway, as all the other services did
on this machine. I would gladly suggest a patch, but I am not sure that
I know how to fix this problem.
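
For illustration only, a guard along the following lines would at least
turn the crash into an error that can be reported. This is a minimal
sketch, not a patch against node_func.c: the helper name
count_node_addrs and its return convention are hypothetical, and it
assumes that pul is meant to point to a zero-terminated list of
addresses, as the loop condition suggests.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of Torque: count the entries of a
 * zero-terminated address list, returning -1 instead of crashing
 * when the list pointer itself is NULL.
 */
int count_node_addrs(const unsigned long *pul)
{
    int i;

    if (pul == NULL)
        return -1;      /* caller can log an error and skip the node */

    for (i = 0; pul[i]; i++)
        ;               /* same loop condition as the quoted line */

    return i;
}
```

A caller could test for -1 and print a message naming the node whose
address could not be resolved, rather than dereferencing the pointer.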

- Secondly, pbs_sched would crash with a segmentation fault on startup.
It would try to read file resource_group, whose default settings are
unsuitable for our setup (groups and users that do not exist). Should
these defaults be commented out in this file? This problem would happen
even with option fair_share set to false. I know that Maui and Moab are
better, but the first contact between a new user and Torque is with
pbs_sched, so it ought to at least not crash. Also, I realized that the
calls to strtok are wrong in fairshare.c. The list of delimiters in the
second argument contains two spaces, while it should contain a space
and a tab. This might be the real cause of the crash that I faced. Did
someone go too far with copy and paste? I checked with a previous
version of Torque, namely 2.3.0, and the calls were correct. The same
problem occurs in parse.c and prime.c, which are also in
src/scheduler.cc/samples/fifo/.

In other parts of Torque, for example momctl/momctl.c, "\t" is used
instead of a real tab in the source code. I guess that fairshare.c,
parse.c and prime.c should use "\t" too.
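
The effect of the delimiter string can be seen with a small standalone
test. This is only a sketch: the helper count_fields is hypothetical and
is not taken from the Torque sources, and the sample line mentioned
below is made up.

```c
#include <string.h>

/*
 * Hypothetical helper, not taken from Torque: split a line in place on
 * the given delimiters and return how many fields were found.
 */
int count_fields(char *line, const char *delims)
{
    int n = 0;
    char *tok;

    for (tok = strtok(line, delims); tok != NULL; tok = strtok(NULL, delims))
        n++;

    return n;
}
```

With " \t" as delimiters, a tab-separated line such as "root\t40\t100"
gives three fields. With two spaces as delimiters, the whole line comes
back as a single token and the following strtok calls return NULL, which
could explain a crash if that result is used without being checked.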

- Thirdly, cpusets contain only cpu 0 when jobs are launched with -lncpus
instead of -lnodes. As -lncpus is intended for shared memory machines, I
think that it ought to work correctly. With older versions of PBS Pro,
which were probably more similar to Torque than current versions are,
using -lnodes on Altix did not work well: memory requests were ignored.
I do not know if this problem would appear with Torque, though.

- Fourthly, when I submit a sequential job followed by a 2-cpu job, the
first job gets a cpuset with cpu 0 and the second a cpuset with cpus 1
and 2. This is pretty annoying: the second job should get cpus 2 and 3
so that they are on the same node. In fact, if cpus 0 and 2 were busy, I
would expect the job to remain queued. I realize that this requires the
scheduler to have good knowledge of the cpusets. As I want to make this
work with pbs_sched or Maui because of budget constraints, the best way
to make this work for me is to make sure that all the jobs use one or
more complete nodes. As we already use a dummy qsub script with PBS Pro,
we might as well change it to make sure that all the jobs use complete
nodes for cpus and memory. No change would then be needed in the
scheduler.
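
The rounding that such a wrapper would have to apply can be sketched as
follows. This is only an illustration: the helper name is hypothetical,
and the constant assumes the two-processor nodes of the test machine
described above.

```c
/* Assumed topology of the test machine: two processors per node. */
#define CPUS_PER_NODE 2

/*
 * Hypothetical helper for a qsub wrapper: round a cpu request up to a
 * whole number of nodes. Returns 0 for a non-positive request.
 */
int round_up_to_full_nodes(int ncpus)
{
    int nodes;

    if (ncpus <= 0)
        return 0;

    /* Integer ceiling of ncpus / CPUS_PER_NODE. */
    nodes = (ncpus + CPUS_PER_NODE - 1) / CPUS_PER_NODE;
    return nodes * CPUS_PER_NODE;
}
```

With this rounding, a sequential job would request two cpus, so a
following 2-cpu job could no longer be given cpus from two different
nodes, assuming that cpus are handed out in increasing order.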

- Fifthly, the cpusets contain all the nodes for memory, instead of just
the nodes needed according to the memory request. I can probably change
Torque fairly easily to restrict the memory, provided that I use the
dummy qsub script described above.
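
If every job uses complete nodes, the memory nodes of a cpuset can be
derived directly from its cpus. The sketch below shows the idea. It is
not Torque code: the helper name is hypothetical, and it assumes that
cpus are numbered consecutively within each node, two per node, as on
the test machine.

```c
#include <limits.h>

/* Assumed topology of the test machine: two processors per node. */
#define CPUS_PER_NODE 2

/*
 * Hypothetical helper: build a bitmask of memory nodes from the list
 * of cpus assigned to a job. Bit n is set when at least one cpu of
 * node n is in the list. Cpus that are negative or whose node does
 * not fit in the mask are ignored.
 */
unsigned long mem_nodes_for_cpus(const int *cpus, int ncpus)
{
    const int max_nodes = (int)(sizeof(unsigned long) * CHAR_BIT);
    unsigned long mask = 0;
    int i;

    for (i = 0; i < ncpus; i++) {
        int node;

        if (cpus[i] < 0)
            continue;

        node = cpus[i] / CPUS_PER_NODE;
        if (node < max_nodes)
            mask |= 1UL << node;
    }

    return mask;
}
```

A job holding cpus 2 and 3 would then be restricted to memory node 1
only, instead of both nodes.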


-- 
Michel Béland, analyste en calcul scientifique
michel.beland at rqchp.qc.ca
bureau S-250, pavillon Roger-Gaudry (principal), Université de Montréal
téléphone   : 514 343-6111 poste 3892     télécopieur : 514 343-2155
RQCHP (Réseau québécois de calcul de haute performance)  www.rqchp.qc.ca


