[torqueusers] Re: [torquedev] First Torque impressions on Altix

Glen Beane glen.beane at gmail.com
Wed Dec 10 11:46:44 MST 2008


On Wed, Dec 10, 2008 at 1:11 PM, Michel Béland
<michel.beland at rqchp.qc.ca> wrote:
> Hello,
>
> We installed at our site version 2.3.5 of Torque, compiled with
> --enable-cpuset on an Altix 3300 with four processors and 8 GB. The
> machine has two nodes with two processors each. The memory is also split
> in two between these nodes. This test installation is to see if Torque
> could replace PBS Pro on our production Altix machines. Here are my
> comments on the installation process and basic usage.
>
> - Firstly, the server would do a segmentation fault. Running pbs_server
> through Totalview, the debugger in use here, we found out that it
> crashed in line 780 of file node_func.c, function initialize_pbsnode:
>
>  for (i = 0;pul[i];i++)
>
> The problem is that pul here is a NULL pointer, so this cannot work.
> This variable is an unsigned long that should contain the ip address of
> the node under scrutiny. In our case, there was only one compute node
> and it was the same machine running pbs_server. It took us a while to
> figure out that the machine had an extra 127.0.0.2 address in file
> /etc/hosts. This entry seems to have been added by SUSE. Commenting it
> out made pbs_server happy.
>
> My complaint about this is that pbs_server should not do a segmentation
> fault in a case like this. It should at least give an informative error
> message or run anyway as all the other services did on this machine. I
> would gladly suggest a patch, but I am not sure that I know how to fix
> this problem.
>
> - Secondly, pbs_sched would do a segmentation fault on startup. It would
> try to read file resource_group, which had a default setting unsuitable
> for our setup (groups and users that do not exist). Should this be
> commented out in this file? This problem would happen even with option
> fair_share set to false. I know that Maui and Moab are better, but the
> first contact between a new user and Torque is with pbs_sched, so it
> ought to at least not crash. Also, I realized that calls to strtok are
> wrong in fairshare.c. The list of delimiters in the second argument
> contains two spaces, while it should contain a space and a tab. This
> might be the real cause of the crash that I faced. Did someone go too
> far with copy and paste? I checked with a previous version of Torque,
> namely 2.3.0, and the calls were correct. The same problem happens in
> parse.c and prime.c, also in src/scheduler.cc/samples/fifo/.
>
> In other parts of Torque, for example momctl/momctl.c, "\t" is used
> instead of a real tab in the source code. I guess that fairshare.c,
> parse.c and prime.c should use "\t" too.

I'll make a quick comment on this...
I will go through and try to fix fairshare.c, parse.c, and prime.c in
SVN sometime today.

What happened is that fairly recently someone ran the Torque code through
astyle to apply a uniform source formatting style. This converted tabs to
spaces, and unfortunately some programmer long ago had put literal tabs in
strings instead of \t. This was caught and fixed in momctl.c, but
apparently not in the files you mention.
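
To illustrate why the delimiter string matters, here is a minimal sketch in
plain C. It is not the Torque code: count_tokens is a hypothetical helper, and
the sample line in the usage note is made up. It only shows how strtok behaves
when the delimiter set is two spaces rather than a space and "\t".

```c
#include <string.h>

/* Hypothetical helper, not part of Torque: count the tokens that
   strtok finds in `line` for a given delimiter set. strtok modifies
   its input, so we work on a local copy. */
static int count_tokens(const char *line, const char *delims)
{
    char buf[256];
    char *tok;
    int n = 0;

    strncpy(buf, line, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (tok = strtok(buf, delims); tok != NULL; tok = strtok(NULL, delims))
        n++;

    return n;
}
```

With a tab-separated line such as "usr1\t60\troot\t5", count_tokens(line, "  ")
sees a single token because the line contains no spaces, while
count_tokens(line, " \t") splits it into four. Writing "\t" explicitly also
keeps the delimiter safe from any tool that rewrites whitespace in the source.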


> - Thirdly, cpusets contain only cpu 0 when they are launched with -lncpus
> instead of -lnodes. As -lncpus is intended for shared memory machines, I
> think that it ought to work correctly. With older versions of PBS Pro,
> probably more similar to Torque then than it is today, using -lnodes on
> Altix did not work well: memory requests were ignored. I do not
> know if this problem would appear with Torque, though.
>
> - Fourthly, when I submit a sequential job followed by a 2-cpu job, the
> first job gets a cpuset with cpu 0 and the second a cpuset with cpus 1
> and 2. This is pretty annoying: the second job should get cpus 2 and 3
> so that they are on the same node. In fact, if cpus 0 and 2 were busy, I
> would expect the job to remain queued. I realize that this means good
> knowledge of the cpusets by the scheduler. As I want to make this work
> with pbs_sched or Maui because of budget constraints, the best way to
> make this work for me is to make sure that all the jobs use one or more
> complete nodes. As we already use a dummy qsub script with PBS Pro, we
> might as well change it to make sure that all the jobs use complete
> nodes for cpus and memory. No change would then be needed in the scheduler.
>
> - Fifthly, the cpusets contain all the nodes for memory, instead of just
> the nodes needed according to the memory request. I guess that I can
> probably easily change Torque to restrict the memory, provided that I
> use the dummy qsub script described above.
>
>
> --
> Michel Béland, analyste en calcul scientifique
> michel.beland at rqchp.qc.ca
> bureau S-250, pavillon Roger-Gaudry (principal), Université de Montréal
> téléphone   : 514 343-6111 poste 3892     télécopieur : 514 343-2155
> RQCHP (Réseau québécois de calcul de haute performance)  www.rqchp.qc.ca
>
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev
>
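
On the first item in the quoted message (the crash in the loop over pul), a
defensive check along these lines would avoid dereferencing a NULL pointer.
This is only a sketch under assumptions: count_addrs and its -1 return
convention are hypothetical and are not taken from node_func.c. It assumes the
address list is a zero-terminated array, as the quoted loop condition suggests.

```c
#include <stddef.h>

/* Hypothetical helper, not from Torque: count the entries in a
   zero-terminated address list. Returns -1 if the list pointer is
   NULL, so the caller can log an error instead of crashing. */
static int count_addrs(const unsigned long *pul)
{
    int i;

    if (pul == NULL)
        return -1;

    for (i = 0; pul[i]; i++)
        ;

    return i;
}
```

A caller could treat a negative result as "no address resolved for this node"
and print an informative message before deciding whether to continue.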

