[torqueusers] nodes file persistent gpus setting
Simon Brennan
simon.brennan at ersa.edu.au
Wed Jul 4 19:52:54 MDT 2012
Sean (my colleague) and I are still banging our heads against the
wall with this issue.
We're running Torque 3.0.4 with GPU support enabled, and CUDA 4.0.
After some testing on my local desktop and a 17-node GPU cluster
(a mixture of Tesla cards and GTX cards) we've found the following: if
the nodes file has entries with both gpus= and an attribute, and you
change the state of such a node (pbsnodes -o / pbsnodes -r), then for
some unknown reason the nodes file is rewritten. All gpus= settings
are removed, along with any # comments.
Entries that have only a gpus= setting or only an attribute aren't
affected, only nodes that have both.
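To check an existing nodes file for entries that match this pattern, something like the following might help. It is only a rough sketch: it assumes the simple layout used in the examples in this message (node name first, then whitespace-separated key=value settings and bare-word attributes), and the function name entries_with_gpus_and_attribute is made up for illustration, not anything from Torque.

```python
def entries_with_gpus_and_attribute(nodes_text):
    """Return names of nodes whose line has both gpus= and a bare attribute.

    Assumes a simplified nodes-file layout:
        <name> [key=value | ATTRIBUTE] ...
    """
    flagged = []
    for raw in nodes_text.splitlines():
        # Drop anything after a '#' comment marker, then skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, *tokens = line.split()
        has_gpus = any(t.startswith("gpus=") for t in tokens)
        # Treat any token without '=' as a node attribute (assumption).
        has_attribute = any("=" not in t for t in tokens)
        if has_gpus and has_attribute:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    sample = "node1 np=1 gpus=2 BLAH\nnode2 np=2 gpus=2\n"
    print(entries_with_gpus_and_attribute(sample))
```

Run against the text of a nodes file, it lists the node names that carry both kinds of token on one line.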
Why is there even code in Torque (specifically pbs_server) that is
capable of writing to the nodes file?
Some examples follow. BLAH is just an arbitrary node attribute.
Test1
-=-=-=-=-=-=-=
nodes file contents:
node1 np=1 gpus=2 BLAH
node2 np=2
start torque server and mom.
#pbsnodes -r node2 (File doesn't change)
#pbsnodes -r node1 (File changes after command is run, stat on file
confirms this)
nodes file contents afterwards:
node1 np=1 BLAH
node2 np=2
=-=-=-=-=-=-=
Test2
-=-=-=-=-=-=-=
nodes file contents:
node1 np=1 gpus=2 BLAH
node2 np=2 gpus=2
start torque server and mom.
#pbsnodes -r node2 (File doesn't change)
#pbsnodes -r node1 (File changes after command is run, stat on file
confirms this)
nodes file contents afterwards:
node1 np=1 BLAH
node2 np=2
=-=-=-=-=-=-=
Test3
-=-=-=-=-=-=-=
nodes file contents:
node1 np=1 BLAH
node2 np=2 gpus=2
start torque server and mom.
#pbsnodes -r node2 (File changes after command is run, stat on file
confirms this)
nodes file contents afterwards:
node1 np=1 BLAH
node2 np=2
=-=-=-=-=-=-=
Test4
-=-=-=-=-=-=-=
nodes file contents:
node1 np=1 gpus=2
node2 np=2 gpus=2
start torque server and mom.
#pbsnodes -r node2 (File doesn't change)
#pbsnodes -r node1 (File doesn't change)
nodes file contents afterwards:
node1 np=1 gpus=2
node2 np=2 gpus=2
=-=-=-=-=-=-=
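The before/after comparisons in the tests above could be automated along these lines. This is a sketch under assumptions, not a Torque tool: parse_nodes and dropped_settings are hypothetical helper names, and the parser only understands the simple "name token token ..." layout used in these tests.

```python
def parse_nodes(text):
    """Map node name -> list of its settings/attributes (simplified format)."""
    nodes = {}
    for raw in text.splitlines():
        # Ignore '#' comments and blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, *tokens = line.split()
        nodes[name] = tokens
    return nodes


def dropped_settings(before_text, after_text):
    """Return {node: [tokens present before but missing afterwards]}."""
    before = parse_nodes(before_text)
    after = parse_nodes(after_text)
    dropped = {}
    for name, tokens in before.items():
        missing = [t for t in tokens if t not in after.get(name, [])]
        if missing:
            dropped[name] = missing
    return dropped


if __name__ == "__main__":
    # Contents taken from Test2 above.
    before = "node1 np=1 gpus=2 BLAH\nnode2 np=2 gpus=2\n"
    after = "node1 np=1 BLAH\nnode2 np=2\n"
    print(dropped_settings(before, after))
```

Feeding it a saved copy of the nodes file and the current contents shows which settings disappeared from which node.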
Regards
Simon Brennan
-------- Original Message --------
Subject: Re: [torqueusers] nodes file persistent gpus setting
Date: Thu, 17 May 2012 15:50:09 +1000
From: <Gareth.Williams at csiro.au>
Reply-To: Torque Users Mailing List <torqueusers at supercluster.org>
To: <torqueusers at supercluster.org>
Hi Sean,
Whoa, we are *not* using the integrated NVIDIA GPU support (so far
anyway). Perhaps that wasn't actually the problem on your system. Are
you really sure that solved the problem and was not just a
coincidence? We have NVIDIA drivers (on that compute node) but no other
NVIDIA software on this system.
Gareth
*From:* Sean Reilly [mailto:sean.reilly at ersa.edu.au]
*Sent:* Thursday, 17 May 2012 12:21 PM
*To:* Torque Users Mailing List
*Subject:* Re: [torqueusers] nodes file persistent gpus setting
Hi Gareth,
We saw the same behaviour when we enabled the tdk-1.285 libraries on the
GPU backend nodes in the ld.so.conf path.
- They are needed on the CPU (non-GPU) nodes.
- But when they are added to the path on the GPU nodes, pbs_mom complains
about something missing (sorry, I can't remember what it was, but it may
have been some nvidia or nvc nvq type library).
- pbs_mom then rewrites the nodes file on the server side, removing the
gpus= setting or truncating the line from where 'gpus=' is written.
This was fixed by commenting out these libs on the GPU backend node:
/etc/ld.so.conf.d/tdk.conf
#This file was made by puppet, do not edit it directly!
#/opt/shared/tdk/1.285/lib64
#/opt/shared/tdk/1.285/lib
Regards
Sean
On 17/05/12 05:56, Ken Nielson wrote:
On Sun, Apr 1, 2012 at 7:36 PM, <Gareth.Williams at csiro.au> wrote:
Hi,
Can anyone confirm the following behavior (bug)?
If you give a node gpus like so:
qmgr -c 'set node gpunode01 gpus = 2'
or in the nodes file
gpunode01 np=12 gpus=2
Then the node has (logical) gpus defined and they can be scheduled as in:
http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.5nodeconfig.php
(though 1.5.3 doesn't mention specifying both np= and gpus= which I
suspect needs fixing).
This setup works fine for us until we restart the pbs_server at which
time the gpus disappear (you can see this in the output of pbsnodes).
The nodes file gets altered to remove the gpus= setting.
Note that we are using version 3.0.3-snap.xxx and NOT the integrated
nvidia gpu support.
Does anyone else see the behavior? You don't need physical gpus to
test, just a system you are prepared to mess with a little including
restarting the pbs_server.
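One way to tell whether a restart touched the nodes file is to fingerprint it before and after. A minimal sketch, assuming you pass in the path to your own nodes file (the location depends on the install, so none is hard-coded here); file_fingerprint and content_changed are hypothetical helpers, not part of Torque:

```python
import hashlib
import os


def file_fingerprint(path):
    """Return (size, mtime_ns, sha256 hex digest) for the file at path."""
    st = os.stat(path)
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    return st.st_size, st.st_mtime_ns, digest


def content_changed(before, after):
    """True if the content hash differs between two fingerprints."""
    return before[2] != after[2]
```

Take one fingerprint before restarting pbs_server and one after; if content_changed returns True, the file was rewritten with different contents, and the mtime field shows when.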
Regards,
Gareth
Gareth,
Have you entered a ticket in Bugzilla for this?
Ken
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org <mailto:torqueusers at supercluster.org>
http://www.supercluster.org/mailman/listinfo/torqueusers
--
*Sean Reilly*
Systems Administrator & Applications Support Officer
eResearchSA
Phone : +61 8 8313 8352
Mobile: +61 450 840 246
<http://www.ersa.edu.au/moving>