[torqueusers] nodes file persistent gpus setting

Simon Brennan simon.brennan at ersa.edu.au
Wed Jul 4 19:52:54 MDT 2012


Sean (my colleague) and I are still banging our heads against the 
wall with this issue.

We've got Torque 3.0.4 with GPU support enabled, and CUDA 4.0.

After some testing on my local desktop and a 17-node GPU cluster 
(a mixture of Tesla cards and GTX cards) we've found the following. If you 
have a nodes file with gpus= settings and attributes, and you change the 
state of a node (pbsnodes -o / pbsnodes -r) that has both gpus= and an 
attribute, then for some crazy unknown reason the nodes file is rewritten: 
all gpus= settings are removed, along with any # comments.
Entries that have only a gpus= or only an attribute aren't affected, only 
nodes that have both.
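
To see which entries in a given nodes file match the combination described above (a gpus= setting plus at least one attribute), a minimal Python sketch follows. It assumes the usual nodes file layout of a node name followed by whitespace-separated key=value settings and bare attribute words; the function names are made up for illustration and are not part of Torque.

```python
def parse_nodes_line(line):
    """Split one nodes-file line into (name, settings, attributes).

    Returns None for blank lines and lines that are only a # comment.
    """
    text = line.split("#", 1)[0].strip()
    if not text:
        return None
    name, *tokens = text.split()
    settings = {}
    attributes = []
    for token in tokens:
        if "=" in token:
            # key=value settings such as np=1 or gpus=2
            key, value = token.split("=", 1)
            settings[key] = value
        else:
            # bare words such as BLAH are treated as node attributes
            attributes.append(token)
    return name, settings, attributes


def entries_with_gpus_and_attribute(lines):
    """Return the names of nodes that have both gpus= and an attribute."""
    flagged = []
    for line in lines:
        parsed = parse_nodes_line(line)
        if parsed is None:
            continue
        name, settings, attributes = parsed
        if "gpus" in settings and attributes:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    sample = ["# test cluster", "node1 np=1 gpus=2 BLAH", "node2 np=2 gpus=2"]
    print(entries_with_gpus_and_attribute(sample))
```

This only reads text; it does not touch the server or the real nodes file.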

Why is there even code in Torque (specifically pbs_server) that is 
capable of writing to the nodes file?!

Some examples (BLAH is just an arbitrary node attribute):

Test1
-=-=-=-=-=-=-=
nodes file contents:
node1 np=1 gpus=2 BLAH
node2 np=2

start torque server and mom.
#pbsnodes -r node2    (File doesn't change)
#pbsnodes -r node1     (File changes after command is run, stat on file 
confirms this)

nodes file contents
node1 np=1 BLAH
node2 np=2
=-=-=-=-=-=-=

Test2
-=-=-=-=-=-=-=
nodes file contents:
node1 np=1 gpus=2 BLAH
node2 np=2 gpus=2

start torque server and mom.
#pbsnodes -r node2    (File doesn't change)
#pbsnodes -r node1     (File changes after command is run, stat on file 
confirms this)

nodes file contents
node1 np=1 BLAH
node2 np=2
=-=-=-=-=-=-=

Test3
-=-=-=-=-=-=-=
nodes file contents:
node1 np=1 BLAH
node2 np=2 gpus=2

start torque server and mom.
#pbsnodes -r node2    (File changes after command is run, stat on file 
confirms this)

nodes file contents
node1 np=1 BLAH
node2 np=2
=-=-=-=-=-=-=

Test4
-=-=-=-=-=-=-=
nodes file contents:
node1 np=1 gpus=2
node2 np=2 gpus=2

start torque server and mom.
#pbsnodes -r node2    (File doesn't change)
#pbsnodes -r node1    (File doesn't change)

nodes file contents
node1 np=1 gpus=2
node2 np=2 gpus=2
=-=-=-=-=-=-=
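
The before/after comparisons in the tests above can also be done mechanically. The following Python sketch is one way to do it, assuming you have saved the nodes file contents as text before and after running pbsnodes; the helper names are hypothetical and the comparison is a plain token check, not anything Torque provides.

```python
def index_nodes(text):
    """Map node name -> list of tokens after the name, ignoring # comments."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if fields:
            table[fields[0]] = fields[1:]
    return table


def lost_tokens(before, after):
    """Return {node: [tokens present before but missing after]}."""
    old, new = index_nodes(before), index_nodes(after)
    lost = {}
    for name, tokens in old.items():
        kept = new.get(name, [])
        missing = [token for token in tokens if token not in kept]
        if missing:
            lost[name] = missing
    return lost


if __name__ == "__main__":
    # Contents taken from Test2 above.
    before = "node1 np=1 gpus=2 BLAH\nnode2 np=2 gpus=2\n"
    after = "node1 np=1 BLAH\nnode2 np=2\n"
    print(lost_tokens(before, after))
```

An empty result means no settings were dropped, which is what Test4 shows. In practice the two strings would be read from copies of the server's nodes file, whose location depends on how Torque was installed.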

Regards
Simon Brennan


-------- Original Message --------
Subject: 	Re: [torqueusers] nodes file persistent gpus setting
Date: 	Thu, 17 May 2012 15:50:09 +1000
From: 	<Gareth.Williams at csiro.au>
Reply-To: 	Torque Users Mailing List <torqueusers at supercluster.org>
To: 	<torqueusers at supercluster.org>



Hi Sean, whoa -- we are *not* using the integrated nvidia gpu support 
(so far anyway).  Perhaps that wasn't actually the problem on your 
system -- are you really sure that solved the problem and was not just a 
coincidence? We have nvidia drivers (on that compute node) but no other 
nvidia software on this system.

Gareth

*From:*Sean Reilly [mailto:sean.reilly at ersa.edu.au]
*Sent:* Thursday, 17 May 2012 12:21 PM
*To:* Torque Users Mailing List
*Subject:* Re: [torqueusers] nodes file persistent gpus setting

Hi Gareth

We saw the same behaviour when we enabled the tdk-1.285 libraries on the 
GPU backend nodes in the ld.so.conf path.

- It is needed on the CPU (non-GPU) nodes.
- But when added to the library path on the GPU nodes, the pbs_mom complains 
about something missing (*sorry, I can't remember what it was, but it may 
have been some nvidia or nvc nvq type library*).
    - Then the pbs_mom rewrites the nodes file on the server side,
       *removing the gpus= or truncating the line from where 'gpus=' 
is written*.

This was fixed by commenting out these libs on the GPU backend nodes.

/etc/ld.so.conf.d/tdk.conf
#This file was made by puppet, do not edit it directly!
#/opt/shared/tdk/1.285/lib64
#/opt/shared/tdk/1.285/lib
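
To double-check that a file like the one above no longer contributes any library directories, here is a small Python sketch. It assumes the usual ld.so.conf conventions (one directory per line, # starting a comment, optional include lines); the function name is made up for illustration.

```python
def active_library_dirs(conf_text):
    """Return the directories in ld.so.conf-style text that are not commented out."""
    dirs = []
    for line in conf_text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue  # blank line or comment only
        if entry.split()[0] == "include":
            continue  # include directives point at other files, not directories
        dirs.append(entry)
    return dirs


if __name__ == "__main__":
    tdk_conf = (
        "#This file was made by puppet, do not edit it directly!\n"
        "#/opt/shared/tdk/1.285/lib64\n"
        "#/opt/shared/tdk/1.285/lib\n"
    )
    print(active_library_dirs(tdk_conf))
```

With both directories commented out, as in the file above, the list comes back empty.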


Regards
Sean



On 17/05/12 05:56, Ken Nielson wrote:

On Sun, Apr 1, 2012 at 7:36 PM, <Gareth.Williams at csiro.au> wrote:

Hi,

Can anyone confirm the following behavior (bug)?

If you give a node gpus like so:
  qmgr -c 'set node gpunode01 gpus = 2'
or in the nodes file
  gpunode01 np=12 gpus=2
Then the node has (logical) gpus defined and they can be scheduled as in:
http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.5nodeconfig.php
(though 1.5.3 doesn't mention specifying both np= and gpus= which I 
suspect needs fixing).

This setup works fine for us until we restart the pbs_server at which 
time the gpus disappear (you can see this in the output of pbsnodes). 
The nodes file gets altered to remove the gpus= setting.

Note that we are using version 3.0.3-snap.xxx and NOT the integrated 
nvidia gpu support.

Does anyone else see the behavior?  You don't need physical gpus to 
test, just a system you are prepared to mess with a little including 
restarting the pbs_server.

Regards,

Gareth


Gareth,

Have you entered a ticket in bugzilla for this?

Ken

  

  

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

-- 
*Sean Reilly*

Systems Administrator & Applications Support Officer
eResearchSA
Phone : +61 8 8313 8352
Mobile: +61 450 840 246

<http://www.ersa.edu.au/moving>



