[torqueusers] nodes file persistent gpus setting

Sreedhar Manchu sm4082 at nyu.edu
Thu Sep 12 10:52:12 MDT 2013


Hi,

Just yesterday I installed Torque 4.2.5 on a new GPU cluster and the issue mentioned by Simon (see the thread below) is still present in this version. Is there any fix/hack to get around this problem?

Whenever I restart pbs_server, it deletes the string 'gpus=8' from every line of the nodes file. Or, as Simon mentioned, simply running pbsnodes does the same.
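Until this is fixed upstream, one crude hack (a sketch only; the scratch path, node layout, and gpu count below are assumptions for illustration, not anything Torque provides) is to re-append the stripped gpus= token to every affected line after pbs_server or pbsnodes mangles the file:

```shell
# Work on a scratch copy first; point NODES at the real
# server_priv/nodes file only once you trust the edit.
NODES=/tmp/nodes.demo
printf 'node1 np=8 BLAH\nnode2 np=8 gpus=8\n' > "$NODES"

# Append " gpus=8" to every node line that does not already carry it.
sed -i '/gpus=/! s/$/ gpus=8/' "$NODES"
cat "$NODES"
```

After the edit both lines end in gpus=8; the same sed line could be run from a wrapper or init hook before starting pbs_server (again, an assumption about your setup).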

Sreedhar.


On Jul 4, 2012, at 9:52 PM, Simon Brennan <simon.brennan at ersa.edu.au> wrote:

> Sean (my colleague) and I have still been banging our head against the wall with this issue.
> 
> We've got torque 3.0.4 with gpu support enabled, Cuda 4.0.
> 
> After some testing on my local desktop and a 17-node GPU cluster (a mixture of Tesla and GTX cards), we've found that if the nodes file has entries with both gpus= and a node attribute, and you change the state of such a node (pbsnodes -o / pbsnodes -r), then for some crazy unknown reason the nodes file is rewritten: all gpus= tokens are removed, along with any # comments.
> Entries that have only a gpus= or only an attribute aren't affected; only nodes that have both.
> 
> Why is there even code in Torque (specifically pbs_server) that is capable of writing to the nodes file!!
> 
> Some examples.... BLAH is just a random node attribute
> 
> Test1
> -=-=-=-=-=-=-=
> nodes file contents:
> node1 np=1 gpus=2 BLAH
> node2 np=2
> 
> start torque server and mom.
> #pbsnodes -r node2    (File doesn't change)
> #pbsnodes -r node1     (File changes after command is run, stat on file confirms this)
> 
> nodes file contents
> node1 np=1 BLAH
> node2 np=2
> =-=-=-=-=-=-=
> 
> Test2
> -=-=-=-=-=-=-=
> nodes file contents:
> node1 np=1 gpus=2 BLAH
> node2 np=2 gpus=2
> 
> start torque server and mom.
> #pbsnodes -r node2    (File doesn't change)
> #pbsnodes -r node1     (File changes after command is run, stat on file confirms this)
> 
> nodes file contents
> node1 np=1 BLAH
> node2 np=2
> =-=-=-=-=-=-=
> 
> Test3
> -=-=-=-=-=-=-=
> nodes file contents:
> node1 np=1 BLAH
> node2 np=2 gpus=2
> 
> start torque server and mom.
> #pbsnodes -r node2    (File changes after command is run, stat on file confirms this)
> 
> nodes file contents
> node1 np=1 BLAH
> node2 np=2
> =-=-=-=-=-=-=
> 
> Test4
> -=-=-=-=-=-=-=
> nodes file contents:
> node1 np=1 gpus=2 
> node2 np=2 gpus=2
> 
> start torque server and mom.
> #pbsnodes -r node2    (File doesn't change)
> #pbsnodes -r node1    (File doesn't change)
> 
> nodes file contents
> node1 np=1 gpus=2
> node2 np=2 gpus=2
> =-=-=-=-=-=-=
> 
> Regards
> Simon Brennan
> 
> 
> -------- Original Message --------
> Subject:	Re: [torqueusers] nodes file persistent gpus setting
> Date:	Thu, 17 May 2012 15:50:09 +1000
> From:	<Gareth.Williams at csiro.au>
> Reply-To:	Torque Users Mailing List <torqueusers at supercluster.org>
> To:	<torqueusers at supercluster.org>
> 
> 
> Hi Sean, whoa – we are _not_ using the integrated nvidia gpu support (so far, anyway). Perhaps that wasn't actually the problem on your system – are you really sure that solved the problem and it wasn't just a coincidence? We have nvidia drivers on that compute node but no other nvidia software on this system.
>  
> Gareth
>  
> From: Sean Reilly [mailto:sean.reilly at ersa.edu.au] 
> Sent: Thursday, 17 May 2012 12:21 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] nodes file persistent gpus setting
>  
> Hi Gareth
> 
> We saw the same behaviour when we enabled the tdk-1.285 libraries in the ld.so.conf path on the GPU backend nodes.
> 
> - They are needed on the CPU (non-GPU) nodes.
> - But when added to the library path on the GPU nodes, pbs_mom complains about something missing (sorry, I can't remember what it was - it may have been some nvidia nvc/nvq type library).
>    - Then pbs_mom triggers a rewrite of the nodes file on the server side, removing the gpus= token or truncating the line from the point where 'gpus=' appears.
> 
> This was fixed by commenting out these libraries on the GPU backend nodes:
> 
> /etc/ld.so.conf.d/tdk.conf 
> #This file was made by puppet, do not edit it directly!
> #/opt/shared/tdk/1.285/lib64
> #/opt/shared/tdk/1.285/lib
> 
> 
> Regards
> Sean
> 
> 
> 
> On 17/05/12 05:56, Ken Nielson wrote:
> On Sun, Apr 1, 2012 at 7:36 PM, <Gareth.Williams at csiro.au> wrote:
> Hi,
> 
> Can anyone confirm the following behavior (bug)?
> 
> If you give a node gpus like so:
>  qmgr -c 'set node gpunode01 gpus = 2'
> or in the nodes file
>  gpunode01 np=12 gpus=2
> Then the node has (logical) gpus defined and they can be scheduled as in:
> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.5nodeconfig.php
> (though section 1.5.3 doesn't mention specifying both np= and gpus=, which I suspect needs fixing).
> 
> This setup works fine for us until we restart the pbs_server at which time the gpus disappear (you can see this in the output of pbsnodes). The nodes file gets altered to remove the gpus= setting.
> 
> Note that we are using version 3.0.3-snap.xxx and NOT the integrated nvidia gpu support.
> 
> Does anyone else see the behavior?  You don't need physical gpus to test, just a system you are prepared to mess with a little including restarting the pbs_server.
> 
> Regards,
> 
> Gareth
> 
> Gareth,
> 
> Have you entered a ticket in Bugzilla for this?
> 
> Ken
> 
>  
>  
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>  
> 
> -- 
> Sean Reilly
> 
> Systems Administrator & Applications Support Officer
> eResearchSA
> Phone : +61 8 8313 8352
> Mobile: +61 450 840 246
> 
> 
> 
> 


