[torqueusers] Job distribution does not do what it is supposed to do

Ken Nielson knielson at adaptivecomputing.com
Wed Sep 14 14:39:39 MDT 2011


----- Original Message -----
> From: vlad at cosy.sbg.ac.at
> To: torqueusers at supercluster.org
> Sent: Wednesday, September 14, 2011 12:08:54 PM
> Subject: [torqueusers] Job distribution does not do what it is supposed to do
> 
> Hi!
> 
> I'm using Torque version 3.0.3-snap.201107121616.
> I have set up several queues on a cluster: some nodes contain GPUs,
> the others have Opteron CPUs.
> 
> I have assigned the property "gpunode" to every node containing
> Nvidia GPUs, and "opteron" to every node with our Magny Cours
> Opterons (which have no GPUs). (See subsection 4.1.4 of the Torque
> manual.)
> 
> One of my queues is called gpushort; its counterpart is optshort.
> Jobs should be directed to the GPU nodes when queued into gpushort,
> and to the Opteron nodes when queued into optshort.
> 
> I'm currently using Maui as the scheduler, but I have also briefly
> tried pbs_sched, with the same result.
> 
> This is my output of pbsnodes:
> 
> gpu01
>      state = free
>      np = 8
>      properties = i7,i7-new,gpunode,16G
>      ntype = cluster
>      status =
> rectime=1316029218,varattr=,jobs=,state=free,netload=36006443357,gres=,loadave=4.00,ncpus=8,physmem=16315316kb,availmem=44122752kb,totmem=49083308kb,idletime=9431,nusers=1,nsessions=2,sessions=5046
> 32314,uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15
> 09:29:38 EDT 2011 x86_64,opsys=linux
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 2
>      gpu_status =
> gpu[1]=gpu_id=0000:06:00.0;,gpu[0]=gpu_id=0000:05:00.0;,driver_ver=280.13,timestamp=Wed
> Sep 14 19:45:19 2011
> 
> gpu02
>      state = free
>      np = 8
>      properties = i7,12G,gpunode
>      ntype = cluster
>      status =
> rectime=1316029233,varattr=,jobs=,state=free,netload=59511138356,gres=,loadave=3.99,ncpus=8,physmem=12187556kb,availmem=40056024kb,totmem=44955548kb,idletime=10142,nusers=0,nsessions=0,uname=Linux
> gpu02 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
> x86_64,opsys=linux
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 2
>      gpu_status =
> gpu[1]=gpu_id=0000:08:00.0;,gpu[0]=gpu_id=0000:07:00.0;,driver_ver=280.13,timestamp=Wed
> Sep 14 15:41:53 2011
> 
> gpu03
>      state = free
>      np = 8
>      properties = fermi,12G,gpunode,i7
>      ntype = cluster
>      status =
> rectime=1316029202,varattr=,jobs=,state=free,netload=4100691397,gres=,loadave=4.00,ncpus=8,physmem=12189608kb,availmem=41308600kb,totmem=44957600kb,idletime=7941,nusers=0,nsessions=0,uname=Linux
> gpu03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
> x86_64,opsys=linux
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 1
>      gpu_status =
> gpu[0]=gpu_id=0000:02:00.0;,driver_ver=280.13,timestamp=Wed Sep 14
> 15:43:29 2011
> 
> gpu04
>      state = free
>      np = 8
>      properties = i7,gpunode,12G
>      ntype = cluster
>      status =
> rectime=1316029210,varattr=,jobs=,state=free,netload=39234422480,gres=,loadave=4.00,ncpus=8,physmem=12187556kb,availmem=40432932kb,totmem=44955548kb,idletime=463010,nusers=0,nsessions=0,uname=Linux
> gpu04 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
> x86_64,opsys=linux
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 0
>      gpu_status = driver_ver=UNKNOWN,timestamp=Wed Sep 14 21:46:43
>      2011
> 
> ...
> (and so on, up to gpu07)
> 
> ...
> (and my Opterons, further below:)
> 
> hex03
>      state = free
>      np = 14
>      properties = opteron
>      ntype = cluster
>      status =
> rectime=1316029213,varattr=,jobs=,state=free,netload=22449278232,gres=,loadave=0.04,ncpus=16,physmem=32877076kb,availmem=97422188kb,totmem=98413068kb,idletime=7007,nusers=0,nsessions=0,uname=Linux
> hex03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
> x86_64,opsys=linux
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 0
> 
> hex04
>      state = free
>      np = 14
>      properties = opteron
>      ntype = cluster
>      status =
> rectime=1316029218,varattr=,jobs=,state=free,netload=72995554028,gres=,loadave=0.03,ncpus=16,physmem=32876308kb,availmem=83822708kb,totmem=98412300kb,idletime=7106,nusers=0,nsessions=0,uname=Linux
> hex04 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
> x86_64,opsys=linux
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 0
> 
> hex05
>      state = free
>      np = 14
>      properties = opteron
>      ntype = cluster
>      status =
> rectime=1316029221,varattr=,jobs=,state=free,netload=101419420599,gres=,loadave=0.00,ncpus=16,physmem=32876308kb,availmem=83854984kb,totmem=98412300kb,idletime=791803,nusers=0,nsessions=0,uname=Linux
> hex05 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
> x86_64,opsys=linux
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 0
> 
> ...
> (until  we get to hex14 ...)
> hex14
>      state = free
>      np = 14
>      properties = opteron
>      ntype = cluster
>      status =
> rectime=1316030058,varattr=,jobs=,state=free,netload=24497857045,gres=,loadave=0.09,ncpus=16,physmem=32876308kb,availmem=83878088kb,totmem=98412300kb,idletime=706625,nusers=0,nsessions=0,uname=Linux
> hex14 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
> x86_64,opsys=linux
>      mom_service_port = 15002
>      mom_manager_port = 15003
>      gpus = 0
> 
> The qmgr configuration for the two queues:
> 
> ...
> #
> # Create and define queue gpushort
> #
> create queue gpushort
> set queue gpushort queue_type = Execution
> set queue gpushort resources_default.neednodes = gpunode
> set queue gpushort resources_default.nodes = 1
> set queue gpushort resources_default.walltime = 24:00:00
> set queue gpushort enabled = True
> set queue gpushort started = True
> #
> # Create and define queue optshort
> #
> create queue optshort
> set queue optshort queue_type = Execution
> set queue optshort resources_default.neednodes = opteron
> set queue optshort resources_default.nodes = 1
> set queue optshort resources_default.walltime = 24:00:00
> set queue optshort enabled = True
> set queue optshort started = True
> #
> ...
> 
> Now, if I submit jobs to gpushort, they are executed on the GPU
> nodes, as they should be. If I submit jobs to optshort, they are
> supposed to be executed on the Opterons, but instead they also end
> up running on the first GPU node (gpu01).
> 
> How can I fix this behaviour?
> 
> I'm clueless...
> 
> Any help appreciated.
> 
> Greetings from Salzburg/Austria/Europe
> 
> Vlad Popa
> 
> University of Salzburg
> Computer Science /HPC Computing
> 5020 Salzburg
> Austria
> Europe

We need someone to modify Maui to support GPUs; pbs_sched does not support them either. Currently, only Moab knows about GPUs at the scheduler level.

Ken Nielson
Adaptive Computing
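For reference, a commonly suggested workaround on the Torque side, independent of scheduler GPU support, is to request the node property explicitly at submission time rather than relying on the queue's resources_default.neednodes: the property name given in -l nodes=... is matched against the "properties" line shown by pbsnodes. A sketch (the job script name "myjob.sh" is a placeholder):

```shell
# Request one node carrying the "opteron" property explicitly,
# so the job cannot land on a GPU node:
qsub -q optshort -l nodes=1:opteron myjob.sh

# Likewise, pin GPU jobs to nodes carrying the "gpunode" property:
qsub -q gpushort -l nodes=1:gpunode myjob.sh
```

Whether the scheduler honors this depends on its node-feature handling, so it is worth testing against the Maui version in use.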
