[torqueusers] Job distribution does not do what it is supposed to do
Coyle, James J [ITACD]
jjc at iastate.edu
Thu Sep 15 14:29:54 MDT 2011
Vlad Popa,
I don't have users submit to a specific queue; I just have them specify the
resources they need and let a routing queue decide which queue to run them in.
You can do this in Maui or just plain old pbs_sched.
If your users specify queues, the qsub command would look something like:
qsub -q optshort -lnodes=2:ppn=16,walltime=1:00,vmem=32GB,pmem=2GB,mem=32GB ./script
If they specify resources instead, it would look like:
qsub -lnodes=2:ppn=16:opteron,walltime=1:00,vmem=32GB,pmem=3GB,mem=32GB ./script
I let the default queue be a routing queue:
set server default_queue = routing_queue
set queue routing_queue queue_type = Route
Set up routing from it into 5 queues:
set queue routing_queue route_destinations = optshort
set queue routing_queue route_destinations += gpushort
set queue routing_queue route_destinations += medium
set queue routing_queue route_destinations += large_short
set queue routing_queue route_destinations += large
And set all 5 queues to be from_route_only:
set queue large_short from_route_only = True
set queue large from_route_only = True
set queue medium from_route_only = True
set queue optshort from_route_only = True
set queue gpushort from_route_only = True
Then each job works down that list in order until it reaches a queue that
can satisfy all of its resource requirements, even node properties such as
:opteron or :gpunode.
I created a web page for my users. In this case I'd simply
have radio buttons for:
need gpus? no/yes
need opterons? no/yes
need I7? no/yes
Then the web page could generate the correct #PBS line.
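As a rough sketch of what the back end of such a page might do, the shell
function below turns the yes/no answers into a #PBS line. The function name
pbs_line and its argument order are hypothetical, made up for illustration;
the property names (gpunode, opteron, i7) are the ones shown in the pbsnodes
output quoted below.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): build a "#PBS -l" line
# from yes/no form answers.
# Usage: pbs_line NODES PPN NEED_GPU NEED_OPTERON NEED_I7
pbs_line() {
    spec="nodes=$1:ppn=$2"
    # Append one node property for every "yes" answer.
    if [ "$3" = yes ]; then spec="${spec}:gpunode"; fi
    if [ "$4" = yes ]; then spec="${spec}:opteron"; fi
    if [ "$5" = yes ]; then spec="${spec}:i7"; fi
    printf '#PBS -l %s\n' "$spec"
}

# Example: 2 nodes, 16 cores each, Opterons only.
pbs_line 2 16 no yes no
# prints: #PBS -l nodes=2:ppn=16:opteron
```

The user pastes the generated line into the job script and submits it without
-q, so the routing queue picks the destination.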
James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/
>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Vlad Popa
>Sent: Thursday, September 15, 2011 12:53 AM
>To: Torque Users Mailing List
>Subject: Re: [torqueusers] Job distribution does not do what it is
>supposed to do
>
>On 2011-09-14 22:39, Ken Nielson wrote:
>> ----- Original Message -----
>>> From: vlad at cosy.sbg.ac.at
>>> To: torqueusers at supercluster.org
>>> Sent: Wednesday, September 14, 2011 12:08:54 PM
>>> Subject: [torqueusers] Job distribution does not do what it is
>>> supposed to do
>>>
>>> Hi!
>>>
>>> I'm using Torque version 3.0.3-snap.201107121616.
>>> I have set up several queues on a cluster with some nodes containing
>>> GPUs and other nodes with Opteron CPUs.
>>>
>>> I have assigned the property "gpunode" to every node containing the
>>> Nvidia GPUs and "opteron" to every node with our Magny-Cours
>>> Opterons (which lack any GPUs). (Torque manual, subsection 4.1.4)
>>>
>>> One of my queues is called gpushort; its counterpart is
>>> optshort.
>>> Jobs should be directed to the GPU nodes when queued into "gpushort",
>>> and to the Opteron nodes when queued into "optshort".
>>>
>>> I'm now using Maui as the scheduler, but I also tried pbs_sched for a
>>> short time, with the same result.
>>>
>>> This is my output of pbsnodes:
>>>
>>> gpu01
>>> state = free
>>> np = 8
>>> properties = i7,i7-new,gpunode,16G
>>> ntype = cluster
>>> status = rectime=1316029218,varattr=,jobs=,state=free,netload=36006443357,gres=,loadave=4.00,ncpus=8,physmem=16315316kb,availmem=44122752kb,totmem=49083308kb,idletime=9431,nusers=1,nsessions=2,sessions=5046 32314,uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
>>> mom_service_port = 15002
>>> mom_manager_port = 15003
>>> gpus = 2
>>> gpu_status = gpu[1]=gpu_id=0000:06:00.0;,gpu[0]=gpu_id=0000:05:00.0;,driver_ver=280.13,timestamp=Wed Sep 14 19:45:19 2011
>>>
>>> gpu02
>>> state = free
>>> np = 8
>>> properties = i7,12G,gpunode
>>> ntype = cluster
>>> status = rectime=1316029233,varattr=,jobs=,state=free,netload=59511138356,gres=,loadave=3.99,ncpus=8,physmem=12187556kb,availmem=40056024kb,totmem=44955548kb,idletime=10142,nusers=0,nsessions=0,uname=Linux gpu02 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
>>> mom_service_port = 15002
>>> mom_manager_port = 15003
>>> gpus = 2
>>> gpu_status = gpu[1]=gpu_id=0000:08:00.0;,gpu[0]=gpu_id=0000:07:00.0;,driver_ver=280.13,timestamp=Wed Sep 14 15:41:53 2011
>>>
>>> gpu03
>>> state = free
>>> np = 8
>>> properties = fermi,12G,gpunode,i7
>>> ntype = cluster
>>> status = rectime=1316029202,varattr=,jobs=,state=free,netload=4100691397,gres=,loadave=4.00,ncpus=8,physmem=12189608kb,availmem=41308600kb,totmem=44957600kb,idletime=7941,nusers=0,nsessions=0,uname=Linux gpu03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
>>> mom_service_port = 15002
>>> mom_manager_port = 15003
>>> gpus = 1
>>> gpu_status = gpu[0]=gpu_id=0000:02:00.0;,driver_ver=280.13,timestamp=Wed Sep 14 15:43:29 2011
>>>
>>> gpu04
>>> state = free
>>> np = 8
>>> properties = i7,gpunode,12G
>>> ntype = cluster
>>> status = rectime=1316029210,varattr=,jobs=,state=free,netload=39234422480,gres=,loadave=4.00,ncpus=8,physmem=12187556kb,availmem=40432932kb,totmem=44955548kb,idletime=463010,nusers=0,nsessions=0,uname=Linux gpu04 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
>>> mom_service_port = 15002
>>> mom_manager_port = 15003
>>> gpus = 0
>>> gpu_status = driver_ver=UNKNOWN,timestamp=Wed Sep 14 21:46:43 2011
>>>
>>> ...
>>> (and so on until gpu07..)
>>>
>>> ..
>>> (and my Opterons, further below ..)
>>>
>>> hex03
>>> state = free
>>> np = 14
>>> properties = opteron
>>> ntype = cluster
>>> status = rectime=1316029213,varattr=,jobs=,state=free,netload=22449278232,gres=,loadave=0.04,ncpus=16,physmem=32877076kb,availmem=97422188kb,totmem=98413068kb,idletime=7007,nusers=0,nsessions=0,uname=Linux hex03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011 x86_64,opsys=linux
>>> mom_service_port = 15002
>>> mom_manager_port = 15003
>>> gpus = 0
>>>
>>> hex04
>>> state = free
>>> np = 14
>>> properties = opteron
>>> ntype = cluster
>>> status = rectime=1316029218,varattr=,jobs=,state=free,netload=72995554028,gres=,loadave=0.03,ncpus=16,physmem=32876308kb,availmem=83822708kb,totmem=98412300kb,idletime=7106,nusers=0,nsessions=0,uname=Linux hex04 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011 x86_64,opsys=linux
>>> mom_service_port = 15002
>>> mom_manager_port = 15003
>>> gpus = 0
>>>
>>> hex05
>>> state = free
>>> np = 14
>>> properties = opteron
>>> ntype = cluster
>>> status = rectime=1316029221,varattr=,jobs=,state=free,netload=101419420599,gres=,loadave=0.00,ncpus=16,physmem=32876308kb,availmem=83854984kb,totmem=98412300kb,idletime=791803,nusers=0,nsessions=0,uname=Linux hex05 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011 x86_64,opsys=linux
>>> mom_service_port = 15002
>>> mom_manager_port = 15003
>>> gpus = 0
>>>
>>> ...
>>> (until we get to hex14 ...)
>>> hex14
>>> state = free
>>> np = 14
>>> properties = opteron
>>> ntype = cluster
>>> status = rectime=1316030058,varattr=,jobs=,state=free,netload=24497857045,gres=,loadave=0.09,ncpus=16,physmem=32876308kb,availmem=83878088kb,totmem=98412300kb,idletime=706625,nusers=0,nsessions=0,uname=Linux hex14 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011 x86_64,opsys=linux
>>> mom_service_port = 15002
>>> mom_manager_port = 15003
>>> gpus = 0
>>>
>>> The qmgr configuration for the two queues:
>>>
>>> ...
>>> #
>>> # Create and define queue gpushort
>>> #
>>> create queue gpushort
>>> set queue gpushort queue_type = Execution
>>> set queue gpushort resources_default.neednodes = gpunode
>>> set queue gpushort resources_default.nodes = 1
>>> set queue gpushort resources_default.walltime = 24:00:00
>>> set queue gpushort enabled = True
>>> set queue gpushort started = True
>>> #
>>> # Create and define queue optshort
>>> #
>>> create queue optshort
>>> set queue optshort queue_type = Execution
>>> set queue optshort resources_default.neednodes = opteron
>>> set queue optshort resources_default.nodes = 1
>>> set queue optshort resources_default.walltime = 24:00:00
>>> set queue optshort enabled = True
>>> set queue optshort started = True
>>> #
>>> ...
>>>
>>> Now, if you submit jobs to gpushort, they get executed on the
>>> gpunodes (as they should be). If you submit jobs to optshort,
>>> they are supposed to be executed on the Opterons but instead
>>> end up on the first gpunode (gpu01) as well.
>>>
>>> How can I change this bad behaviour?
>>>
>>> I'm clueless...
>>>
>>> Any help appreciated..
>>>
>>> Greetings from Salzburg/Austria/Europe
>>>
>>> Vlad Popa
>>>
>>> University of Salzburg
>>> Computer Science /HPC Computing
>>> 5020 Salzburg
>>> Austria
>>> Europe
>> We need someone to modify Maui to support GPUs. pbs_sched does not
>> support GPUs either. Currently, only Moab knows about GPUs at the
>> scheduler level.
>Yes, that may be, but my jobs in the queues are still not directed to
>the nodes with the right properties. I don't think that would change if
>I chose different property names.
>