[torqueusers] Job distribution does not do what it is supposed to do
vlad at cosy.sbg.ac.at
Wed Sep 14 12:08:54 MDT 2011
Hi!
I'm using Torque version 3.0.3-snap.201107121616.
I have set up several queues on a cluster in which some nodes contain GPUs
and the other nodes have Opteron CPUs.
I have assigned the property "gpunode" to every node containing
Nvidia GPUs and "opteron" to every node with our Magny-Cours Opterons
(which have no GPUs), following subsection 4.1.4 of the Torque manual.
One of my queues is called gpushort; its counterpart is optshort.
Jobs queued into gpushort should be directed to the GPU nodes, and
jobs queued into optshort should go to the Opteron nodes.
I'm currently using Maui as the scheduler, but I also tried pbs_sched
for a short time, with the same result.
This is my output of pbsnodes:
gpu01
state = free
np = 8
properties = i7,i7-new,gpunode,16G
ntype = cluster
status =
rectime=1316029218,varattr=,jobs=,state=free,netload=36006443357,gres=,loadave=4.00,ncpus=8,physmem=16315316kb,availmem=44122752kb,totmem=49083308kb,idletime=9431,nusers=1,nsessions=2,sessions=5046
32314,uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15
09:29:38 EDT 2011 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 2
gpu_status =
gpu[1]=gpu_id=0000:06:00.0;,gpu[0]=gpu_id=0000:05:00.0;,driver_ver=280.13,timestamp=Wed
Sep 14 19:45:19 2011
gpu02
state = free
np = 8
properties = i7,12G,gpunode
ntype = cluster
status =
rectime=1316029233,varattr=,jobs=,state=free,netload=59511138356,gres=,loadave=3.99,ncpus=8,physmem=12187556kb,availmem=40056024kb,totmem=44955548kb,idletime=10142,nusers=0,nsessions=0,uname=Linux
gpu02 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 2
gpu_status =
gpu[1]=gpu_id=0000:08:00.0;,gpu[0]=gpu_id=0000:07:00.0;,driver_ver=280.13,timestamp=Wed
Sep 14 15:41:53 2011
gpu03
state = free
np = 8
properties = fermi,12G,gpunode,i7
ntype = cluster
status =
rectime=1316029202,varattr=,jobs=,state=free,netload=4100691397,gres=,loadave=4.00,ncpus=8,physmem=12189608kb,availmem=41308600kb,totmem=44957600kb,idletime=7941,nusers=0,nsessions=0,uname=Linux
gpu03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 1
gpu_status =
gpu[0]=gpu_id=0000:02:00.0;,driver_ver=280.13,timestamp=Wed Sep 14
15:43:29 2011
gpu04
state = free
np = 8
properties = i7,gpunode,12G
ntype = cluster
status =
rectime=1316029210,varattr=,jobs=,state=free,netload=39234422480,gres=,loadave=4.00,ncpus=8,physmem=12187556kb,availmem=40432932kb,totmem=44955548kb,idletime=463010,nusers=0,nsessions=0,uname=Linux
gpu04 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
gpu_status = driver_ver=UNKNOWN,timestamp=Wed Sep 14 21:46:43 2011
...
(and so on until gpu07..)
..
(and my Opterons, further below ..)
hex03
state = free
np = 14
properties = opteron
ntype = cluster
status =
rectime=1316029213,varattr=,jobs=,state=free,netload=22449278232,gres=,loadave=0.04,ncpus=16,physmem=32877076kb,availmem=97422188kb,totmem=98413068kb,idletime=7007,nusers=0,nsessions=0,uname=Linux
hex03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
hex04
state = free
np = 14
properties = opteron
ntype = cluster
status =
rectime=1316029218,varattr=,jobs=,state=free,netload=72995554028,gres=,loadave=0.03,ncpus=16,physmem=32876308kb,availmem=83822708kb,totmem=98412300kb,idletime=7106,nusers=0,nsessions=0,uname=Linux
hex04 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
hex05
state = free
np = 14
properties = opteron
ntype = cluster
status =
rectime=1316029221,varattr=,jobs=,state=free,netload=101419420599,gres=,loadave=0.00,ncpus=16,physmem=32876308kb,availmem=83854984kb,totmem=98412300kb,idletime=791803,nusers=0,nsessions=0,uname=Linux
hex05 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
...
(until we get to hex14 ...)
hex14
state = free
np = 14
properties = opteron
ntype = cluster
status =
rectime=1316030058,varattr=,jobs=,state=free,netload=24497857045,gres=,loadave=0.09,ncpus=16,physmem=32876308kb,availmem=83878088kb,totmem=98412300kb,idletime=706625,nusers=0,nsessions=0,uname=Linux
hex14 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
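To double-check which nodes carry which property, the listing can be grouped with a few lines of Python. This is only a minimal sketch: it assumes the usual pbsnodes text layout (node name in column 0, attribute lines indented), which the mail wrapping above has flattened, and the function names parse_pbsnodes and nodes_with are made up for illustration. The sample is a hand-trimmed excerpt of two nodes from the listing.

```python
def parse_pbsnodes(text):
    """Map node name -> list of properties from `pbsnodes` text output.

    Assumption: node names start in column 0 and attribute lines are
    indented, as in unwrapped pbsnodes output.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank line between node records
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            nodes[current] = []
        elif current is not None:
            key, sep, value = line.strip().partition(" = ")
            if sep and key == "properties":
                nodes[current] = value.split(",")
    return nodes


def nodes_with(nodes, prop):
    """Return the sorted names of all nodes that carry the given property."""
    return sorted(name for name, props in nodes.items() if prop in props)


# Trimmed excerpt of the listing above (status lines omitted).
SAMPLE = """\
gpu01
     state = free
     np = 8
     properties = i7,i7-new,gpunode,16G
     ntype = cluster

hex03
     state = free
     np = 14
     properties = opteron
     ntype = cluster
"""

if __name__ == "__main__":
    parsed = parse_pbsnodes(SAMPLE)
    print("gpunode:", nodes_with(parsed, "gpunode"))
    print("opteron:", nodes_with(parsed, "opteron"))
```

On the real cluster the text would come from running pbsnodes rather than from the embedded sample.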
The qmgr configuration for the two queues:
...
#
# Create and define queue gpushort
#
create queue gpushort
set queue gpushort queue_type = Execution
set queue gpushort resources_default.neednodes = gpunode
set queue gpushort resources_default.nodes = 1
set queue gpushort resources_default.walltime = 24:00:00
set queue gpushort enabled = True
set queue gpushort started = True
#
# Create and define queue optshort
#
create queue optshort
set queue optshort queue_type = Execution
set queue optshort resources_default.neednodes = opteron
set queue optshort resources_default.nodes = 1
set queue optshort resources_default.walltime = 24:00:00
set queue optshort enabled = True
set queue optshort started = True
#
...
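What I expect from these settings can be written down as a small Python sketch: read the "set queue" lines, take each queue's resources_default.neednodes value, and list the nodes that carry that property. The helper names parse_queue_attrs and intended_nodes are hypothetical, the node table is copied by hand from the pbsnodes listing (two nodes only), and the sketch describes my intention, not what the scheduler actually does.

```python
# Excerpt of the qmgr settings shown above.
QMGR_EXCERPT = """\
create queue gpushort
set queue gpushort queue_type = Execution
set queue gpushort resources_default.neednodes = gpunode
set queue gpushort resources_default.nodes = 1
create queue optshort
set queue optshort queue_type = Execution
set queue optshort resources_default.neednodes = opteron
set queue optshort resources_default.nodes = 1
"""

# Node properties copied by hand from the pbsnodes listing (two nodes only).
NODE_PROPS = {
    "gpu01": ["i7", "i7-new", "gpunode", "16G"],
    "hex03": ["opteron"],
}


def parse_queue_attrs(qmgr_text):
    """Collect 'set queue <name> <attribute> = <value>' lines per queue."""
    queues = {}
    for line in qmgr_text.splitlines():
        parts = line.split()
        if len(parts) >= 6 and parts[:2] == ["set", "queue"] and parts[4] == "=":
            queues.setdefault(parts[2], {})[parts[3]] = " ".join(parts[5:])
    return queues


def intended_nodes(queues, node_props):
    """For each queue, list the nodes whose properties include its neednodes value."""
    result = {}
    for name, attrs in queues.items():
        wanted = attrs.get("resources_default.neednodes")
        result[name] = sorted(
            node for node, props in node_props.items() if wanted in props
        )
    return result


if __name__ == "__main__":
    for queue, nodes in intended_nodes(parse_queue_attrs(QMGR_EXCERPT), NODE_PROPS).items():
        print(queue, "->", nodes)
```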
Now, if you submit jobs to gpushort, they are executed on the GPU nodes
(as they should be). Jobs submitted to optshort are supposed to run on
the Opterons, but instead they end up on the first GPU node (gpu01)
as well.
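The symptom can be expressed as a check in Python. This is a rough sketch under assumptions: the job IDs are invented placeholders, the exec_host strings follow the host/slot form joined by "+", and misplaced_jobs is a hypothetical helper, not part of Torque or Maui.

```python
# Property each queue is meant to require (from the qmgr settings above).
QUEUE_PROPERTY = {"gpushort": "gpunode", "optshort": "opteron"}

# Node properties copied by hand from the pbsnodes listing (two nodes only).
NODE_PROPS = {
    "gpu01": ["i7", "i7-new", "gpunode", "16G"],
    "hex03": ["opteron"],
}


def misplaced_jobs(jobs, queue_property, node_props):
    """Return (job_id, host) pairs where the host lacks the queue's property.

    `jobs` maps job ID -> (queue name, exec_host string such as "gpu01/0+gpu01/1").
    """
    bad = []
    for job_id, (queue, exec_host) in jobs.items():
        wanted = queue_property[queue]
        # Keep only the host part of each "host/slot" entry, once per host.
        hosts = sorted({slot.split("/")[0] for slot in exec_host.split("+")})
        for host in hosts:
            if wanted not in node_props.get(host, []):
                bad.append((job_id, host))
    return bad


if __name__ == "__main__":
    # Placeholder job IDs; the second job mirrors the behaviour described above.
    jobs = {
        "1.example": ("gpushort", "gpu01/0"),
        "2.example": ("optshort", "gpu01/1"),
    }
    print(misplaced_jobs(jobs, QUEUE_PROPERTY, NODE_PROPS))
```

A correctly working setup would give an empty list here.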
How can I fix this behaviour?
I'm clueless...
Any help is appreciated.
Greetings from Salzburg/Austria/Europe
Vlad Popa
University of Salzburg
Computer Science /HPC Computing
5020 Salzburg
Austria
Europe