[torqueusers] Job distribution does not do what it is supposed to do

vlad at cosy.sbg.ac.at vlad at cosy.sbg.ac.at
Wed Sep 14 12:08:54 MDT 2011


Hi!

I'm using Torque version 3.0.3-snap.201107121616.
I have set up several queues on a cluster where some nodes contain GPUs
and other nodes contain Opteron CPUs.

I have assigned the property "gpunode" to every node containing the
Nvidia GPUs and "opteron" to every node with our Magny-Cours Opterons
(which have no GPUs), following subsection 4.1.4 of the Torque manual.
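
(For reference, a sketch of how the properties are assigned in the nodes
file -- the path and exact entries are illustrative, assuming a default
$TORQUE_HOME layout; the real np counts and extra properties are in the
pbsnodes output below:)

# $TORQUE_HOME/server_priv/nodes  (excerpt, illustrative)
gpu01 np=8 gpus=2 i7 i7-new gpunode 16G
hex03 np=14 opteron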

One of my queues is called gpushort; its counterpart is optshort.
Jobs queued into gpushort should be directed to the GPU nodes, and jobs
queued into optshort should go to the Opteron nodes.
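
(Jobs are submitted by naming the queue, e.g. -- with myjob.sh standing
in for any job script:)

qsub -q gpushort myjob.sh    # should run on a node with the "gpunode" property
qsub -q optshort myjob.sh    # should run on a node with the "opteron" property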

I'm currently using Maui as the scheduler, but I have also briefly tried
pbs_sched, with the same result.

This is my output of pbsnodes:

gpu01
     state = free
     np = 8
     properties = i7,i7-new,gpunode,16G
     ntype = cluster
     status =
rectime=1316029218,varattr=,jobs=,state=free,netload=36006443357,gres=,loadave=4.00,ncpus=8,physmem=16315316kb,availmem=44122752kb,totmem=49083308kb,idletime=9431,nusers=1,nsessions=2,sessions=5046
32314,uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15
09:29:38 EDT 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 2
     gpu_status =
gpu[1]=gpu_id=0000:06:00.0;,gpu[0]=gpu_id=0000:05:00.0;,driver_ver=280.13,timestamp=Wed
Sep 14 19:45:19 2011

gpu02
     state = free
     np = 8
     properties = i7,12G,gpunode
     ntype = cluster
     status =
rectime=1316029233,varattr=,jobs=,state=free,netload=59511138356,gres=,loadave=3.99,ncpus=8,physmem=12187556kb,availmem=40056024kb,totmem=44955548kb,idletime=10142,nusers=0,nsessions=0,uname=Linux
gpu02 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 2
     gpu_status =
gpu[1]=gpu_id=0000:08:00.0;,gpu[0]=gpu_id=0000:07:00.0;,driver_ver=280.13,timestamp=Wed
Sep 14 15:41:53 2011

gpu03
     state = free
     np = 8
     properties = fermi,12G,gpunode,i7
     ntype = cluster
     status =
rectime=1316029202,varattr=,jobs=,state=free,netload=4100691397,gres=,loadave=4.00,ncpus=8,physmem=12189608kb,availmem=41308600kb,totmem=44957600kb,idletime=7941,nusers=0,nsessions=0,uname=Linux
gpu03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1
     gpu_status =
gpu[0]=gpu_id=0000:02:00.0;,driver_ver=280.13,timestamp=Wed Sep 14
15:43:29 2011

gpu04
     state = free
     np = 8
     properties = i7,gpunode,12G
     ntype = cluster
     status =
rectime=1316029210,varattr=,jobs=,state=free,netload=39234422480,gres=,loadave=4.00,ncpus=8,physmem=12187556kb,availmem=40432932kb,totmem=44955548kb,idletime=463010,nusers=0,nsessions=0,uname=Linux
gpu04 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
     gpu_status = driver_ver=UNKNOWN,timestamp=Wed Sep 14 21:46:43 2011

...
(and so on, up to gpu07)

...
(and the Opteron nodes, further below:)

hex03
     state = free
     np = 14
     properties = opteron
     ntype = cluster
     status =
rectime=1316029213,varattr=,jobs=,state=free,netload=22449278232,gres=,loadave=0.04,ncpus=16,physmem=32877076kb,availmem=97422188kb,totmem=98413068kb,idletime=7007,nusers=0,nsessions=0,uname=Linux
hex03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

hex04
     state = free
     np = 14
     properties = opteron
     ntype = cluster
     status =
rectime=1316029218,varattr=,jobs=,state=free,netload=72995554028,gres=,loadave=0.03,ncpus=16,physmem=32876308kb,availmem=83822708kb,totmem=98412300kb,idletime=7106,nusers=0,nsessions=0,uname=Linux
hex04 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

hex05
     state = free
     np = 14
     properties = opteron
     ntype = cluster
     status =
rectime=1316029221,varattr=,jobs=,state=free,netload=101419420599,gres=,loadave=0.00,ncpus=16,physmem=32876308kb,availmem=83854984kb,totmem=98412300kb,idletime=791803,nusers=0,nsessions=0,uname=Linux
hex05 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

...
(until  we get to hex14 ...)
hex14
     state = free
     np = 14
     properties = opteron
     ntype = cluster
     status =
rectime=1316030058,varattr=,jobs=,state=free,netload=24497857045,gres=,loadave=0.09,ncpus=16,physmem=32876308kb,availmem=83878088kb,totmem=98412300kb,idletime=706625,nusers=0,nsessions=0,uname=Linux
hex14 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

This is the qmgr configuration of the two queues:

...
#
# Create and define queue gpushort
#
create queue gpushort
set queue gpushort queue_type = Execution
set queue gpushort resources_default.neednodes = gpunode
set queue gpushort resources_default.nodes = 1
set queue gpushort resources_default.walltime = 24:00:00
set queue gpushort enabled = True
set queue gpushort started = True
#
# Create and define queue optshort
#
create queue optshort
set queue optshort queue_type = Execution
set queue optshort resources_default.neednodes = opteron
set queue optshort resources_default.nodes = 1
set queue optshort resources_default.walltime = 24:00:00
set queue optshort enabled = True
set queue optshort started = True
#
...

Now, if I submit jobs to gpushort, they are executed on the GPU nodes
(as they should be). If I submit jobs to optshort, they are supposed to
be executed on the Opteron nodes, but instead they also end up running
on the first GPU node (gpu01).
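
(For comparison, one can also request the property explicitly on the
command line, which should be equivalent to the queue's neednodes
default if the mapping were working -- myjob.sh again standing in for
any job script:)

qsub -q optshort -l nodes=1:opteron myjob.sh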

How can I change this behaviour?

I'm clueless...

Any help is appreciated.

Greetings from Salzburg/Austria/Europe

Vlad Popa

University of Salzburg
Computer Science /HPC Computing
5020 Salzburg
Austria
Europe

