[torqueusers] Job distribution does not do what it is supposed to do
Vlad Popa
vlad at cosy.sbg.ac.at
Wed Sep 14 23:52:49 MDT 2011
On 2011-09-14 22:39, Ken Nielson wrote:
> ----- Original Message -----
>> From: vlad at cosy.sbg.ac.at
>> To: torqueusers at supercluster.org
>> Sent: Wednesday, September 14, 2011 12:08:54 PM
>> Subject: [torqueusers] Job distribution does not do what it is supposed to do
>>
>> Hi!
>>
>> I'm using Torque version 3.0.3-snap.201107121616.
>> I have set up several queues on a cluster in which some nodes
>> contain GPUs and others have Opteron CPUs.
>>
>> I have assigned the property "gpunode" to every node containing
>> Nvidia GPUs and "opteron" to every node with our Magny-Cours
>> Opterons (which have no GPUs). (See subsection 4.1.4 of the Torque
>> manual.)
>>
>> One of my queues is called gpushort; its counterpart is optshort.
>> Jobs should be directed to the GPU nodes when queued into
>> "gpushort", and to the Opteron nodes when queued into "optshort".
>>
>> I'm now using Maui as the scheduler, but I have also briefly tried
>> pbs_sched, with the same result.
>>
>> This is my output of pbsnodes:
>>
>> gpu01
>> state = free
>> np = 8
>> properties = i7,i7-new,gpunode,16G
>> ntype = cluster
>> status =
>> rectime=1316029218,varattr=,jobs=,state=free,netload=36006443357,gres=,loadave=4.00,ncpus=8,physmem=16315316kb,availmem=44122752kb,totmem=49083308kb,idletime=9431,nusers=1,nsessions=2,sessions=5046
>> 32314,uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15
>> 09:29:38 EDT 2011 x86_64,opsys=linux
>> mom_service_port = 15002
>> mom_manager_port = 15003
>> gpus = 2
>> gpu_status =
>> gpu[1]=gpu_id=0000:06:00.0;,gpu[0]=gpu_id=0000:05:00.0;,driver_ver=280.13,timestamp=Wed
>> Sep 14 19:45:19 2011
>>
>> gpu02
>> state = free
>> np = 8
>> properties = i7,12G,gpunode
>> ntype = cluster
>> status =
>> rectime=1316029233,varattr=,jobs=,state=free,netload=59511138356,gres=,loadave=3.99,ncpus=8,physmem=12187556kb,availmem=40056024kb,totmem=44955548kb,idletime=10142,nusers=0,nsessions=0,uname=Linux
>> gpu02 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
>> x86_64,opsys=linux
>> mom_service_port = 15002
>> mom_manager_port = 15003
>> gpus = 2
>> gpu_status =
>> gpu[1]=gpu_id=0000:08:00.0;,gpu[0]=gpu_id=0000:07:00.0;,driver_ver=280.13,timestamp=Wed
>> Sep 14 15:41:53 2011
>>
>> gpu03
>> state = free
>> np = 8
>> properties = fermi,12G,gpunode,i7
>> ntype = cluster
>> status =
>> rectime=1316029202,varattr=,jobs=,state=free,netload=4100691397,gres=,loadave=4.00,ncpus=8,physmem=12189608kb,availmem=41308600kb,totmem=44957600kb,idletime=7941,nusers=0,nsessions=0,uname=Linux
>> gpu03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
>> x86_64,opsys=linux
>> mom_service_port = 15002
>> mom_manager_port = 15003
>> gpus = 1
>> gpu_status =
>> gpu[0]=gpu_id=0000:02:00.0;,driver_ver=280.13,timestamp=Wed Sep 14
>> 15:43:29 2011
>>
>> gpu04
>> state = free
>> np = 8
>> properties = i7,gpunode,12G
>> ntype = cluster
>> status =
>> rectime=1316029210,varattr=,jobs=,state=free,netload=39234422480,gres=,loadave=4.00,ncpus=8,physmem=12187556kb,availmem=40432932kb,totmem=44955548kb,idletime=463010,nusers=0,nsessions=0,uname=Linux
>> gpu04 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
>> x86_64,opsys=linux
>> mom_service_port = 15002
>> mom_manager_port = 15003
>> gpus = 0
>> gpu_status = driver_ver=UNKNOWN,timestamp=Wed Sep 14 21:46:43
>> 2011
>>
>> ...
>> (and so on until gpu07..)
>>
>> ..
>> (and my Opterons, further below ..)
>>
>> hex03
>> state = free
>> np = 14
>> properties = opteron
>> ntype = cluster
>> status =
>> rectime=1316029213,varattr=,jobs=,state=free,netload=22449278232,gres=,loadave=0.04,ncpus=16,physmem=32877076kb,availmem=97422188kb,totmem=98413068kb,idletime=7007,nusers=0,nsessions=0,uname=Linux
>> hex03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
>> x86_64,opsys=linux
>> mom_service_port = 15002
>> mom_manager_port = 15003
>> gpus = 0
>>
>> hex04
>> state = free
>> np = 14
>> properties = opteron
>> ntype = cluster
>> status =
>> rectime=1316029218,varattr=,jobs=,state=free,netload=72995554028,gres=,loadave=0.03,ncpus=16,physmem=32876308kb,availmem=83822708kb,totmem=98412300kb,idletime=7106,nusers=0,nsessions=0,uname=Linux
>> hex04 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
>> x86_64,opsys=linux
>> mom_service_port = 15002
>> mom_manager_port = 15003
>> gpus = 0
>>
>> hex05
>> state = free
>> np = 14
>> properties = opteron
>> ntype = cluster
>> status =
>> rectime=1316029221,varattr=,jobs=,state=free,netload=101419420599,gres=,loadave=0.00,ncpus=16,physmem=32876308kb,availmem=83854984kb,totmem=98412300kb,idletime=791803,nusers=0,nsessions=0,uname=Linux
>> hex05 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
>> x86_64,opsys=linux
>> mom_service_port = 15002
>> mom_manager_port = 15003
>> gpus = 0
>>
>> ...
>> (until we get to hex14 ...)
>> hex14
>> state = free
>> np = 14
>> properties = opteron
>> ntype = cluster
>> status =
>> rectime=1316030058,varattr=,jobs=,state=free,netload=24497857045,gres=,loadave=0.09,ncpus=16,physmem=32876308kb,availmem=83878088kb,totmem=98412300kb,idletime=706625,nusers=0,nsessions=0,uname=Linux
>> hex14 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
>> x86_64,opsys=linux
>> mom_service_port = 15002
>> mom_manager_port = 15003
>> gpus = 0
>>
>> The qmgr configuration for the two queues:
>>
>> ...
>> #
>> # Create and define queue gpushort
>> #
>> create queue gpushort
>> set queue gpushort queue_type = Execution
>> set queue gpushort resources_default.neednodes = gpunode
>> set queue gpushort resources_default.nodes = 1
>> set queue gpushort resources_default.walltime = 24:00:00
>> set queue gpushort enabled = True
>> set queue gpushort started = True
>> #
>> # Create and define queue optshort
>> #
>> create queue optshort
>> set queue optshort queue_type = Execution
>> set queue optshort resources_default.neednodes = opteron
>> set queue optshort resources_default.nodes = 1
>> set queue optshort resources_default.walltime = 24:00:00
>> set queue optshort enabled = True
>> set queue optshort started = True
>> #
>> ...
>>
>> Now, if you submit jobs to gpushort, they get executed on the
>> GPU nodes (as they should be). If you submit jobs to optshort,
>> they are supposed to be executed on the Opterons, but instead
>> they end up running on the first GPU node (gpu01) as well.
>>
>> How can I change this bad behaviour?
>>
>> I'm clueless...
>>
>> Any help appreciated..
>>
>> Greetings from Salzburg/Austria/Europe
>>
>> Vlad Popa
>>
>> University of Salzburg
>> Computer Science /HPC Computing
>> 5020 Salzburg
>> Austria
>> Europe
> We need someone to modify Maui to support GPUs. pbs_sched does not currently support GPUs either. At present, only Moab knows about GPUs at the scheduler level.
Yes, that may be, but the jobs in my queues are still not directed to
the nodes with the right properties. I don't think that would change
if I chose different property names.
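
For anyone who wants to double-check which nodes carry which property,
here is a minimal sketch in Python that groups nodes by property from
pbsnodes-style text. It assumes that each node record starts with the
node name on its own line, is followed by "key = value" lines, and ends
with a blank line. The function names parse_pbsnodes and
nodes_by_property are hypothetical helpers, not part of Torque, and the
sample text is a shortened excerpt of the listing above.

```python
from collections import defaultdict


def parse_pbsnodes(text):
    """Map each node name to its attribute dict from pbsnodes-style text."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None          # a blank line ends a node record
            continue
        if current is None:
            current = line          # first line of a record is the node name
            nodes[current] = {}
        elif " = " in line:
            key, value = line.split(" = ", 1)
            nodes[current][key] = value
        # lines without " = " (e.g. wrapped status text) are ignored
    return nodes


def nodes_by_property(nodes):
    """Invert the mapping: property name -> sorted list of node names."""
    groups = defaultdict(list)
    for name, attrs in nodes.items():
        for prop in attrs.get("properties", "").split(","):
            if prop:
                groups[prop].append(name)
    return {prop: sorted(names) for prop, names in groups.items()}


# Shortened sample in the assumed layout (hypothetical excerpt).
SAMPLE = """\
gpu01
     state = free
     np = 8
     properties = i7,i7-new,gpunode,16G

hex03
     state = free
     np = 14
     properties = opteron
"""

if __name__ == "__main__":
    groups = nodes_by_property(parse_pbsnodes(SAMPLE))
    for prop, names in sorted(groups.items()):
        print(prop, names)
```

This only confirms that the properties are set as intended on the
server side; it says nothing about how the scheduler interprets them.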