[torqueusers] Job distribution does not do what it is supposed to do

Vlad Popa vlad at cosy.sbg.ac.at
Wed Sep 14 23:52:49 MDT 2011


Am 2011-09-14 22:39, schrieb Ken Nielson:
> ----- Original Message -----
>> From: vlad at cosy.sbg.ac.at
>> To: torqueusers at supercluster.org
>> Sent: Wednesday, September 14, 2011 12:08:54 PM
>> Subject: [torqueusers] Job distribution does not do what it is supposed to do
>>
>> Hi!
>>
>> I'm using Torque version 3.0.3-snap.201107121616.
>> I have set up several queues on a cluster with nodes containing GPUs
>> and other nodes with Opteron CPUs.
>>
>> I have assigned the property "gpunode" to every node containing the
>> Nvidia GPUs and "opteron" to every node with our Magny-Cours Opterons
>> (which lack any GPUs), following subsection 4.1.4 of the Torque manual.
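>>
>> For reference, a rough sketch of how such property assignments look in
>> $TORQUE_HOME/server_priv/nodes (the exact file contents here are my
>> illustration, not a copy of the real file):
>>
>>    gpu01  np=8   gpus=2  i7 i7-new gpunode 16G
>>    hex03  np=14  opteron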
>>
>> One of my queues is called gpushort; its counterpart is optshort.
>> Jobs should be directed to the GPU nodes when they are queued into
>> gpushort, and to the Opteron nodes when they are queued into optshort.
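>>
>> In other words, submissions like the following (just a sketch; the
>> script names are placeholders) should end up on the matching nodes:
>>
>>    qsub -q gpushort my_gpu_job.sh
>>    qsub -q optshort my_opteron_job.sh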
>>
>> I'm now using Maui as the scheduler, but I have also briefly tried
>> pbs_sched, with the same result.
>>
>> This is the output of pbsnodes:
>>
>> gpu01
>>       state = free
>>       np = 8
>>       properties = i7,i7-new,gpunode,16G
>>       ntype = cluster
>>       status =
>> rectime=1316029218,varattr=,jobs=,state=free,netload=36006443357,gres=,loadave=4.00,ncpus=8,physmem=16315316kb,availmem=44122752kb,totmem=49083308kb,idletime=9431,nusers=1,nsessions=2,sessions=5046
>> 32314,uname=Linux gpu01 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15
>> 09:29:38 EDT 2011 x86_64,opsys=linux
>>       mom_service_port = 15002
>>       mom_manager_port = 15003
>>       gpus = 2
>>       gpu_status =
>> gpu[1]=gpu_id=0000:06:00.0;,gpu[0]=gpu_id=0000:05:00.0;,driver_ver=280.13,timestamp=Wed
>> Sep 14 19:45:19 2011
>>
>> gpu02
>>       state = free
>>       np = 8
>>       properties = i7,12G,gpunode
>>       ntype = cluster
>>       status =
>> rectime=1316029233,varattr=,jobs=,state=free,netload=59511138356,gres=,loadave=3.99,ncpus=8,physmem=12187556kb,availmem=40056024kb,totmem=44955548kb,idletime=10142,nusers=0,nsessions=0,uname=Linux
>> gpu02 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
>> x86_64,opsys=linux
>>       mom_service_port = 15002
>>       mom_manager_port = 15003
>>       gpus = 2
>>       gpu_status =
>> gpu[1]=gpu_id=0000:08:00.0;,gpu[0]=gpu_id=0000:07:00.0;,driver_ver=280.13,timestamp=Wed
>> Sep 14 15:41:53 2011
>>
>> gpu03
>>       state = free
>>       np = 8
>>       properties = fermi,12G,gpunode,i7
>>       ntype = cluster
>>       status =
>> rectime=1316029202,varattr=,jobs=,state=free,netload=4100691397,gres=,loadave=4.00,ncpus=8,physmem=12189608kb,availmem=41308600kb,totmem=44957600kb,idletime=7941,nusers=0,nsessions=0,uname=Linux
>> gpu03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
>> x86_64,opsys=linux
>>       mom_service_port = 15002
>>       mom_manager_port = 15003
>>       gpus = 1
>>       gpu_status =
>> gpu[0]=gpu_id=0000:02:00.0;,driver_ver=280.13,timestamp=Wed Sep 14
>> 15:43:29 2011
>>
>> gpu04
>>       state = free
>>       np = 8
>>       properties = i7,gpunode,12G
>>       ntype = cluster
>>       status =
>> rectime=1316029210,varattr=,jobs=,state=free,netload=39234422480,gres=,loadave=4.00,ncpus=8,physmem=12187556kb,availmem=40432932kb,totmem=44955548kb,idletime=463010,nusers=0,nsessions=0,uname=Linux
>> gpu04 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
>> x86_64,opsys=linux
>>       mom_service_port = 15002
>>       mom_manager_port = 15003
>>       gpus = 0
>>       gpu_status = driver_ver=UNKNOWN,timestamp=Wed Sep 14 21:46:43
>>       2011
>>
>> ...
>> (and so on until gpu07 ...)
>>
>> ...
>> (and my Opteron nodes, further below ...)
>>
>> hex03
>>       state = free
>>       np = 14
>>       properties = opteron
>>       ntype = cluster
>>       status =
>> rectime=1316029213,varattr=,jobs=,state=free,netload=22449278232,gres=,loadave=0.04,ncpus=16,physmem=32877076kb,availmem=97422188kb,totmem=98413068kb,idletime=7007,nusers=0,nsessions=0,uname=Linux
>> hex03 2.6.32-131.6.1.el6.x86_64 #1 SMP Fri Jul 15 09:29:38 EDT 2011
>> x86_64,opsys=linux
>>       mom_service_port = 15002
>>       mom_manager_port = 15003
>>       gpus = 0
>>
>> hex04
>>       state = free
>>       np = 14
>>       properties = opteron
>>       ntype = cluster
>>       status =
>> rectime=1316029218,varattr=,jobs=,state=free,netload=72995554028,gres=,loadave=0.03,ncpus=16,physmem=32876308kb,availmem=83822708kb,totmem=98412300kb,idletime=7106,nusers=0,nsessions=0,uname=Linux
>> hex04 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
>> x86_64,opsys=linux
>>       mom_service_port = 15002
>>       mom_manager_port = 15003
>>       gpus = 0
>>
>> hex05
>>       state = free
>>       np = 14
>>       properties = opteron
>>       ntype = cluster
>>       status =
>> rectime=1316029221,varattr=,jobs=,state=free,netload=101419420599,gres=,loadave=0.00,ncpus=16,physmem=32876308kb,availmem=83854984kb,totmem=98412300kb,idletime=791803,nusers=0,nsessions=0,uname=Linux
>> hex05 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
>> x86_64,opsys=linux
>>       mom_service_port = 15002
>>       mom_manager_port = 15003
>>       gpus = 0
>>
>> ...
>> (until we get to hex14 ...)
>> hex14
>>       state = free
>>       np = 14
>>       properties = opteron
>>       ntype = cluster
>>       status =
>> rectime=1316030058,varattr=,jobs=,state=free,netload=24497857045,gres=,loadave=0.09,ncpus=16,physmem=32876308kb,availmem=83878088kb,totmem=98412300kb,idletime=706625,nusers=0,nsessions=0,uname=Linux
>> hex14 2.6.32-131.12.1.el6.x86_64 #1 SMP Tue Aug 23 10:52:23 EDT 2011
>> x86_64,opsys=linux
>>       mom_service_port = 15002
>>       mom_manager_port = 15003
>>       gpus = 0
>>
>> The qmgr configuration for the two queues:
>>
>> ...
>> #
>> # Create and define queue gpushort
>> #
>> create queue gpushort
>> set queue gpushort queue_type = Execution
>> set queue gpushort resources_default.neednodes = gpunode
>> set queue gpushort resources_default.nodes = 1
>> set queue gpushort resources_default.walltime = 24:00:00
>> set queue gpushort enabled = True
>> set queue gpushort started = True
>> #
>> # Create and define queue optshort
>> #
>> create queue optshort
>> set queue optshort queue_type = Execution
>> set queue optshort resources_default.neednodes = opteron
>> set queue optshort resources_default.nodes = 1
>> set queue optshort resources_default.walltime = 24:00:00
>> set queue optshort enabled = True
>> set queue optshort started = True
>> #
>> ...
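>>
>> For completeness, the queue definitions can be re-checked on the
>> server at any time with, for example:
>>
>>    qmgr -c "print queue gpushort"
>>    qmgr -c "print queue optshort"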
>>
>> Now, if I submit jobs to gpushort, they get executed on the gpunodes
>> (as they should be). If I submit jobs to optshort, they are supposed
>> to be executed on the Opterons, but instead they are found to be
>> executed on the first gpunode (gpu01) as well.
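>>
>> (A quick way to confirm where a job actually ran, as a sketch with the
>> job id as a placeholder:
>>
>>    qstat -f <jobid> | grep exec_host
>>
>> For the optshort jobs this reports gpu01 instead of one of the hex
>> nodes.)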
>>
>> How can I change this faulty behaviour?
>>
>> I'm clueless...
>>
>> Any help is appreciated.
>>
>> Greetings from Salzburg/Austria/Europe
>>
>> Vlad Popa
>>
>> University of Salzburg
>> Computer Science /HPC Computing
>> 5020 Salzburg
>> Austria
>> Europe
> We need someone to modify Maui to support GPUs; pbs_sched does not currently support GPUs either. At the moment, only Moab knows about GPUs at the scheduler level.
Yes, that may be, but my queued jobs are still not directed to the nodes 
with the right properties. I don't think it would change if I chose 
different property names.


