[torqueusers] Job won't start when gpus=1 requested.

Peter A. Gustafson, PhD peter.gustafson at wmich.edu
Thu Aug 22 21:52:14 MDT 2013


OK. Thanks for the info. Somehow I missed that about Maui in the documentation.  I'll try GRES.
Pete
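
For anyone following this thread later, here is a rough sketch of what the GRES route might look like once Maui has been rebuilt with the patch. The resource name "gpu", the NODECFG line, and the -W x=GRES:gpu@1 submission syntax are assumptions based on common Maui setups, not something confirmed in this thread, so check them against your own Maui build before relying on them.

```shell
# maui.cfg (assumed syntax): declare one consumable "gpu" resource on node n10.
# This line belongs in maui.cfg, not in the job script:
#   NODECFG[n10] GRES=gpu:1

# Job script (assumed syntax): request the k20 node property through Torque
# and the GPU through a Maui generic resource, instead of gpus=1.
#PBS -l nodes=1:ppn=1:k20
#PBS -W x=GRES:gpu@1
```

The gpus=1 token is dropped from the nodes= request here because, in the quoted checkjob output, Maui lists it under Features: [k20][gpus=1], i.e. it is treated as a node feature that no node advertises.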


Matteo Ragni <matteo.ragni.it at gmail.com> wrote:
>Maui doesn't support gpus requests. We solved this issue by recompiling
>Maui with support for GPUs as a general-purpose consumable resource (GRES).
>There's a patch for this:
>
>http://www.clusterresources.com/pipermail/mauiusers/2008-August/003486.html
>
>
>
>
>
>2013/8/16 Peter A. Gustafson <peter.gustafson at wmich.edu>
>
>>  Hi all,
>> I'm trying to manage the gpu resources.  My nodes file appears to be correct
>> and pbsnodes reports that gpus are present.  However, when I submit requesting
>> gpus the job enters a deferred state.  The queue appears to allow gpu use.  Any
>> suggestions?
>>
>> Many thanks,
>> Pete
>>
>> Torque version: 2.5.10
>> Maui version: 3.3.1
>>
>> Example below:
>>
>> # pbsnodes n10
>> n10
>>      state = free
>>      np = 16
>>      properties = research,k20
>>      ntype = cluster
>>      status = rectime=1376676818,varattr=,jobs=,state=free,netload=50681542816,gres=,loadave=0.00,ncpus=16,physmem=132272332kb,availmem=139195740kb,totmem=140666252kb,idletime=5204925,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n10 2.6.32-279.2.1.el6.631g0000.x86_64 #1 SMP Sun Jul 22 22:39:16 EDT 2012 x86_64,opsys=linux
>>      gpus = 1
>>
>> set queue abaqus queue_type = Execution
>> set queue abaqus Priority = 20
>> set queue abaqus max_running = 2
>> set queue abaqus resources_max.nodes = 1:ppn=8:gpus=1
>> set queue abaqus resources_min.nodes = 1
>> set queue abaqus resources_default.nodes = 1:ppn=4:gpus=1
>> set queue abaqus resources_default.walltime = 02:00:00
>> set queue abaqus keep_completed = 300
>> set queue abaqus enabled = True
>> set queue abaqus started = True
>> #
>>
>>
>>
>> When submission includes:
>> #PBS -l nodes=1:ppn=1:k20
>> it runs fine.
>>
>> When submission includes:
>> #PBS -l nodes=1:ppn=1:gpus=1:k20
>> I get deferred for no resources as below.
>>
>> $ checkjob 1901[1]
>> checking job 1901[1]
>>
>> State: Idle  EState: Deferred
>> Creds:  user:gustafson  group:pi  class:abaqus  qos:DEFAULT
>> WallTime: 00:00:00 of 41:16:00:00
>> SubmitTime: Fri Aug 16 14:17:19
>>   (Time Queued  Total: 00:02:09  Eligible: 00:00:22)
>>
>> Total Tasks: 1
>>
>> Req[0]  TaskCount: 1  Partition: ALL
>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>> Opsys: [NONE]  Arch: [NONE]  Features: [k20][gpus=1]
>> Dedicated Resources Per Task: PROCS: 1  MEM: 100G
>>
>>
>> IWD: [NONE]  Executable:  [NONE]
>> Bypass: 0  StartCount: 0
>> PartitionMask: [ALL]
>> Flags:       RESTARTABLE
>>
>> job is deferred.  Reason:  NoResources  (cannot create reservation for job '1901[1]' (intital reservation attempt)
>> )
>> Holds:    Defer  (hold reason:  NoResources)
>> PE:  11.71  StartPriority:  1
>> cannot select job 1901[1] for partition DEFAULT (job hold active)
>>
>>
>>
>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>
>
>