[torqueusers] Job won't start when gpus=1 requested.

Matteo Ragni matteo.ragni.it at gmail.com
Thu Aug 22 02:38:24 MDT 2013


Maui doesn't support gpus requests. We solved this issue by recompiling
Maui with support for GPUs as a generic consumable resource (GRES).
There's a patch for this:

http://www.clusterresources.com/pipermail/mauiusers/2008-August/003486.html
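A quick way to confirm this is what is happening: in the checkjob output quoted below, the gpus=1 token shows up on the Features line, next to k20, which suggests Maui is matching it as a node property rather than as a resource count. The following is only a minimal sketch of that check. The function name gpu_as_feature is made up for illustration, and the sample line is copied from the quoted output; on a live system you would pipe the output of checkjob into the function instead.

```shell
#!/bin/sh
# Sketch: detect a gpus=N request that Maui has parsed as a node feature.
# The sample line is copied from the checkjob output quoted in this thread.
checkjob_output='Opsys: [NONE]  Arch: [NONE]  Features: [k20][gpus=1]'

# Hypothetical helper: reads checkjob output on stdin and succeeds if a
# Features: line contains a bracketed gpus=N entry.
gpu_as_feature() {
    grep -Eq 'Features:.*\[gpus=[0-9]+\]'
}

if printf '%s\n' "$checkjob_output" | gpu_as_feature; then
    result="gpus request parsed as a node feature"
else
    result="no gpus feature found"
fi
echo "$result"
```

If the check matches, the job is being held because no node advertises a property literally named gpus=1, which fits the NoResources deferral.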





2013/8/16 Peter A. Gustafson <peter.gustafson at wmich.edu>

> Hi all,
> I'm trying to manage GPU resources.  My nodes file appears to be correct
> and pbsnodes reports that GPUs are present.  However, when I submit a job
> requesting gpus, the job enters a deferred state.  The queue appears to allow
> GPU use.  Any suggestions?
>
> Many thanks,
> Pete
>
> Torque version: 2.5.10
> Maui version: 3.3.1
>
> Example below:
>
> # pbsnodes n10
> n10
>      state = free
>      np = 16
>      properties = research,k20
>      ntype = cluster
>      status = rectime=1376676818,varattr=,jobs=,state=free,netload=50681542816,gres=,loadave=0.00,ncpus=16,physmem=132272332kb,availmem=139195740kb,totmem=140666252kb,idletime=5204925,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n10 2.6.32-279.2.1.el6.631g0000.x86_64 #1 SMP Sun Jul 22 22:39:16 EDT 2012 x86_64,opsys=linux
>      gpus = 1
>
> set queue abaqus queue_type = Execution
> set queue abaqus Priority = 20
> set queue abaqus max_running = 2
> set queue abaqus resources_max.nodes = 1:ppn=8:gpus=1
> set queue abaqus resources_min.nodes = 1
> set queue abaqus resources_default.nodes = 1:ppn=4:gpus=1
> set queue abaqus resources_default.walltime = 02:00:00
> set queue abaqus keep_completed = 300
> set queue abaqus enabled = True
> set queue abaqus started = True
> #
>
>
>
> When submission includes:
> #PBS -l nodes=1:ppn=1:k20
> it runs fine.
>
> When submission includes:
> #PBS -l nodes=1:ppn=1:gpus=1:k20
> I get deferred for no resources as below.
>
> $ checkjob 1901[1]
> checking job 1901[1]
>
> State: Idle  EState: Deferred
> Creds:  user:gustafson  group:pi  class:abaqus  qos:DEFAULT
> WallTime: 00:00:00 of 41:16:00:00
> SubmitTime: Fri Aug 16 14:17:19
>   (Time Queued  Total: 00:02:09  Eligible: 00:00:22)
>
> Total Tasks: 1
>
> Req[0]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [k20][gpus=1]
> Dedicated Resources Per Task: PROCS: 1  MEM: 100G
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
>
> job is deferred.  Reason:  NoResources  (cannot create reservation for job
> '1901[1]' (intital reservation attempt)
> )
> Holds:    Defer  (hold reason:  NoResources)
> PE:  11.71  StartPriority:  1
> cannot select job 1901[1] for partition DEFAULT (job hold active)
>
>
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>