[torqueusers] Job won't start when gpus=1 requested.

Peter A. Gustafson peter.gustafson at wmich.edu
Fri Aug 16 12:42:58 MDT 2013


Hi all,
I'm trying to manage GPU resources.  My nodes file appears to be
correct, and pbsnodes reports that the GPUs are present.  However, when
I submit a job requesting GPUs, it enters a deferred state.  The queue
appears to allow GPU use.  Any suggestions?

Many thanks,
Pete

Torque version: 2.5.10
Maui version: 3.3.1

Example below:

# pbsnodes n10
n10
     state = free
     np = 16
     properties = research,k20
     ntype = cluster
     status = rectime=1376676818,varattr=,jobs=,state=free,netload=50681542816,gres=,loadave=0.00,ncpus=16,physmem=132272332kb,availmem=139195740kb,totmem=140666252kb,idletime=5204925,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n10 2.6.32-279.2.1.el6.631g0000.x86_64 #1 SMP Sun Jul 22 22:39:16 EDT 2012 x86_64,opsys=linux
     gpus = 1
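
For reference, a nodes-file entry consistent with the pbsnodes output
above would look roughly like the sketch below.  It is reconstructed
from that output (np, gpus, and the two properties), not copied from
the actual server_priv/nodes file, so treat the exact line as an
assumption.

```
# Sketch of a Torque nodes-file entry for n10 (reconstructed, not verbatim)
# name  cores   GPUs     node properties
n10     np=16   gpus=1   research k20
```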

set queue abaqus queue_type = Execution
set queue abaqus Priority = 20
set queue abaqus max_running = 2
set queue abaqus resources_max.nodes = 1:ppn=8:gpus=1
set queue abaqus resources_min.nodes = 1
set queue abaqus resources_default.nodes = 1:ppn=4:gpus=1
set queue abaqus resources_default.walltime = 02:00:00
set queue abaqus keep_completed = 300
set queue abaqus enabled = True
set queue abaqus started = True
#



When the submission includes:
#PBS -l nodes=1:ppn=1:k20
the job runs fine.

When the submission includes:
#PBS -l nodes=1:ppn=1:gpus=1:k20
the job is deferred with reason NoResources, as shown below.

$ checkjob 1901[1]
checking job 1901[1]

State: Idle  EState: Deferred
Creds:  user:gustafson  group:pi  class:abaqus  qos:DEFAULT
WallTime: 00:00:00 of 41:16:00:00
SubmitTime: Fri Aug 16 14:17:19
  (Time Queued  Total: 00:02:09  Eligible: 00:00:22)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [k20][gpus=1]
Dedicated Resources Per Task: PROCS: 1  MEM: 100G


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

job is deferred.  Reason:  NoResources  (cannot create reservation for job '1901[1]' (intital reservation attempt)
)
Holds:    Defer  (hold reason:  NoResources)
PE:  11.71  StartPriority:  1
cannot select job 1901[1] for partition DEFAULT (job hold active)
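
A stripped-down job script of the kind that triggers this might look
like the sketch below.  Only the queue name and the nodes request come
from the output above.  The real job is an array job with memory and
walltime requests, which are left out here, and the payload line is a
placeholder.

```
#!/bin/bash
# Minimal reproduction sketch; not the actual job script.
#PBS -q abaqus
# Deferred with gpus=1; runs when the request is nodes=1:ppn=1:k20
#PBS -l nodes=1:ppn=1:gpus=1:k20

# Placeholder payload (assumption): report which node the job landed on
hostname
```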



