[Mauiusers] Mystery Features Preventing Jobs from Running

Steve Crusan scrusan at ur.rochester.edu
Thu Jul 21 15:30:49 MDT 2011


What happens if you just do a simple qsub that names the node directly, like this?

qsub -I -l nodes=fu48core.esl

We define features for every node. I think the reason you might be having trouble is that your node has no features defined (your checknode output shows Features: [NONE]).

For example, from our pbs/server_priv/nodes:

bh001 np=4 compute

Then set this attribute on the particular queue:

resources_default.neednodes = compute

- From there, Maui queries TORQUE and learns that node bh001 has the compute feature, so a job submitted to that queue should be mapped to bh001 via the node features.
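
As a rough sketch of the queue side: I'm assuming the queue is the batch queue shown in your checkjob output and reusing compute as the feature name, so substitute whatever you actually use. Note that qmgr spells the attribute resources_default.neednodes. The second command just prints the queue definition so you can confirm the setting took.

```
qmgr -c "set queue batch resources_default.neednodes = compute"
qmgr -c "print queue batch"
```

The node itself also needs the matching feature on its line in the nodes file, as in the bh001 example.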

I'm actually not sure whether you can submit jobs and have them run on nodes without defining node features.
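
If it helps to check which nodes actually carry a given feature, something like the following could work. This is only a sketch: nodes_with_feature is a hypothetical helper, not a TORQUE or Maui command, and the sample file is made-up data modeled on the bh001 line and on your node with no features. Point it at your real pbs/server_priv/nodes instead.

```shell
# List the nodes in a TORQUE nodes file that carry a given feature.
# Usage: nodes_with_feature <nodes-file> <feature>   (hypothetical helper)
nodes_with_feature() {
    awk -v feat="$2" '
        $1 ~ /^#/ { next }                      # skip comment lines
        {
            # fields after the hostname are np=... settings or feature names
            for (i = 2; i <= NF; i++)
                if ($i == feat) { print $1; break }
        }
    ' "$1"
}

# Sample data only: bh001 has the compute feature, fu48core.esl has none.
nodes_file=$(mktemp)
printf '%s\n' 'bh001 np=4 compute' 'fu48core.esl np=48' > "$nodes_file"

nodes_with_feature "$nodes_file" compute
```

With the sample data this prints bh001 only. A node that is missing from the output has no such feature, so a job that needs it would be rejected on that node.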




On Jul 20, 2011, at 6:59 PM, Caleb Phillips wrote:

> Hello all:
> 
> I'm running torque 2.3.6 (packaged with Ubuntu 10.10) and maui 3.3.1. 
> I'm having an issue where submitted jobs sit in the queue indefinitely. 
> This was occurring with pbs_sched, so I installed maui hoping it would 
> fix the problem. With maui, I have more information about the problem, 
> but no resolution. I've spent several hours searching the torqueusers 
> and mauiusers mailing lists, and reading the manuals, to no avail. I 
> hope you can help...
> 
> As far as I can tell, maui is complaining that there are not sufficient 
> "feasible procs" for jobs to run because of a lack of "features". My 
> nodes have no features enabled, and I'm not requesting any with my jobs. 
> Yet, the jobs show up with "[1][ppn=1]" in the feature list. I don't 
> know where these features are coming from or how to unset them, or if 
> that's really the source of the problem (it's simply my best guess). Any 
> ideas?
> 
> Here's more information on my setup and how I reproduce the problem:
> 
> I have one node (currently online). It has 48 processors:
> 
>> caleb@torqueserver:~$ qnodes
>> fu48core.esl
>>     state = free
>>     np = 48
>>     ntype = cluster
>>     status = opsys=linux,uname=Linux 48core 2.6.32-25-server #45-Ubuntu SMP Sat Oct 16 20:06:58 UTC 2010 x86_64,sessions=2834 5874 12296 13555 19465 17575,nsessions=6,nusers=3,idletime=2308,totmem=82007668kb,availmem=73380372kb,physmem=82007668kb,ncpus=48,loadave=2.19,netload=24944834533,state=free,jobs=,varattr=,rectime=1311202191
> 
> It's free and presumably happy:
> 
>> caleb@torqueserver:/usr/local/maui$ checknode fu48core
>> 
>> checking node fu48core.esl
>> 
>> State:      Idle  (in current state for 5:15:40)
>> Configured Resources: PROCS: 48  MEM: 78G  SWAP: 78G  DISK: 1M
>> Utilized   Resources: SWAP: 8426M
>> Dedicated  Resources: [NONE]
>> Opsys:         linux  Arch:      [NONE]
>> Speed:      1.00  Load:       2.240
>> Network:    [DEFAULT]
>> Features:   [NONE]
>> Attributes: [Batch]
>> Classes:    [batch 48:48][amplhack 48:48][qualnet 48:48][lightweight 48:48]
>> 
>> Total Time: 6:19:49  Up: 6:19:49 (100.00%)  Active: 00:00:00 (0.00%)
>> 
>> Reservations:
>> NOTE:  no reservations on node
> 
> The batch queue is empty. If I submit a very basic job (I've tried more 
> complicated jobs too, with specific resource requests), it gets deferred 
> immediately:
> 
>> caleb@torqueserver:/usr/local/maui$ echo "sleep 30" | qsub
>> 25.torqueserver.esl
>> caleb@torqueserver:/usr/local/maui$ checkjob 25
>> checking job 25
>> 
>> State: Idle  EState: Deferred
>> Creds:  user:caleb  group:abelian  class:batch  qos:DEFAULT
>> WallTime: 00:00:00 of 1:00:00:00
>> SubmitTime: Wed Jul 20 16:52:37
>>  (Time Queued  Total: 00:00:31  Eligible: 00:00:00)
>> 
>> Total Tasks: 1
>> 
>> Req[0]  TaskCount: 1  Partition: ALL
>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>> Opsys: [NONE]  Arch: [NONE]  Features: [1][ppn=1]
>> NodeCount: 1
>> 
>> IWD: [NONE]  Executable:  [NONE]
>> Bypass: 0  StartCount: 0
>> PartitionMask: [ALL]
>> Flags:       RESTARTABLE
>> 
>> job is deferred.  Reason:  NoResources  (cannot create reservation for job '25' (intital reservation attempt)
>> )
>> Holds:    Defer  (hold reason:  NoResources)
>> PE:  1.00  StartPriority:  1
>> cannot select job 25 for partition DEFAULT (job hold active)
> 
> If I release the job, I can see that maui's complaining about a lack of 
> feasible procs due to unavailable features:
> 
>> caleb@torqueserver:/usr/local/maui$ releasehold 25
>> 
>> job holds adjusted
>> caleb@torqueserver:/usr/local/maui$ checkjob -v 25
>> 
>> 
>> checking job 25 (RM job '25.torqueserver.esl')
>> 
>> State: Idle
>> Creds:  user:caleb  group:abelian  class:batch  qos:DEFAULT
>> WallTime: 00:00:00 of 1:00:00:00
>> SubmitTime: Wed Jul 20 16:52:37
>>  (Time Queued  Total: 00:04:39  Eligible: 00:02:35)
>> 
>> Total Tasks: 1
>> 
>> Req[0]  TaskCount: 1  Partition: ALL
>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>> Opsys: [NONE]  Arch: [NONE]  Features: [1][ppn=1]
>> Exec:  ''  ExecSize: 0  ImageSize: 0
>> Dedicated Resources Per Task: PROCS: 1
>> NodeAccess: SHARED
>> NodeCount: 1
>> 
>> 
>> IWD: [NONE]  Executable:  [NONE]
>> Bypass: 0  StartCount: 0
>> PartitionMask: [ALL]
>> Flags:       RESTARTABLE
>> 
>> Messages:  cannot create reservation for job '25' (intital reservation attempt)
>> 
>> PE:  1.00  StartPriority:  2
>> job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
>> idle procs:  48  feasible procs:   0
>> 
>> Rejection Reasons: [Features     :    1]
>> 
>> Detailed Node Availability Information:
>> 
>> fu48core.esl             rejected : Features
> 
> There are no error messages in the torque server_log, maui's log file, 
> or the node's mom_log. In fact, my node never even sees the job since 
> maui never decides to run it.
> 
> Any help you can provide would be extremely helpful. Thanks!
> 
> --
> Caleb Phillips, Ph.D. Candidate
> Computer Science Department
> University of Colorado, Boulder
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers

 ----------------------
 Steve Crusan
 System Administrator
 Center for Research Computing
 University of Rochester
 https://www.crc.rochester.edu/

