[Mauiusers] Mystery Features Preventing Jobs from Running

Caleb Phillips cphillips at smallwhitecube.com
Wed Jul 20 16:59:32 MDT 2011


Hello all:

I'm running torque 2.3.6 (packaged with Ubuntu 10.10) and maui 3.3.1. 
I'm having an issue where submitted jobs sit in the queue indefinitely. 
This was occurring with pbs_sched, so I installed maui hoping it would 
fix the problem. With maui, I have more information about the problem, 
but no resolution. I've spent several hours searching the torqueusers 
and mauiusers mailing lists, and reading the manuals, to no avail. I 
hope you can help...

As far as I can tell, maui is complaining that there are not sufficient 
"feasible procs" for jobs to run because of a lack of "features". My 
nodes have no features enabled, and I'm not requesting any with my jobs. 
Yet, the jobs show up with "[1][ppn=1]" in the feature list. I don't 
know where these features are coming from or how to unset them, or if 
that's really the source of the problem (it's simply my best guess). Any 
ideas?

Here's more information on my setup and how I reproduce the problem:

I have one node (currently online). It has 48 processors:

> caleb@torqueserver:~$ qnodes
> fu48core.esl
>      state = free
>      np = 48
>      ntype = cluster
>      status = opsys=linux,uname=Linux 48core 2.6.32-25-server #45-Ubuntu SMP Sat Oct 16 20:06:58 UTC 2010 x86_64,sessions=2834 5874 12296 13555 19465 17575,nsessions=6,nusers=3,idletime=2308,totmem=82007668kb,availmem=73380372kb,physmem=82007668kb,ncpus=48,loadave=2.19,netload=24944834533,state=free,jobs=,varattr=,rectime=1311202191

It's free and presumably happy:

> caleb@torqueserver:/usr/local/maui$ checknode fu48core
>
> checking node fu48core.esl
>
> State:      Idle  (in current state for 5:15:40)
> Configured Resources: PROCS: 48  MEM: 78G  SWAP: 78G  DISK: 1M
> Utilized   Resources: SWAP: 8426M
> Dedicated  Resources: [NONE]
> Opsys:         linux  Arch:      [NONE]
> Speed:      1.00  Load:       2.240
> Network:    [DEFAULT]
> Features:   [NONE]
> Attributes: [Batch]
> Classes:    [batch 48:48][amplhack 48:48][qualnet 48:48][lightweight 48:48]
>
> Total Time: 6:19:49  Up: 6:19:49 (100.00%)  Active: 00:00:00 (0.00%)
>
> Reservations:
> NOTE:  no reservations on node

The batch queue is empty. If I submit a very basic job (I've tried more 
complicated jobs too, with specific resource requests), it gets deferred 
immediately:

> caleb@torqueserver:/usr/local/maui$ echo "sleep 30" | qsub
> 25.torqueserver.esl
> caleb@torqueserver:/usr/local/maui$ checkjob 25
> checking job 25
>
> State: Idle  EState: Deferred
> Creds:  user:caleb  group:abelian  class:batch  qos:DEFAULT
> WallTime: 00:00:00 of 1:00:00:00
> SubmitTime: Wed Jul 20 16:52:37
>   (Time Queued  Total: 00:00:31  Eligible: 00:00:00)
>
> Total Tasks: 1
>
> Req[0]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [1][ppn=1]
> NodeCount: 1
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
>
> job is deferred.  Reason:  NoResources  (cannot create reservation for job '25' (intital reservation attempt)
> )
> Holds:    Defer  (hold reason:  NoResources)
> PE:  1.00  StartPriority:  1
> cannot select job 25 for partition DEFAULT (job hold active)

If I release the job, I can see that maui's complaining about a lack of 
feasible procs due to unavailable features:

> caleb@torqueserver:/usr/local/maui$ releasehold 25
>
> job holds adjusted
> caleb@torqueserver:/usr/local/maui$ checkjob -v 25
>
>
> checking job 25 (RM job '25.torqueserver.esl')
>
> State: Idle
> Creds:  user:caleb  group:abelian  class:batch  qos:DEFAULT
> WallTime: 00:00:00 of 1:00:00:00
> SubmitTime: Wed Jul 20 16:52:37
>   (Time Queued  Total: 00:04:39  Eligible: 00:02:35)
>
> Total Tasks: 1
>
> Req[0]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [1][ppn=1]
> Exec:  ''  ExecSize: 0  ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1
> NodeAccess: SHARED
> NodeCount: 1
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 0
> PartitionMask: [ALL]
> Flags:       RESTARTABLE
>
> Messages:  cannot create reservation for job '25' (intital reservation attempt)
>
> PE:  1.00  StartPriority:  2
> job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
> idle procs:  48  feasible procs:   0
>
> Rejection Reasons: [Features     :    1]
>
> Detailed Node Availability Information:
>
> fu48core.esl             rejected : Features
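
To make the mismatch between the two outputs concrete, here is a minimal Python sketch that pulls the "Features:" fields out of the checkjob and checknode text pasted above and lists what the job requests that the node does not advertise. The helper names (bracketed_items, rejected_nodes, missing_features) are hypothetical and are not part of maui or torque. Two things are assumptions rather than documented behavior: that "[1][ppn=1]" should be read as two separate feature names, and that a node is feasible only if it advertises every requested feature.

```python
import re

# Excerpt of the `checkjob -v 25` output above, with the "> " quoting removed.
CHECKJOB_OUTPUT = """\
Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [1][ppn=1]
idle procs:  48  feasible procs:   0
Rejection Reasons: [Features     :    1]
fu48core.esl             rejected : Features
"""

# Excerpt of the `checknode fu48core` output above.
CHECKNODE_OUTPUT = """\
Features:   [NONE]
Attributes: [Batch]
"""


def bracketed_items(text, label):
    """Return the [bracketed] items following 'label:'; [NONE] means empty."""
    match = re.search(rf"{re.escape(label)}:\s*((?:\[[^\]]*\])+)", text)
    if match is None:
        return []
    items = re.findall(r"\[([^\]]*)\]", match.group(1))
    return [item for item in items if item != "NONE"]


def rejected_nodes(text):
    """Map node name to rejection reason from the per-node availability lines."""
    pattern = r"^(\S+)[ \t]+rejected[ \t]*:[ \t]*(\S+)[ \t]*$"
    return dict(re.findall(pattern, text, flags=re.MULTILINE))


def missing_features(job_features, node_features):
    """Requested features the node lacks (assumes all must be present)."""
    return [f for f in job_features if f not in node_features]


job_features = bracketed_items(CHECKJOB_OUTPUT, "Features")
node_features = bracketed_items(CHECKNODE_OUTPUT, "Features")

print("job requests:   ", job_features)
print("node advertises:", node_features)
print("missing on node:", missing_features(job_features, node_features))
print("rejected nodes: ", rejected_nodes(CHECKJOB_OUTPUT))
```

Under those assumptions, both "1" and "ppn=1" are reported as missing on fu48core.esl, which is consistent with the "rejected : Features" line. The sketch only restates the symptom and does not show where the features are being attached to the job.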

There are no error messages in the torque server_log, maui's log file, 
or the node's mom_log. In fact, my node never even sees the job since 
maui never decides to run it.

Any help you can provide would be greatly appreciated. Thanks!

--
Caleb Phillips, Ph.D. Candidate
Computer Science Department
University of Colorado, Boulder

