[Mauiusers] Mystery Features Preventing Jobs from Running

Caleb Phillips cphillips at smallwhitecube.com
Thu Jul 21 18:01:19 MDT 2011


Steve, thanks for the tip.

I've been able to temporarily resolve this by creating a new queue and 
submitting jobs there instead of the default batch queue. The default 
batch queue is still strangely nonfunctional, though...

> qsub -I -l nodes=fu48core.esl   ?

Same result. Job is held indefinitely and complains about features.
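
The feature complaint is easiest to see in the "checkjob -v" output
quoted further down. As a rough sketch, assuming the output keeps the
layout shown in that listing, a small filter can reduce it to the
requested features and the rejection lines. The name feature_summary is
made up for illustration:

```shell
# feature_summary: hypothetical helper, not part of maui.
# Reads "checkjob -v" output on stdin and keeps only the lines that
# mention requested features or node rejections.
feature_summary() {
  grep -E 'Features:|Rejection Reasons:|rejected :'
}

# Usage on the scheduler host (the job id is only an example):
#   checkjob -v 25 | feature_summary
```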

> We define features for every node. I think the reason you might be having trouble is because
>
> from:
> pbs/server_priv/nodes
>
> bh001 np=4 compute
>
> Then set a queue attribute of: resources_default.neednodes = compute

It turns out the (default) batch queue had a resources_default.neednodes
attribute set to "1:ppn=1", which matches these mystery features.

I tried your suggestion, and changed this attribute to "compute" like so:

$ qmgr -c "set queue batch resources_default.neednodes = compute"

I also added the compute feature to the nodes file. However, this didn't
fix the problem. New jobs are still created with the "1:ppn=1" features
requested by default, even though that value has been removed from the
queue configuration and the server has been restarted. I have no idea
where these features are coming from!
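
One way to check what a job actually asks for is to look at its
Resource_List entries in the "qstat -f" output rather than at the queue
definition. The sketch below is only that: it assumes the usual
"Resource_List.<name> = <value>" layout, and node_requests is a
hypothetical helper name, not a TORQUE command.

```shell
# node_requests: hypothetical helper for illustration.
# Reads "qstat -f" output on stdin and keeps only the node-related
# resource requests attached to the job.
node_requests() {
  grep -E '^[[:space:]]*Resource_List\.(neednodes|nodes|nodect|ncpus)[[:space:]]*='
}

# Usage on the TORQUE server (the job id is only an example):
#   qstat -f 25 | node_requests
```

If neednodes shows up there with no matching queue attribute, the
server-wide defaults (qmgr -c "print server") would be the next place to
look. That is a guess, not something confirmed in this thread.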

I created a new, very basic queue following the example at [1]. Jobs 
submitted to the new queue run without problem. I'm still curious what's 
up with the batch queue and why it is marking all jobs with the features 
"1:ppn=1", but for now I'm able to run jobs, so I'm happy.

Incidentally, here's the qmgr output for the queue that is not working:

> caleb at torqueserver:~$ qmgr -c "print queue batch"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch max_running = 126
> set queue batch resources_max.ncpus = 8
> set queue batch resources_max.nodes = 1
> set queue batch resources_max.walltime = 99:00:00
> set queue batch resources_min.ncpus = 1
> set queue batch resources_default.ncpus = 1
> set queue batch resources_default.nodect = 1
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 24:00:00
> set queue batch resources_available.nodect = 1
> set queue batch max_user_run = 100
> set queue batch enabled = True
> set queue batch started = True

Here's the qmgr output for the queue that is working:

> caleb at torqueserver:~$ qmgr -c "print queue foo"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue foo
> #
> create queue foo
> set queue foo queue_type = Execution
> set queue foo resources_default.nodes = 1
> set queue foo resources_default.walltime = 24:00:00
> set queue foo enabled = True
> set queue foo started = True
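
Comparing the two listings mechanically shows which attributes batch has
that foo lacks. A minimal sketch, with the attribute lines copied from
the output above; strip_prefix and only_in_batch are names chosen for
illustration:

```shell
#!/bin/sh
# Attribute lines copied from the two "print queue" listings.
batch_attrs=$(cat <<'EOF'
set queue batch queue_type = Execution
set queue batch max_running = 126
set queue batch resources_max.ncpus = 8
set queue batch resources_max.nodes = 1
set queue batch resources_max.walltime = 99:00:00
set queue batch resources_min.ncpus = 1
set queue batch resources_default.ncpus = 1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 24:00:00
set queue batch resources_available.nodect = 1
set queue batch max_user_run = 100
set queue batch enabled = True
set queue batch started = True
EOF
)
foo_attrs=$(cat <<'EOF'
set queue foo queue_type = Execution
set queue foo resources_default.nodes = 1
set queue foo resources_default.walltime = 24:00:00
set queue foo enabled = True
set queue foo started = True
EOF
)

# Drop the "set queue <name> " prefix so the two lists are comparable.
strip_prefix() { sed 's/^set queue [^ ]* //'; }

# Keep the batch settings that foo does not have with the same value
# (-F fixed strings, -x whole-line match, -v invert).
only_in_batch=$(printf '%s\n' "$batch_attrs" | strip_prefix |
  grep -vxF -e "$(printf '%s\n' "$foo_attrs" | strip_prefix)")

printf '%s\n' "$only_in_batch"
```

This leaves max_running, max_user_run, the three resources_max limits,
resources_min.ncpus, resources_default.ncpus, and the two nodect
settings as candidates. The comparison alone does not show which of
them, if any, causes the "1:ppn=1" request.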

[1] http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml#example

> for the particular queue.
>
> - From there, Maui will query torque, and know that the node bh001 has a compute feature, so when you submit a job to a queue, it should be mapped to bh001 via the node features.
>
> I'm actually not sure if you can submit jobs and have them run on nodes w/o defining node features.
>
> On Jul 20, 2011, at 6:59 PM, Caleb Phillips wrote:
>
>> Hello all:
>>
>> I'm running torque 2.3.6 (packaged with Ubuntu 10.10) and maui 3.3.1.
>> I'm having an issue where submitted jobs sit in the queue indefinitely.
>> This was occurring with pbs_sched, so I installed maui hoping it would
>> fix the problem. With maui, I have more information about the problem,
>> but no resolution. I've spent several hours searching the torqueusers
>> and mauiusers mailing lists, and reading the manuals, to no avail. I
>> hope you can help...
>>
>> As far as I can tell, maui is complaining that there are not sufficient
>> "feasible procs" for jobs to run because of a lack of "features". My
>> nodes have no features enabled, and I'm not requesting any with my jobs.
>> Yet, the jobs show up with "[1][ppn=1]" in the feature list. I don't
>> know where these features are coming from or how to unset them, or if
>> that's really the source of the problem (it's simply my best guess). Any
>> ideas?
>>
>> Here's more information on my setup and how I reproduce the problem:
>>
>> I have one node (currently online). It has 48 processors:
>>
>>> caleb at torqueserver:~$ qnodes
>>> fu48core.esl
>>>      state = free
>>>      np = 48
>>>      ntype = cluster
>>>      status = opsys=linux,uname=Linux 48core 2.6.32-25-server #45-Ubuntu SMP Sat Oct 16 20:06:58 UTC 2010 x86_64,sessions=2834 5874 12296 13555 19465 17575,nsessions=6,nusers=3,idletime=2308,totmem=82007668kb,availmem=73380372kb,physmem=82007668kb,ncpus=48,loadave=2.19,netload=24944834533,state=free,jobs=,varattr=,rectime=1311202191
>>
>> It's free and presumably happy:
>>
>>> caleb at torqueserver:/usr/local/maui$ checknode fu48core
>>>
>>> checking node fu48core.esl
>>>
>>> State:      Idle  (in current state for 5:15:40)
>>> Configured Resources: PROCS: 48  MEM: 78G  SWAP: 78G  DISK: 1M
>>> Utilized   Resources: SWAP: 8426M
>>> Dedicated  Resources: [NONE]
>>> Opsys:         linux  Arch:      [NONE]
>>> Speed:      1.00  Load:       2.240
>>> Network:    [DEFAULT]
>>> Features:   [NONE]
>>> Attributes: [Batch]
>>> Classes:    [batch 48:48][amplhack 48:48][qualnet 48:48][lightweight 48:48]
>>>
>>> Total Time: 6:19:49  Up: 6:19:49 (100.00%)  Active: 00:00:00 (0.00%)
>>>
>>> Reservations:
>>> NOTE:  no reservations on node
>>
>> The batch queue is empty. If I submit a very basic job (I've tried more
>> complicated jobs too, with specific resource requests), it gets deferred
>> immediately:
>>
>>> caleb at torqueserver:/usr/local/maui$ echo "sleep 30" | qsub
>>> 25.torqueserver.esl
>>> caleb at torqueserver:/usr/local/maui$ checkjob 25
>>> checking job 25
>>>
>>> State: Idle  EState: Deferred
>>> Creds:  user:caleb  group:abelian  class:batch  qos:DEFAULT
>>> WallTime: 00:00:00 of 1:00:00:00
>>> SubmitTime: Wed Jul 20 16:52:37
>>>   (Time Queued  Total: 00:00:31  Eligible: 00:00:00)
>>>
>>> Total Tasks: 1
>>>
>>> Req[0]  TaskCount: 1  Partition: ALL
>>> Network: [NONE]  Memory>= 0  Disk>= 0  Swap>= 0
>>> Opsys: [NONE]  Arch: [NONE]  Features: [1][ppn=1]
>>> NodeCount: 1
>>>
>>> IWD: [NONE]  Executable:  [NONE]
>>> Bypass: 0  StartCount: 0
>>> PartitionMask: [ALL]
>>> Flags:       RESTARTABLE
>>>
>>> job is deferred.  Reason:  NoResources  (cannot create reservation for job '25' (intital reservation attempt)
>>> )
>>> Holds:    Defer  (hold reason:  NoResources)
>>> PE:  1.00  StartPriority:  1
>>> cannot select job 25 for partition DEFAULT (job hold active)
>>
>> If I release the job, I can see that maui's complaining about a lack of
>> feasible procs due to unavailable features:
>>
>>> caleb at torqueserver:/usr/local/maui$ releasehold 25
>>>
>>> job holds adjusted
>>> caleb at torqueserver:/usr/local/maui$ checkjob -v 25
>>>
>>>
>>> checking job 25 (RM job '25.torqueserver.esl')
>>>
>>> State: Idle
>>> Creds:  user:caleb  group:abelian  class:batch  qos:DEFAULT
>>> WallTime: 00:00:00 of 1:00:00:00
>>> SubmitTime: Wed Jul 20 16:52:37
>>>   (Time Queued  Total: 00:04:39  Eligible: 00:02:35)
>>>
>>> Total Tasks: 1
>>>
>>> Req[0]  TaskCount: 1  Partition: ALL
>>> Network: [NONE]  Memory>= 0  Disk>= 0  Swap>= 0
>>> Opsys: [NONE]  Arch: [NONE]  Features: [1][ppn=1]
>>> Exec:  ''  ExecSize: 0  ImageSize: 0
>>> Dedicated Resources Per Task: PROCS: 1
>>> NodeAccess: SHARED
>>> NodeCount: 1
>>>
>>>
>>> IWD: [NONE]  Executable:  [NONE]
>>> Bypass: 0  StartCount: 0
>>> PartitionMask: [ALL]
>>> Flags:       RESTARTABLE
>>>
>>> Messages:  cannot create reservation for job '25' (intital reservation attempt)
>>>
>>> PE:  1.00  StartPriority:  2
>>> job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
>>> idle procs:  48  feasible procs:   0
>>>
>>> Rejection Reasons: [Features     :    1]
>>>
>>> Detailed Node Availability Information:
>>>
>>> fu48core.esl             rejected : Features
>>
>> There are no error messages in the torque server_log, maui's log file,
>> or the node's mom_log. In fact, my node never even sees the job since
>> maui never decides to run it.
>>
>> Any help you can provide would be extremely helpful. Thanks!
>>
>> --
>> Caleb Phillips, Ph.D. Candidate
>> Computer Science Department
>> University of Colorado, Boulder
>> _______________________________________________
>> mauiusers mailing list
>> mauiusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/mauiusers
>
>   ----------------------
>   Steve Crusan
>   System Administrator
>   Center for Research Computing
>   University of Rochester
>   https://www.crc.rochester.edu/
>
>