[Mauiusers] Mystery Features Preventing Jobs from Running
Steve Crusan
scrusan at ur.rochester.edu
Thu Jul 21 15:30:49 MDT 2011
What happens if you just do a simple qsub like this:
qsub -I -l nodes=fu48core.esl ?
We define features for every node. I think the reason you might be having trouble is because your node has no features defined. For example,
from:
pbs/server_priv/nodes
bh001 np=4 compute
Then set a queue attribute of: resources_default.neednodes = compute
for the particular queue.
From there, Maui will query TORQUE and learn that node bh001 has the compute feature, so a job submitted to that queue should be mapped to bh001 via the node features.
I'm actually not sure if you can submit jobs and have them run on nodes w/o defining node features.
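A minimal sketch of that setup applied to the node from the original post. It assumes the queue is named batch (the class shown in the checkjob output), that compute is just an arbitrary feature name, and that the queue attribute is spelled resources_default.neednodes, which is the form qmgr uses as far as I know. The nodes file is normally only read when pbs_server starts, so a restart is likely needed after editing it.

```sh
# TORQUE nodes file (pbs/server_priv/nodes under the TORQUE home directory):
# one line per node, with feature names listed after the processor count.
# The entry for the node in the original post would look like:
#   fu48core.esl np=48 compute

# Make "compute" the default node requirement for jobs in the batch queue
# (queue name assumed from the checkjob output).
qmgr -c "set queue batch resources_default.neednodes = compute"

# Confirm what the resource manager and Maui now report for the node.
pbsnodes fu48core.esl
checknode fu48core.esl
```

If this takes effect, checknode should list compute under Features instead of [NONE].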
On Jul 20, 2011, at 6:59 PM, Caleb Phillips wrote:
> Hello all:
>
> I'm running torque 2.3.6 (packaged with Ubuntu 10.10) and maui 3.3.1.
> I'm having an issue where submitted jobs sit in the queue indefinitely.
> This was occurring with pbs_sched, so I installed maui hoping it would
> fix the problem. With maui, I have more information about the problem,
> but no resolution. I've spent several hours searching the torqueusers
> and mauiusers mailing lists, and reading the manuals, to no avail. I
> hope you can help...
>
> As far as I can tell, maui is complaining that there are not sufficient
> "feasible procs" for jobs to run because of a lack of "features". My
> nodes have no features enabled, and I'm not requesting any with my jobs.
> Yet, the jobs show up with "[1][ppn=1]" in the feature list. I don't
> know where these features are coming from or how to unset them, or if
> that's really the source of the problem (it's simply my best guess). Any
> ideas?
>
> Here's more information on my setup and how I reproduce the problem:
>
> I have one node (currently online). It has 48 processors:
>
>> caleb@torqueserver:~$ qnodes
>> fu48core.esl
>> state = free
>> np = 48
>> ntype = cluster
>> status = opsys=linux,uname=Linux 48core 2.6.32-25-server #45-Ubuntu SMP Sat Oct 16 20:06:58 UTC 2010 x86_64,sessions=2834 5874 12296 13555 19465 17575,nsessions=6,nusers=3,idletime=2308,totmem=82007668kb,availmem=73380372kb,physmem=82007668kb,ncpus=48,loadave=2.19,netload=24944834533,state=free,jobs=,varattr=,rectime=1311202191
>
> It's free and presumably happy:
>
>> caleb@torqueserver:/usr/local/maui$ checknode fu48core
>>
>> checking node fu48core.esl
>>
>> State: Idle (in current state for 5:15:40)
>> Configured Resources: PROCS: 48 MEM: 78G SWAP: 78G DISK: 1M
>> Utilized Resources: SWAP: 8426M
>> Dedicated Resources: [NONE]
>> Opsys: linux Arch: [NONE]
>> Speed: 1.00 Load: 2.240
>> Network: [DEFAULT]
>> Features: [NONE]
>> Attributes: [Batch]
>> Classes: [batch 48:48][amplhack 48:48][qualnet 48:48][lightweight 48:48]
>>
>> Total Time: 6:19:49 Up: 6:19:49 (100.00%) Active: 00:00:00 (0.00%)
>>
>> Reservations:
>> NOTE: no reservations on node
>
> The batch queue is empty. If I submit a very basic job (I've tried more
> complicated jobs too, with specific resource requests), it gets deferred
> immediately:
>
>> caleb@torqueserver:/usr/local/maui$ echo "sleep 30" | qsub
>> 25.torqueserver.esl
>> caleb@torqueserver:/usr/local/maui$ checkjob 25
>> checking job 25
>>
>> State: Idle EState: Deferred
>> Creds: user:caleb group:abelian class:batch qos:DEFAULT
>> WallTime: 00:00:00 of 1:00:00:00
>> SubmitTime: Wed Jul 20 16:52:37
>> (Time Queued Total: 00:00:31 Eligible: 00:00:00)
>>
>> Total Tasks: 1
>>
>> Req[0] TaskCount: 1 Partition: ALL
>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
>> Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1]
>> NodeCount: 1
>>
>> IWD: [NONE] Executable: [NONE]
>> Bypass: 0 StartCount: 0
>> PartitionMask: [ALL]
>> Flags: RESTARTABLE
>>
>> job is deferred. Reason: NoResources (cannot create reservation for job '25' (intital reservation attempt)
>> )
>> Holds: Defer (hold reason: NoResources)
>> PE: 1.00 StartPriority: 1
>> cannot select job 25 for partition DEFAULT (job hold active)
>
> If I release the job, I can see that maui's complaining about a lack of
> feasible procs due to unavailable features:
>
>> caleb@torqueserver:/usr/local/maui$ releasehold 25
>>
>> job holds adjusted
>> caleb@torqueserver:/usr/local/maui$ checkjob -v 25
>>
>>
>> checking job 25 (RM job '25.torqueserver.esl')
>>
>> State: Idle
>> Creds: user:caleb group:abelian class:batch qos:DEFAULT
>> WallTime: 00:00:00 of 1:00:00:00
>> SubmitTime: Wed Jul 20 16:52:37
>> (Time Queued Total: 00:04:39 Eligible: 00:02:35)
>>
>> Total Tasks: 1
>>
>> Req[0] TaskCount: 1 Partition: ALL
>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
>> Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1]
>> Exec: '' ExecSize: 0 ImageSize: 0
>> Dedicated Resources Per Task: PROCS: 1
>> NodeAccess: SHARED
>> NodeCount: 1
>>
>>
>> IWD: [NONE] Executable: [NONE]
>> Bypass: 0 StartCount: 0
>> PartitionMask: [ALL]
>> Flags: RESTARTABLE
>>
>> Messages: cannot create reservation for job '25' (intital reservation attempt)
>>
>> PE: 1.00 StartPriority: 2
>> job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
>> idle procs: 48 feasible procs: 0
>>
>> Rejection Reasons: [Features : 1]
>>
>> Detailed Node Availability Information:
>>
>> fu48core.esl rejected : Features
>
> There are no error messages in the torque server_log, maui's log file,
> or the node's mom_log. In fact, my node never even sees the job since
> maui never decides to run it.
>
> Any help you can provide would be extremely helpful. Thanks!
>
> --
> Caleb Phillips, Ph.D. Candidate
> Computer Science Department
> University of Colorado, Boulder
----------------------
Steve Crusan
System Administrator
Center for Research Computing
University of Rochester
https://www.crc.rochester.edu/