[torqueusers] torque/maui problem - defered jobs "Invalid request (15004)"

Austin Godber godber at mars.asu.edu
Thu Dec 22 13:13:46 MST 2005


Hello,
	I have a Linux cluster that is half Athlons and half Opterons.  I have
defined some of my queues to restrict jobs to either the Athlon or the
Opteron nodes, using a definition like this:

create queue low-o
set queue low-o queue_type = Execution
set queue low-o Priority = 250
set queue low-o max_running = 80
set queue low-o resources_default.neednodes = linux-opt32
set queue low-o acl_group_enable = True
set queue low-o acl_groups = -ecu
set queue low-o acl_groups += +
set queue low-o enabled = True
set queue low-o started = True


I have also defined the following in maui.cfg:

CLASSWEIGHT     1
CLASSCFG[asap-a]        MAXPROC=80,100
CLASSCFG[high-a]        MAXPROC=50,100
CLASSCFG[normal-a]      MAXPROC=30,100
CLASSCFG[low-a]         MAXPROC=20,100
CLASSCFG[asap-o]        MAXPROC=50,80
CLASSCFG[high-o]        MAXPROC=40,80
CLASSCFG[normal-o]      MAXPROC=30,80
CLASSCFG[low-o]         MAXPROC=20,80
CLASSCFG[workq]         MAXPROC=50,200
CLASSCFG[x86_64]        MAXPROC=21,22
CLASSCFG[horus]         MAXPROC=74,75
CLASSCFG[ecu]           MAXPROC=27,28
CLASSCFG[sqamar]        MAXPROC=23,24
CLASSCFG[isisweb]       MAXPROC=74,75

At the moment we are having some problems with jobs sent to our Opteron
nodes.  It appears that when users submit a couple hundred jobs, only the
first 30 or so complete and the rest end up Deferred.  The exact number
that finishes seems to be about the soft MAXPROC limit for each queue.
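
To make the comparison with the soft limit concrete, here is a minimal
Python sketch that pulls the two MAXPROC values per class out of CLASSCFG
lines like the ones above.  It assumes the "MAXPROC=soft,hard" reading
used in this message; the function name parse_maxproc and the inlined
sample text are only for illustration and are not part of Maui.

```python
import re

# Sample lines copied from the maui.cfg excerpt above (Opteron classes only).
MAUI_CFG = """\
CLASSCFG[asap-o]        MAXPROC=50,80
CLASSCFG[high-o]        MAXPROC=40,80
CLASSCFG[normal-o]      MAXPROC=30,80
CLASSCFG[low-o]         MAXPROC=20,80
"""

# Matches: CLASSCFG[<class>] ... MAXPROC=<first>[,<second>]
_CLASSCFG_RE = re.compile(
    r"^CLASSCFG\[([^\]]+)\]\s+.*?MAXPROC=(\d+)(?:,(\d+))?", re.M
)


def parse_maxproc(text):
    """Return {class_name: (soft, hard)}; hard is None if only one value is given."""
    limits = {}
    for name, soft, hard in _CLASSCFG_RE.findall(text):
        limits[name] = (int(soft), int(hard) if hard else None)
    return limits


if __name__ == "__main__":
    for name, (soft, hard) in parse_maxproc(MAUI_CFG).items():
        print(f"{name}: soft={soft} hard={hard}")
```

For the normal-o class this gives a first value of 30, which lines up
with the "first 30 or so" jobs that complete before the rest are held.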

Once the first group of jobs completes, I can "releasehold -a" the
remaining jobs ... and the next batch of 30 or so will run, leaving the
rest in the deferred state.

Checkjob output looks like this for a deferred job:

---------------------------------------------------------------------
State: Idle  EState: Deferred
Creds:  user:dombovar  group:thmops  class:normal-o  qos:DEFAULT
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Thu Dec 22 10:55:34
  (Time Queued  Total: 2:15:23  Eligible: 2:14:46)

StartDate: -00:00:36  Thu Dec 22 13:10:21
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [linux-opt32]
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 158
PartitionMask: [ALL]
Flags:       RESTARTABLE

job is deferred.  Reason:  RMFailure  (cannot start job - RM failure,
rc: 15041, msg: 'Execution server rejected request MSG=send failed,
STARTING')
Holds:    Defer  (hold reason:  RMFailure)
PE:  1.00  StartPriority:  634
cannot select job 119834 for partition DEFAULT (job hold active)
---------------------------------------------------------------------
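
When a couple hundred jobs end up in this state, it can help to reduce
each checkjob report to the few fields that matter.  The following is a
rough Python sketch that does this for text shaped like the output above.
The function name summarize_checkjob and the regular expressions are
assumptions based only on this one sample, not on any documented checkjob
format.

```python
import re

# Excerpt copied from the checkjob output above.
CHECKJOB_OUTPUT = """\
State: Idle  EState: Deferred
Creds:  user:dombovar  group:thmops  class:normal-o  qos:DEFAULT
Bypass: 0  StartCount: 158
job is deferred.  Reason:  RMFailure  (cannot start job - RM failure,
rc: 15041, msg: 'Execution server rejected request MSG=send failed,
STARTING')
Holds:    Defer  (hold reason:  RMFailure)
"""


def summarize_checkjob(text):
    """Extract a few fields from checkjob text; missing fields come back as None."""

    def first(pattern):
        match = re.search(pattern, text)
        return match.group(1) if match else None

    rc = first(r"rc:\s*(\d+)")
    start_count = first(r"StartCount:\s*(\d+)")
    return {
        "estate": first(r"EState:\s*(\S+)"),
        "class": first(r"class:(\S+)"),
        "hold_reason": first(r"hold reason:\s*(\w+)"),
        "rc": int(rc) if rc is not None else None,
        "start_count": int(start_count) if start_count is not None else None,
    }


if __name__ == "__main__":
    print(summarize_checkjob(CHECKJOB_OUTPUT))
```

Grouping the summaries by class and rc would show whether every deferred
job fails the same way or only those in particular queues.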

And I get errors in torque/server_logs/20051222 like this:

Invalid request (15004) in send_job, child failed in previous commit
request for job


Any clues?  The only difference between the Opteron queues (where this
happens) and the Athlon queues (where it doesn't) is this line:
	set queue low-o resources_default.neednodes = linux-opt32
versus
	set queue low-a resources_default.neednodes = linux-ath32


Thanks,
Austin

