[torqueusers] torque/maui problem - deferred jobs "Invalid request (15004)"
Austin Godber
godber at mars.asu.edu
Thu Dec 22 13:13:46 MST 2005
Hello,
I have a Linux cluster that is half Athlons and half Opterons. I have
defined some of my queues to restrict jobs to either Athlon or Opteron
nodes, using a definition like this:
create queue low-o
set queue low-o queue_type = Execution
set queue low-o Priority = 250
set queue low-o max_running = 80
set queue low-o resources_default.neednodes = linux-opt32
set queue low-o acl_group_enable = True
set queue low-o acl_groups = -ecu
set queue low-o acl_groups += +
set queue low-o enabled = True
set queue low-o started = True
I have also defined the following in maui.cfg:
CLASSWEIGHT 1
CLASSCFG[asap-a] MAXPROC=80,100
CLASSCFG[high-a] MAXPROC=50,100
CLASSCFG[normal-a] MAXPROC=30,100
CLASSCFG[low-a] MAXPROC=20,100
CLASSCFG[asap-o] MAXPROC=50,80
CLASSCFG[high-o] MAXPROC=40,80
CLASSCFG[normal-o] MAXPROC=30,80
CLASSCFG[low-o] MAXPROC=20,80
CLASSCFG[workq] MAXPROC=50,200
CLASSCFG[x86_64] MAXPROC=21,22
CLASSCFG[horus] MAXPROC=74,75
CLASSCFG[ecu] MAXPROC=27,28
CLASSCFG[sqamar] MAXPROC=23,24
CLASSCFG[isisweb] MAXPROC=74,75
At the moment we are having problems with jobs sent to our Opteron
nodes. When users submit a couple hundred jobs, only the first 30 or so
complete and the rest end up Deferred. The number that finishes seems
to roughly match the soft MAXPROC limit for each queue. Once the first
group of jobs completes, I can run "releasehold -a" on the remaining
jobs, and the next batch of 30 or so will run, leaving the rest in the
Deferred state again.
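To make the comparison with the soft limits concrete, here is a minimal
Python sketch that pulls the MAXPROC pairs for the Opteron classes out
of the CLASSCFG lines quoted above. It assumes the two numbers are the
soft and hard limits, in that order; the function name parse_maxproc
and the regular expression are illustrative only and are not part of
Maui or TORQUE.

```python
import re

# Opteron class lines copied from the maui.cfg excerpt above.
MAUI_CFG = """\
CLASSCFG[asap-o] MAXPROC=50,80
CLASSCFG[high-o] MAXPROC=40,80
CLASSCFG[normal-o] MAXPROC=30,80
CLASSCFG[low-o] MAXPROC=20,80
"""

# Assumption: MAXPROC=<soft>,<hard>. Lines in any other shape are skipped.
CLASSCFG_RE = re.compile(
    r"^CLASSCFG\[(?P<cls>[^\]]+)\]\s+MAXPROC=(?P<soft>\d+),(?P<hard>\d+)$"
)


def parse_maxproc(cfg_text):
    """Return {class_name: (soft_limit, hard_limit)} for matching lines."""
    limits = {}
    for line in cfg_text.splitlines():
        match = CLASSCFG_RE.match(line.strip())
        if match:
            limits[match.group("cls")] = (
                int(match.group("soft")),
                int(match.group("hard")),
            )
    return limits


limits = parse_maxproc(MAUI_CFG)
for cls, (soft, hard) in limits.items():
    print(f"{cls}: soft={soft} hard={hard}")
```

For the normal-o class used by the job shown in the checkjob output,
this gives a soft limit of 30, which is the batch size I keep seeing.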
The checkjob output for a deferred job looks like this:
---------------------------------------------------------------------
State: Idle EState: Deferred
Creds: user:dombovar group:thmops class:normal-o qos:DEFAULT
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Thu Dec 22 10:55:34
(Time Queued Total: 2:15:23 Eligible: 2:14:46)
StartDate: -00:00:36 Thu Dec 22 13:10:21
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [linux-opt32]
NodeCount: 1
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 158
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: RMFailure (cannot start job - RM failure,
rc: 15041, msg: 'Execution server rejected request MSG=send failed,
STARTING')
Holds: Defer (hold reason: RMFailure)
PE: 1.00 StartPriority: 634
cannot select job 119834 for partition DEFAULT (job hold active)
---------------------------------------------------------------------
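With a couple hundred jobs in this state, it can help to pull out just
the fields that matter. The following is a rough Python sketch, assuming
the checkjob text has the same layout as the excerpt above; the function
name summarize_deferral and the patterns are my own illustration, not
anything provided by Maui.

```python
import re

# Relevant lines copied from the checkjob output above.
CHECKJOB_EXCERPT = """\
State: Idle EState: Deferred
job is deferred. Reason: RMFailure (cannot start job - RM failure,
rc: 15041, msg: 'Execution server rejected request MSG=send failed,
STARTING')
Holds: Defer (hold reason: RMFailure)
"""


def summarize_deferral(text):
    """Extract expected state, defer reason, rc and message, if present."""
    # Collapse line wraps so fields split across lines can be matched.
    flat = " ".join(text.split())
    estate = re.search(r"EState:\s*(\w+)", flat)
    reason = re.search(r"Reason:\s*(\w+)", flat)
    rc = re.search(r"rc:\s*(\d+)", flat)
    msg = re.search(r"msg:\s*'([^']*)'", flat)
    return {
        "estate": estate.group(1) if estate else None,
        "reason": reason.group(1) if reason else None,
        "rc": int(rc.group(1)) if rc else None,
        "msg": msg.group(1) if msg else None,
    }


summary = summarize_deferral(CHECKJOB_EXCERPT)
print(summary)
```

Running this over the saved output of each held job would show whether
they all report the same reason and return code.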
And I get errors in torque/server_logs/20051222 like this:
Invalid request (15004) in send_job, child failed in previous commit
request for job
Any clues? The only difference between the Opteron queues (where this
happens) and the Athlon queues (where it doesn't) is this line:
set queue low-o resources_default.neednodes = linux-opt32
versus
set queue low-a resources_default.neednodes = linux-ath32
Thanks,
Austin