Maui Admin - Wiki Interface Specification
Wiki Interface Specification, version 1.1
COMMANDS:
All commands are requested via a socket interface,
one command per socket connection. All fields and values are specified
in ASCII text. Maui is configured to communicate via the wiki interface
by specifying the following parameters in the maui.cfg file:
RMTYPE[X] WIKI
RMSERVER[X] <HOSTNAME>
RMPORT[X] <PORTNUMBER>
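The one-command-per-connection rule can be sketched as a small client. This is a minimal sketch, not Maui's own client code: the function name `send_wiki_command` is hypothetical, and because this specification does not describe message framing, the sketch assumes the request ends when the client closes its write side and the reply ends when the server closes the connection.

```python
import socket


def send_wiki_command(host: str, port: int, command: str, timeout: float = 10.0) -> str:
    """Open a fresh connection, send one ASCII command, return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        # Assumption: closing the write side marks the end of the request.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # assumption: the server closes when the reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii")
```

A caller would pass the values configured as RMSERVER[X] and RMPORT[X], for example `send_wiki_command("rmhost", 12345, "CMD=GETNODES ARG=0:ALL")`, where the host name and port number are placeholders.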
Field values must backslash-escape the following characters wherever they appear:
'#' ';' ':'
(e.g., '\#')
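The escaping rule above can be sketched as a pair of helpers. The names `escape_value` and `unescape_value` are hypothetical, and the sketch assumes the backslash itself is not escaped, since the specification lists only the three characters above.

```python
import re


def escape_value(value: str) -> str:
    """Prefix each '#', ';' and ':' in a field value with a backslash."""
    return "".join("\\" + ch if ch in "#;:" else ch for ch in value)


def unescape_value(value: str) -> str:
    """Reverse escape_value by dropping the backslash before '#', ';' or ':'."""
    return re.sub(r"\\([#;:])", r"\1", value)
```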
Supported Commands are:
GETNODES,
GETJOBS, STARTJOB, CANCELJOB,
SUSPENDJOB, RESUMEJOB, JOBADDTASK, JOBRELEASETASK
GetNodes
send
CMD=GETNODES ARG={<UPDATETIME>:<NODEID>[:<NODEID>]... | <UPDATETIME>:ALL}
Only nodes updated more recently than <UPDATETIME> are returned, where
<UPDATETIME> is the epoch time of interest. Setting <UPDATETIME> to '0'
returns information for all nodes. Specify a colon-delimited list of NODEIDs
to request specific nodes, or use the keyword 'ALL' to receive information
for all nodes.
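The request syntax above can be assembled with a small helper. This is a sketch only; `build_getnodes` is a hypothetical name, and node IDs are assumed to contain none of the characters that require escaping.

```python
def build_getnodes(update_time: int = 0, node_ids=None) -> str:
    """Build a GETNODES request; with no node IDs, the keyword ALL is used."""
    target = ":".join(node_ids) if node_ids else "ALL"
    return f"CMD=GETNODES ARG={update_time}:{target}"
```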
receive
SC=<STATUSCODE> ARG=<NODECOUNT>#<NODEID>:<FIELD>=<VALUE>;[<FIELD>=<VALUE>;]...[#<NODEID>:<FIELD>=<VALUE>;[<FIELD>=<VALUE>;]...]...
or
SC=<STATUSCODE> RESPONSE=<RESPONSE>
STATUSCODE values:
0 SUCCESS
-1 INTERNAL ERROR
FIELD is either the text name listed below or 'A<FIELDNUM>'
(e.g., 'UPDATETIME' or 'A2')
RESPONSE is a statuscode-sensitive message describing error or state details
EXAMPLE:
send 'CMD=GETNODES ARG=0:node001:node002:node003'
receive 'SC=0 ARG=4#node001:UPDATETIME=963004212;STATE=Busy;OS=AIX43;ARCH=RS6000;...'
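A reply in the format above can be decoded roughly as follows. This is a sketch under stated assumptions: `parse_object_reply` and its helpers are hypothetical names, a delimiter counts as escaped only when directly preceded by a backslash, and the leading count is read but not checked. Colons inside a value (such as a colon-delimited CNET list) are kept as part of the value.

```python
import re


def _split_unescaped(text, delim, maxsplit=0):
    """Split on delim unless it is directly preceded by a backslash."""
    return re.split(r"(?<!\\)" + re.escape(delim), text, maxsplit=maxsplit)


def _unescape(text):
    """Drop the backslash in front of '#', ';' and ':'."""
    return re.sub(r"\\([#;:])", r"\1", text)


def parse_object_reply(reply):
    """Return (status, {id: {field: value}}) or (status, response_text)."""
    match = re.fullmatch(r"SC=(-?\d+) (ARG|RESPONSE)=(.*)", reply, re.DOTALL)
    if match is None:
        raise ValueError(f"unrecognized reply: {reply!r}")
    status = int(match.group(1))
    if match.group(2) == "RESPONSE":
        return status, match.group(3)
    # First '#'-separated item is the object count; the rest are records.
    _count, *records = _split_unescaped(match.group(3), "#")
    objects = {}
    for record in records:
        object_id, body = _split_unescaped(record, ":", maxsplit=1)
        fields = {}
        for item in _split_unescaped(body, ";"):
            if item:  # skip the empty piece after the trailing ';'
                name, _, value = item.partition("=")
                fields[name] = _unescape(value)
        objects[_unescape(object_id)] = fields
    return status, objects
```

The same function applies to GETJOBS replies, since both commands share the `<ID>:<FIELD>=<VALUE>;` record layout.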
Field Values
| INDEX | NAME | FORMAT | DEFAULT | DESCRIPTION |
| 1 | UPDATETIME* | <EPOCHTIME> | 0 | time node information was last updated |
| 2 | STATE* | one of the following: Idle, Running, Busy, Unknown, Draining, or Down | Down | state of the node |
| 3 | OS | <STRING> | [NONE] | operating system running on node |
| 4 | ARCH | <STRING> | [NONE] | compute architecture of node |
| 5 | CMEMORY | <INTEGER> | 0 | configured RAM on node (in MB) |
| 6 | AMEMORY | <INTEGER> | 0 | available/free RAM on node (in MB) |
| 7 | CSWAP | <INTEGER> | 0 | configured swap on node (in MB) |
| 8 | ASWAP | <INTEGER> | 0 | available swap on node (in MB) |
| 9 | CDISK | <INTEGER> | 0 | configured local disk on node (in MB) |
| 10 | ADISK | <INTEGER> | 0 | available local disk on node (in MB) |
| 11 | CPROC | <INTEGER> | 1 | configured processors on node |
| 12 | APROC | <INTEGER> | 1 | available processors on node |
| 13 | CNET | one or more colon delimited <STRING>s (e.g., ETHER:FDDI:ATM) | [NONE] | configured network interfaces on node |
| 14 | ANET | one or more colon delimited <STRING>s (e.g., ETHER:ATM) | [NONE] | Available network interfaces on node. Available interfaces are those which are 'up' and not already dedicated to a job. |
| 15 | CPULOAD | <DOUBLE> | 0.0 | one minute BSD load average |
| 16 | CCLASS | one or more bracket enclosed <NAME>:<COUNT> pairs (e.g., [batch:5][sge:3]) | [NONE] | Run classes supported by node. Typically, one class is 'consumed' per task. Thus, an 8 processor node may have 8 instances of each class it supports present, e.g., [batch:8][interactive:8] |
| 17 | ACLASS | one or more bracket enclosed <NAME>:<COUNT> pairs (e.g., [batch:5][sge:3]) | [NONE] | Run classes currently available on node. If not specified, scheduler will attempt to determine actual ACLASS value. |
| 18 | FEATURE | one or more colon delimited <STRING>s (e.g., WIDE:HSM) | [NONE] | generic attributes, often describing hardware or software features, associated with the node |
| 19 | PARTITION | <STRING> | DEFAULT | partition to which node belongs |
| 20 | EVENT | <STRING> | [NONE] | Event or exception which occurred on the node |
| 21 | CURRENTTASK | <INTEGER> | 0 | Number of tasks currently active on the node |
| 22 | MAXTASK | <INTEGER> | <CPROC> | Maximum number of tasks allowed on the node at any given time |
| 23 | SPEED | <DOUBLE> | 1.0 | Relative processor speed of the node |
| 24 | FRAME | <INTEGER> | 0 | Frame location of the node |
| 25 | SLOT | <INTEGER> | 0 | Slot location of the node |
| 26 | CRES | one or more colon delimited <NAME>,<VALUE> pairs (e.g., MATLAB,6:COMPILER,100) | [NONE] | Arbitrary consumable resources supported and tracked on the node, e.g., software licenses or tape drives |
| 27 | ARES | one or more colon delimited <NAME>,<VALUE> pairs (e.g., MATLAB,6:COMPILER,100) | [NONE] | Arbitrary consumable resources currently available on the node |
* indicates required field
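Two of the composite formats in the node field table, the bracketed class lists (CCLASS, ACLASS) and the consumable resource lists (CRES, ARES), can be decoded as sketched below. The function names are hypothetical, counts are assumed to be non-negative integers, and treating the literal '[NONE]' as an empty value is an assumption, since the table lists it only as a default.

```python
import re


def parse_classes(text: str) -> dict:
    """Decode '[batch:5][sge:3]' into {'batch': 5, 'sge': 3}."""
    pairs = re.findall(r"\[([^\[\]:]+):(\d+)\]", text)
    return {name: int(count) for name, count in pairs}


def parse_resources(text: str) -> dict:
    """Decode 'MATLAB,6:COMPILER,100' into {'MATLAB': 6, 'COMPILER': 100}."""
    if text in ("", "[NONE]"):  # assumption: '[NONE]' means no resources
        return {}
    resources = {}
    for pair in text.split(":"):
        name, _, value = pair.partition(",")
        resources[name] = int(value)
    return resources
```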
NOTE 1: node states have the following definitions:
Idle: Node is ready to run jobs but currently is not running any.
Running: Node is running some jobs and will accept additional jobs.
Busy: Node is running some jobs and will not accept additional jobs.
Unknown: Node is capable of running jobs but the scheduler will need to determine if the node state is actually Idle, Running, or Busy.
Draining: Node is responding but will not accept new jobs.
Down: Resource Manager problems have been detected. Node is incapable of running jobs.
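The node state definitions can be condensed into a lookup that answers whether a node will take more work. This is one possible reading, not part of the specification: `accepts_new_jobs` is a hypothetical helper, and 'Unknown' maps to None because the scheduler still has to determine the real state.

```python
from typing import Optional

# True: accepts additional jobs; False: does not; None: undetermined.
ACCEPTS_JOBS = {
    "Idle": True,
    "Running": True,
    "Busy": False,
    "Draining": False,
    "Down": False,
    "Unknown": None,
}


def accepts_new_jobs(state: str) -> Optional[bool]:
    """Look up whether a node in the given state accepts additional jobs."""
    return ACCEPTS_JOBS[state]
```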
GetJobs
send
CMD=GETJOBS ARG={<UPDATETIME>:<JOBID>[:<JOBID>]... | <UPDATETIME>:ALL}
Only jobs updated more recently than <UPDATETIME> are returned, where
<UPDATETIME> is the epoch time of interest. Setting <UPDATETIME> to '0'
returns information for all jobs. Specify a colon-delimited list of JOBIDs
to request specific jobs, or use the keyword 'ALL' to receive information
about all jobs.
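Because only jobs updated after <UPDATETIME> are returned, a caller can poll incrementally by remembering the newest UPDATETIME it has seen. The class below is a sketch of that idea; `JobPoller` is a hypothetical name and the input dictionary shape (JOBID mapped to field/value pairs) is an assumption about how replies were decoded.

```python
class JobPoller:
    """Tracks the newest UPDATETIME seen so later polls request only changes."""

    def __init__(self) -> None:
        self.last_update = 0  # '0' requests information for all jobs

    def next_request(self) -> str:
        """Build the next GETJOBS request from the remembered timestamp."""
        return f"CMD=GETJOBS ARG={self.last_update}:ALL"

    def record(self, jobs: dict) -> None:
        """Remember the largest UPDATETIME among decoded job records."""
        for fields in jobs.values():
            self.last_update = max(self.last_update, int(fields["UPDATETIME"]))
```

Since the comparison is "more recently than", a job updated within the same second as the remembered timestamp may be missed; a cautious caller could subtract a second before the next request.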
receive
SC=<STATUSCODE> ARG=<JOBCOUNT>#<JOBID>:<FIELD>=<VALUE>;[<FIELD>=<VALUE>;]...[#<JOBID>:<FIELD>=<VALUE>;[<FIELD>=<VALUE>;]...]...
or
SC=<STATUSCODE> RESPONSE=<RESPONSE>
FIELD is either the text name listed below or 'A<FIELDNUM>'
(e.g., 'UPDATETIME' or 'A2')
STATUSCODE values:
0 SUCCESS
-1 INTERNAL ERROR
RESPONSE is a statuscode-sensitive message describing error or state details
EXAMPLE:
send 'CMD=GETJOBS ARG=0:ALL'
receive 'SC=0 ARG=2#nebo3001.0:UPDATETIME=9780000320;STATE=Idle;WCLIMIT=3600;...'
Table of Job Field Values
| INDEX | NAME | FORMAT | DEFAULT | DESCRIPTION |
| 1 | UPDATETIME* | <EPOCHTIME> | 0 | Time job was last updated |
| 2 | STATE* | one of Idle, Running, Hold, Suspended, Completed, or Removed | Idle | State of job |
| 3 | WCLIMIT* | <INTEGER> | 864000 | Seconds of wall time required by job |
| 4 | TASKS* | <INTEGER> | 1 | Number of tasks required by job (See Task Definition for more info) |
| 5 | NODES | <INTEGER> | 1 | Number of nodes required by job (See Node Definition for more info) |
| 6 | GEOMETRY | <STRING> | [NONE] | String describing task geometry required by job |
| 7 | QUEUETIME* | <EPOCHTIME> | 0 | time job was submitted to resource manager |
| 8 | STARTDATE | <EPOCHTIME> | 0 | earliest time job should be allowed to start |
| 9 | STARTTIME* | <EPOCHTIME> | 0 | time job was started by the resource manager |
| 10 | COMPLETETIME* | <EPOCHTIME> | 0 | time job completed execution |
| 11 | UNAME* | <STRING> | [NONE] | UserID under which job will run |
| 12 | GNAME* | <STRING> | [NONE] | GroupID under which job will run |
| 13 | ACCOUNT | <STRING> | [NONE] | AccountID associated with job |
| 14 | RFEATURES | colon delimited list of <STRING>s | [NONE] | List of features required on nodes |
| 15 | RNETWORK | <STRING> | [NONE] | network adapter required by job |
| 16 | DNETWORK | <STRING> | [NONE] | network adapter which must be dedicated to job |
| 17 | RCLASS | list of bracket enclosed <STRING>:<INTEGER> pairs | [NONE] | list of <CLASSNAME>:<COUNT> pairs indicating type and number of class instances required per task (e.g., '[batch:1]' or '[batch:2][tape:1]') |
| 18 | ROPSYS | <STRING> | [NONE] | operating system required by job |
| 19 | RARCH | <STRING> | [NONE] | architecture required by job |
| 20 | RMEM | <INTEGER> | 0 | real memory (RAM, in MB) required to be configured on nodes allocated to the job |
| 21 | RMEMCMP | one of '>=', '>', '==', '<', or '<=' | >= | real memory comparison (e.g., node must have >= 512 MB RAM) |
| 22 | DMEM | <INTEGER> | 0 | quantity of real memory (RAM, in MB) which must be dedicated to each task of the job |
| 23 | RDISK | <INTEGER> | 0 | local disk space (in MB) required to be configured on nodes allocated to the job |
| 24 | RDISKCMP | one of '>=', '>', '==', '<', or '<=' | >= | local disk comparison (e.g., node must have > 2048 MB local disk) |
| 25 | DDISK | <INTEGER> | 0 | quantity of local disk space (in MB) which must be dedicated to each task of the job |
| 26 | RSWAP | <INTEGER> | 0 | virtual memory (swap, in MB) required to be configured on nodes allocated to the job |
| 27 | RSWAPCMP | one of '>=', '>', '==', '<', or '<=' | >= | virtual memory comparison (e.g., node must have == 4096 MB virtual memory) |
| 28 | DSWAP | <INTEGER> | 0 | quantity of virtual memory (swap, in MB) which must be dedicated to each task of the job |
| 29 | PARTITIONMASK | one or more colon delimited <STRING>s | [ANY] | list of partitions in which job can run |
| 30 | EXEC | <STRING> | [NONE] | job executable command |
| 31 | ARGS | <STRING> | [NONE] | job command-line arguments |
| 32 | IWD | <STRING> | [NONE] | job's initial working directory |
| 33 | COMMENT | <STRING> | 0 | general job attributes not described by other fields |
| 34 | REJCOUNT | <INTEGER> | 0 | number of times job was rejected |
| 35 | REJMESSAGE | <STRING> | [NONE] | text description of reason job was rejected |
| 36 | REJCODE | <INTEGER> | 0 | reason job was rejected |
| 37 | EVENT | <EVENT> | [NONE] | event or exception experienced by job |
| 38 | TASKLIST | one or more colon delimited <STRING>s | [NONE] | Node ID associated with each active task of job (e.g., cl01, cl02, cl01, cl02, cl03). The tasklist is initially selected by the scheduler at the time the StartJob command is issued. The resource manager is then responsible for starting the job on these nodes and maintaining this task distribution information throughout the life of the job. In Maui 3.x, a job's tasklist is static throughout the life of the job. |
| 39 | TASKPERNODE | <INTEGER> | 0 | exact number of tasks required per node |
| 40 | QOS | <INTEGER> | 0 | quality of service requested |
| 41 | ENDDATE | <EPOCHTIME> | [ANY] | time by which job must complete |
| 42 | DPROCS | <INTEGER> | 1 | number of processors dedicated per task |
| 43 | HOSTLIST | comma or colon delimited list of hostnames | [ANY] | list of required hosts on which job must run (NOTE: WikiSpec 1.0 only supports colon delimited hosts) |
| 44 | SUSPENDTIME | <INTEGER> | 0 | Number of seconds job has been suspended |
| 45 | RESACCESS | <STRING> | [NONE] | Name of reservation in which job must run |
| 46 | NAME | <STRING> | [NONE] | User specified name of job |
| 47 | ENV | <STRING> | [NONE] | job environment variables |
| 48 | INPUT | <STRING> | [NONE] | file containing STDIN |
| 49 | OUTPUT | <STRING> | [NONE] | file to contain STDOUT |
| 50 | ERROR | <STRING> | [NONE] | file to contain STDERR |
| 51 | FLAGS | <STRING> | [NONE] | job flags |
* indicates required field
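Since a job field may arrive under its text name or as 'A<FIELDNUM>', a consumer may want to normalize names before checking for the required (starred) fields. The sketch below assumes that <FIELDNUM> is the INDEX column of the job field table and that "required" means the field must appear in each job record; both are interpretations, and the function names are hypothetical.

```python
import re

# Job field names in INDEX order (1..51), copied from the job field table.
JOB_FIELDS = (
    "UPDATETIME STATE WCLIMIT TASKS NODES GEOMETRY QUEUETIME STARTDATE "
    "STARTTIME COMPLETETIME UNAME GNAME ACCOUNT RFEATURES RNETWORK DNETWORK "
    "RCLASS ROPSYS RARCH RMEM RMEMCMP DMEM RDISK RDISKCMP DDISK RSWAP "
    "RSWAPCMP DSWAP PARTITIONMASK EXEC ARGS IWD COMMENT REJCOUNT REJMESSAGE "
    "REJCODE EVENT TASKLIST TASKPERNODE QOS ENDDATE DPROCS HOSTLIST "
    "SUSPENDTIME RESACCESS NAME ENV INPUT OUTPUT ERROR FLAGS"
).split()

REQUIRED_JOB_FIELDS = {
    "UPDATETIME", "STATE", "WCLIMIT", "TASKS", "QUEUETIME",
    "STARTTIME", "COMPLETETIME", "UNAME", "GNAME",
}


def canonical_job_field(name: str) -> str:
    """Map 'A<FIELDNUM>' to its text name; pass known text names through."""
    match = re.fullmatch(r"A([0-9]+)", name)
    if match:
        index = int(match.group(1))
        if not 1 <= index <= len(JOB_FIELDS):
            raise KeyError(name)
        return JOB_FIELDS[index - 1]
    if name not in JOB_FIELDS:
        raise KeyError(name)
    return name


def missing_required(fields: dict) -> list:
    """Return the required job fields absent from one decoded job record."""
    present = {canonical_job_field(name) for name in fields}
    return sorted(REQUIRED_JOB_FIELDS - present)
```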
NOTE 1: job states have the following definitions:
Idle: job is ready to run
Running: job is currently executing
Hold: job is in the queue but is not allowed to run
Suspended: job has started but execution has temporarily been suspended
Completed: job has completed
Removed: job has been cancelled or otherwise terminated externally
NOTE 2: completed and cancelled jobs should
be maintained by the resource manager for a brief time, perhaps 1 to 5
minutes, before being purged. This provides the scheduler time to
obtain all final job state information for scheduler statistics.
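The RMEMCMP, RDISKCMP, and RSWAPCMP fields in the job field table each name a comparison between a node's configured amount and the job's requested amount. A minimal sketch of evaluating such a requirement follows; `node_satisfies` is a hypothetical helper, and placing the node's value on the left of the operator follows the table's wording ("node must have >= 512 MB RAM").

```python
import operator

COMPARATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
}


def node_satisfies(configured_mb: int, required_mb: int, cmp: str = ">=") -> bool:
    """Check 'configured <cmp> required'; '>=' is the default in the table."""
    return COMPARATORS[cmp](configured_mb, required_mb)
```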
StartJob
The 'StartJob' command may only be applied to jobs in the 'Idle' state.
It causes the job to begin running on the nodes listed in TASKLIST.
send CMD=STARTJOB ARG=<JOBID> TASKLIST=<NODEID>[:<NODEID>]...
receive SC=<STATUSCODE> RESPONSE=<RESPONSE>
STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message possibly further describing an error or state
EXAMPLE:
Start job nebo.1 on nodes cluster001 and cluster002
send 'CMD=STARTJOB ARG=nebo.1 TASKLIST=cluster001:cluster002'
receive 'SC=0 RESPONSE=job nebo.1 started with 2 tasks'
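A STARTJOB request can be assembled as sketched below. `build_startjob` is a hypothetical helper; it assumes one node ID per task, joined with colons as in the syntax above.

```python
def build_startjob(job_id: str, task_nodes) -> str:
    """Build a STARTJOB request with a colon-delimited TASKLIST."""
    if not task_nodes:
        raise ValueError("TASKLIST needs at least one node ID")
    return f"CMD=STARTJOB ARG={job_id} TASKLIST={':'.join(task_nodes)}"
```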
CancelJob
The 'CancelJob' command, if applied to an active job, will terminate its
execution. If applied to an idle or active job, the CancelJob command will
change the job's state to 'Cancelled'.
send CMD=CANCELJOB ARG=<JOBID> TYPE=<CANCELTYPE>
<CANCELTYPE> is one of the following:
ADMIN (command initiated by scheduler administrator)
WALLCLOCK (command initiated by scheduler because job exceeded its specified wallclock limit)
receive SC=<STATUSCODE> RESPONSE=<RESPONSE>
STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message further describing an error or state
EXAMPLE:
Cancel job nebo.2
send 'CMD=CANCELJOB ARG=nebo.2 TYPE=ADMIN'
receive 'SC=0 RESPONSE=job nebo.2 cancelled'
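A CANCELJOB request with a validated cancel type might be built like this. The helper name `build_canceljob` is hypothetical; the two accepted types are the ones listed above.

```python
CANCEL_TYPES = ("ADMIN", "WALLCLOCK")


def build_canceljob(job_id: str, cancel_type: str = "ADMIN") -> str:
    """Build a CANCELJOB request, rejecting cancel types the spec does not list."""
    if cancel_type not in CANCEL_TYPES:
        raise ValueError(f"unknown cancel type: {cancel_type!r}")
    return f"CMD=CANCELJOB ARG={job_id} TYPE={cancel_type}"
```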
SuspendJob
The 'SuspendJob' command can only be issued against a job in the 'Running'
state. It suspends job execution and moves the job to the 'Suspended' state.
send CMD=SUSPENDJOB ARG=<JOBID>
receive SC=<STATUSCODE> RESPONSE=<RESPONSE>
STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message possibly further describing an error or state
EXAMPLE:
Suspend job nebo.3
send 'CMD=SUSPENDJOB ARG=nebo.3'
receive 'SC=0 RESPONSE=job nebo.3 suspended'
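The 'SC=<STATUSCODE> RESPONSE=<RESPONSE>' reply shared by the job control commands can be decoded with a short helper. This is a sketch; `parse_status_reply` is a hypothetical name, and a single space between the two parts is assumed.

```python
import re


def parse_status_reply(reply: str):
    """Return (succeeded, status_code, response_text) for an SC/RESPONSE reply."""
    match = re.fullmatch(r"SC=(-?\d+) RESPONSE=(.*)", reply, re.DOTALL)
    if match is None:
        raise ValueError(f"unrecognized reply: {reply!r}")
    code = int(match.group(1))
    # STATUSCODE >= 0 indicates SUCCESS, < 0 indicates FAILURE.
    return code >= 0, code, match.group(2)
```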
ResumeJob
The 'ResumeJob' command can only be issued against a job in the 'Suspended'
state. It resumes a suspended job, returning it to the 'Running' state.
send CMD=RESUMEJOB ARG=<JOBID>
receive SC=<STATUSCODE> RESPONSE=<RESPONSE>
STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message further describing an error or state
EXAMPLE:
Resume job nebo.3
send 'CMD=RESUMEJOB ARG=nebo.3'
receive 'SC=0 RESPONSE=job nebo.3 resumed'
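The state preconditions stated for StartJob, SuspendJob, and ResumeJob can be collected into one table so a caller can check a command before sending it. This is a sketch of that bookkeeping only; `next_state` is a hypothetical helper, and CancelJob is left out because its resulting state is described separately.

```python
# command -> (state the job must be in, state the job moves to)
TRANSITIONS = {
    "STARTJOB": ("Idle", "Running"),
    "SUSPENDJOB": ("Running", "Suspended"),
    "RESUMEJOB": ("Suspended", "Running"),
}


def next_state(command: str, current_state: str) -> str:
    """Return the job state after command, or raise if the precondition fails."""
    required, result = TRANSITIONS[command]
    if current_state != required:
        raise ValueError(f"{command} requires state {required!r}, got {current_state!r}")
    return result
```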
JobAddTask
The 'JobAddTask' command allocates additional tasks to an active job.
send CMD=JOBADDTASK ARG=<JOBID> <NODEID> [<NODEID>]...
receive SC=<STATUSCODE> RESPONSE=<RESPONSE>
STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message possibly further describing an error or state
EXAMPLE:
Add 3 default tasks to job nebo30023.0 using resources located on nodes
cluster002, cluster016, and cluster112.
send 'CMD=JOBADDTASK ARG=nebo30023.0 DEFAULT cluster002 cluster016 cluster112'
receive 'SC=0 RESPONSE=3 tasks added'
JobReleaseTask
The 'JobReleaseTask' command removes tasks from an active job.
send CMD=JOBREMOVETASK ARG=<JOBID> <TASKID> [<TASKID>]...
receive SC=<STATUSCODE> RESPONSE=<RESPONSE>
STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message further describing an error or state
EXAMPLE:
Free resources allocated to tasks 14, 15, and 16 of job nebo30023.0
send 'CMD=JOBREMOVETASK ARG=nebo30023.0 14 15 16'
receive 'SC=0 RESPONSE=3 tasks removed'
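The task release request can be built as sketched below. `build_jobremovetask` is a hypothetical helper, and it uses the JOBREMOVETASK keyword exactly as it appears in the send line and example of this section, with task IDs separated by spaces.

```python
def build_jobremovetask(job_id: str, task_ids) -> str:
    """Build a task release request with space-separated task IDs."""
    if not task_ids:
        raise ValueError("at least one task ID is required")
    ids = " ".join(str(task_id) for task_id in task_ids)
    return f"CMD=JOBREMOVETASK ARG={job_id} {ids}"
```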
Rejection Codes
- 0xx - success - no error
- 00x - success
- 01x - usage/help reply
- 02x - status reply
- 020 - general status reply
- 1xx - warning
- 10x - general warning
- 11x - wire protocol or network warning
- 110 - general wire protocol or network warning
- 112 - redirect
- 114 - protocol warning
- 12x - message format warning
- 120 - general message format warning
- 122 - incomplete specification (best guess action/response applied)
- 13x - security warning
- 130 - general security warning
- 132 - insecure request
- 134 - insufficient privileges (response was censored/action reduced in scope)
- 14x - content or action warning
- 140 - general content/action warning
- 142 - no content (server has processed the request but there is no data to be returned)
- 144 - no action (no object to act upon)
- 146 - partial content
- 148 - partial action
- 15x - component defined
- 18x - application defined
- 2xx - wire protocol/network failure
- 20x - protocol failure
- 200 - general protocol/network failure
- 21x - network failure
- 210 - general network failure
- 212 - cannot resolve host
- 214 - cannot resolve port
- 216 - cannot create socket
- 218 - cannot bind socket
- 22x - connection failure
- 220 - general connection failure
- 222 - cannot connect to service
- 224 - cannot send data
- 226 - cannot receive data
- 23x - connection rejected
- 230 - general connection failure
- 232 - connection timed-out
- 234 - connection rejected - too busy
- 236 - connection rejected - message too big
- 24x - malformed framing
- 240 - general framing failure
- 242 - malformed framing protocol
- 244 - invalid message size
- 246 - unexpected end of file
- 25x - component defined
- 28x - application defined
- 3xx - messaging format error
- 30x - general messaging format error
- 300 - general messaging format error
- 31x - malformed XML document
- 310 - general malformed XML error
- 32x - XML schema validation error
- 320 - general XML schema validation
- 33x - general syntax error in request
- 330 - general request syntax error
- 332 - object incorrectly specified
- 334 - action incorrectly specified
- 336 - option/parameter incorrectly specified
- 34x - general syntax error in response
- 340 - general response syntax error
- 342 - object incorrectly specified
- 344 - action incorrectly specified
- 346 - option/parameter incorrectly specified
- 35x - synchronization failure
- 350 - general synchronization failure
- 352 - request identifier is not unique
- 354 - request id values do not match
- 356 - request id count does not match
- 4xx - security error occurred
- 40x - authentication failure - client signature
- 400 - general client signature failure
- 402 - invalid authentication type
- 404 - cannot generate security token key - inadequate information
- 406 - cannot canonicalize request
- 408 - cannot sign request
- 41x - negotiation failure
- 410 - general negotiation failure
- 412 - negotiation request malformed
- 414 - negotiation request not understood
- 416 - negotiation request not supported
- 42x - authentication failure
- 420 - general authentication failure
- 422 - client signature failure
- 424 - server authentication failure
- 426 - server signature failure
- 428 - client authentication failure
- 43x - encryption failure
- 430 - general encryption failure
- 432 - client encryption failure
- 434 - server decryption failure
- 436 - server encryption failure
- 438 - client decryption failure
- 44x - authorization failure
- 440 - general authorization failure
- 442 - client authorization failure
- 444 - server authorization failure
- 45x - component defined failure
- 48x - application defined failure
- 5xx - event management request failure
- 6xx - reserved for future use
- 7xx - server side error occurred
- 70x - server side error
- 700 - general server side error
- 71x - server does not support requested function
- 710 - server does not support requested function
- 72x - internal server error
- 720 - general internal server error
- 73x - resource unavailable
- 730 - general resource unavailable error
- 732 - software resource unavailable error
- 734 - hardware resource unavailable error
- 74x - request violates policy
- 740 - general policy violation
- 75x - component-defined failure
- 78x - application-defined failure
- 8xx - client side error occurred
- 80x - general client side error
- 800 - general client side error
- 81x - request not supported
- 810 - request not supported
- 82x - application specific failure
- 820 - general application specific failure
- 9xx - miscellaneous
- 90x - general miscellaneous error
- 900 - general miscellaneous error
- 91x - general insufficient resources error
- 910 - general insufficient resources error
- 99x - general unknown error
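The leading digit of a rejection code identifies its top-level class in the list above, so a code can be classified by integer division. The sketch below covers only that top level; `classify_rejection_code` is a hypothetical helper and the category strings are shortened from the list.

```python
REJECTION_CLASSES = {
    0: "success",
    1: "warning",
    2: "wire protocol/network failure",
    3: "messaging format error",
    4: "security error",
    5: "event management request failure",
    6: "reserved for future use",
    7: "server side error",
    8: "client side error",
    9: "miscellaneous",
}


def classify_rejection_code(code: int) -> str:
    """Map a three-digit rejection code (000-999) to its top-level class."""
    if not 0 <= code <= 999:
        raise ValueError(f"rejection code out of range: {code}")
    return REJECTION_CLASSES[code // 100]
```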