SSS Job Object Specification

Draft Release Version 3.0.3

3 AUG 2004

 

Scott Jackson, PNNL

David Jackson, Ames Lab

Brett Bode, Ames Lab

 



Scalable Systems Software Job Object Specification

 

 

Status of this Memo

 

This document describes the job object to be used by Scalable Systems Software compliant components. This specification is intended to be used in conjunction with the SSSRMAP protocol, with the job object passed in the Data field of Requests and Responses. Queries can be issued to a job-cognizant component as modified XPath expressions in the Get field to extract specific information from the job object, as described in the SSSRMAP protocol.

 

Abstract

 

This document describes the syntax and structure of the SSS job object. It describes a job model flexible enough to support the specification of very simple jobs as well as jobs with elaborate and complex requirements, in a way that avoids complex structures and syntax when they are not needed. The basic assumption is that a single job specification should be usable for all phases of the job lifecycle, including submission, queuing, staging, reservations, quotations, execution, charging, and accounting. This job specification provides support for multi-step jobs, as well as jobs with disparate task descriptions. It accounts for operational requirements in a grid or meta-scheduled environment where the job is executed by multiple hosts in different administrative domains that support different resource management systems.

Table of Contents

 

Scalable Systems Software Job Object Specification
Table of Contents
1.      Introduction
1.1      Goals
1.2      Non-Goals
1.3      Examples
1.3.1      Very Simple Example
1.3.2      Moderate Example
1.3.3      Elaborate Example
2.      Conventions used in this document
2.1      Keywords
2.2      Table Column Interpretations
2.3      Element Syntax Cardinality
3.      The Job Model
4.      JobGroup Element
4.1      JobGroup Properties
4.1.1      Simple JobGroup Properties
4.1.2      Job
4.1.3      JobDefaults
5.      Job and JobDefaults Element
5.1      Job Properties
5.1.1      Simple Job Properties
5.1.1.1      ResourceLimit Element
5.1.2      Credentials
5.1.3      Environment Element
5.1.3.1      Variable Element
5.1.4      NodeList Element
5.1.4.1      Node Element
5.1.5      TaskDistribution Element
5.1.6      Dependency Element
5.1.7      Consumable Resources
5.1.8      Resource Element
5.1.9      NodeProperties Element
5.1.9.1      Node Properties
5.1.10      Extension Element
5.1.11      TaskGroup
5.1.12      TaskGroupDefaults
6.      TaskGroup and TaskGroupDefaults Element
6.1      TaskGroup Properties
6.1.1      Simple TaskGroup Properties
6.1.2      Task
6.1.3      TaskDefaults
7.      Task and TaskDefaults Element
7.1      Task Properties
7.1.1      Simple Task Properties
8.      Property Categories
8.1      Requested Element
8.2      Delivered Element
9.      AwarenessPolicy Attribute
10.      References
Appendix A
Units of Measure Abbreviations

 

1.      Introduction

 

This specification proposes a standard XML representation for a job object for use by the various components in the SSS Resource Management System. This object will be used in multiple contexts and by multiple components. It is anticipated that this object will be passed via the Data Element of SSSRMAP Requests and Responses.

1.1      Goals

 

There are several goals motivating the design of this representation.

 

The representation needs to be inherently flexible. We recognize that we cannot exhaustively enumerate the job properties and capabilities that constantly arise.

 

The representation should use the same job object at all stages of that job’s lifecycle. This object will be used at job submission, queuing, scheduling, charging, and accounting; hence it may need to distinguish between requested and delivered properties.

 

The design must account for the properties and structure required to function in a meta-scheduled or grid environment. It needs to include the capability to support local mapping of properties, global namespaces, etc.

 

The equivalent of multi-step jobs must be supported. Each step (job) can have multiple logical task descriptions.

 

Many potential users of the specification will not be prepared to implement the complex portions or the fine granularity that others need. There needs to be a way to add the more complicated structure as needed while keeping straightforward cases simple.

 

There needs to be guidance on how to interpret a given job object when higher-order features are not supported by an implementation, and on which parts are required, recommended, and optional for implementers to implement.

 

It needs to support composite resources.

 

It should include the ability to specify preferences or fuzzy requirements.

 

1.2      Non-Goals

 

Namespace considerations and naming conventions for most property values are outside of the scope of this document.

1.3      Examples

 

1.3.1      Very Simple Example

 

This example shows a simple job object that captures the requirements of a simple job.

 

<Job>
    <JobId>PBS.1234.0</JobId>
    <JobState>Idle</JobState>
    <UserId>scottmo</UserId>
    <Executable>/bin/hostname</Executable>
    <Processors>16</Processors>
    <WallDuration>3600</WallDuration>
</Job>
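
As a rough illustration of how a component might consume a job object like the one above, the following Python sketch parses it with the standard library's xml.etree.ElementTree and collects the simple (leaf) child elements into a dictionary. The function name simple_properties is hypothetical and not part of this specification, and the sketch assumes a flat Job with no nested elements.

```python
import xml.etree.ElementTree as ET

# The very simple job object from this section, embedded as a string.
JOB_XML = """
<Job>
    <JobId>PBS.1234.0</JobId>
    <JobState>Idle</JobState>
    <UserId>scottmo</UserId>
    <Executable>/bin/hostname</Executable>
    <Processors>16</Processors>
    <WallDuration>3600</WallDuration>
</Job>
"""


def simple_properties(job):
    """Map each leaf child element of a Job to its text content.

    Hypothetical helper: children that themselves contain elements
    (for example Requested or TaskGroup) are skipped.
    """
    return {child.tag: (child.text or "").strip()
            for child in job if len(child) == 0}


job = ET.fromstring(JOB_XML)
props = simple_properties(job)
print(props["JobId"], props["Processors"], props["WallDuration"])
```

All values are returned as strings; converting Processors or WallDuration to integers is left to the caller, since this sketch does not consult the property tables for types.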

 

1.3.2      Moderate Example

 

This example shows a moderately complex job object that uses features such as requested versus delivered properties.

 

<Job>
    <JobId>PBS.1234.0</JobId>
    <JobName>Heavy Water</JobName>
    <ProjectId>nwchemdev</ProjectId>
    <UserId>peterk</UserId>
    <Application>NWChem</Application>
    <Executable>/usr/local/nwchem/bin/nwchem</Executable>
    <Arguments>-input basis.in</Arguments>
    <InitialWorkingDirectory>/home/peterk</InitialWorkingDirectory>
    <MachineName>Colony</MachineName>
    <QualityOfService>BottomFeeder</QualityOfService>
    <Queue>batch_normal</Queue>
    <JobState>Completed</JobState>
    <StartTime>1051557713</StartTime>
    <EndTime>1051558868</EndTime>
    <Charge>25410</Charge>
    <Requested>
        <Processors op="ge">12</Processors>
        <Memory op="ge" units="GB">2</Memory>
        <WallDuration>3600</WallDuration>
    </Requested>
    <Delivered>
        <Processors>16</Processors>
        <Memory metric="Average" units="GB">1.89</Memory>
        <WallDuration>1155</WallDuration>
    </Delivered>
    <Environment>
        <Variable name="PATH">/usr/bin:/home/peterk</Variable>
    </Environment>
</Job>
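
To make the requested/delivered distinction concrete, the sketch below (Python, standard library only) reads both categories from an abridged copy of the job above and prints them side by side. It also checks that the difference between EndTime and StartTime matches the delivered WallDuration in this particular example. The helper name category is hypothetical, and the sketch deliberately does not interpret the op or metric attributes; it only reports them.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the moderate example, keeping only the fields used here.
JOB_XML = """
<Job>
    <JobId>PBS.1234.0</JobId>
    <StartTime>1051557713</StartTime>
    <EndTime>1051558868</EndTime>
    <Requested>
        <Processors op="ge">12</Processors>
        <Memory op="ge" units="GB">2</Memory>
        <WallDuration>3600</WallDuration>
    </Requested>
    <Delivered>
        <Processors>16</Processors>
        <Memory metric="Average" units="GB">1.89</Memory>
        <WallDuration>1155</WallDuration>
    </Delivered>
</Job>
"""


def category(job, name):
    """Return {tag: (numeric value, attributes)} for a Requested or Delivered block.

    Hypothetical helper; assumes every child of the block holds a number.
    """
    block = job.find(name)
    if block is None:
        return {}
    return {el.tag: (float(el.text), dict(el.attrib)) for el in block}


job = ET.fromstring(JOB_XML)
requested = category(job, "Requested")
delivered = category(job, "Delivered")

for tag, (req_value, req_attrs) in requested.items():
    del_value, del_attrs = delivered.get(tag, (None, {}))
    print(f"{tag}: requested {req_value} {req_attrs}, "
          f"delivered {del_value} {del_attrs}")

# StartTime and EndTime are seconds since the epoch, so their difference
# is the elapsed wall-clock time in seconds.
elapsed = int(job.findtext("EndTime")) - int(job.findtext("StartTime"))
print("elapsed seconds:", elapsed)
```

In this example the elapsed time works out to 1155 seconds, the same value recorded as the delivered WallDuration.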

 

1.3.3      Elaborate Example

 

This example uses a job group to encapsulate a multi-step job. It shows this specification’s ability to characterize complex job processing capabilities. A component that processes this message is free to retain only the part of the information that it requires. Superfluous information can be ignored by the component or filtered out (by XSLT, for example).

 

<JobGroup>
    <JobGroupId>fr15n05.1234</JobGroupId>
    <JobGroupState>Active</JobGroupState>
    <JobGroupName>ShuttleTakeoff</JobGroupName>
    <JobDefaults>
        <StagedTime>1051557859</StagedTime>
        <SubmitHost>asteroid.lbl.gov</SubmitHost>
        <SubmissionTime>1051556734</SubmissionTime>
        <ProjectId>GrandChallenge18</ProjectId>
        <GlobalUserId>C=US,O=LBNL,CN=Keith Jackson</GlobalUserId>
        <UserId>keith</UserId>
        <Environment>
            <Variable name="LD_LIBRARY_PATH">/usr/lib</Variable>
            <Variable name="PATH">/usr/bin:~/bin:</Variable>
        </Environment>
    </JobDefaults>
    <Job>
        <JobId>fr15n05.1234.0</JobId>
        <JobName>Launch Vector Initialization</JobName>
        <Executable>/usr/local/gridphys/bin/lvcalc</Executable>
        <Queue>batch</Queue>
        <JobState>Completed</JobState>
        <MachineName>SMP2.emsl.pnl.gov</MachineName>
        <StartTime>1051557713</StartTime>
        <EndTime>1051558868</EndTime>
        <QuoteId>http://www.pnl.gov/SMP2#654321</QuoteId>
        <Charge units="USD">12.75</Charge>
        <Requested>
            <WallDuration>3600</WallDuration>
            <Processors>2</Processors>
            <Memory>1024</Memory>
        </Requested>
        <Delivered>
            <WallDuration>1155</WallDuration>
            <Processors consumptionRate="0.78">2</Processors>
            <Memory metric="max">975</Memory>
        </Delivered>
        <TaskGroup>
            <TaskCount>2</TaskCount>
            <TaskDistribution type="TasksPerNode">1</TaskDistribution>
            <Task>
                <Node>node1</Node>
                <ProcessId>99353</ProcessId>
            </Task>
            <Task>
                <Node>node12</Node>
                <ProcessId>80209</ProcessId>
            </Task>
        </TaskGroup>
    </Job>
    <Job>
        <JobId>fr15n05.1234.1</JobId>
        <JobName>3-Phase Ascension</JobName>
        <Queue>batch_normal</Queue>
        <JobState>Idle</JobState>
        <MachineName>Colony.emsl.pnl.gov</MachineName>
        <Priority>1032847</Priority>
        <Hold>System</Hold>
        <StatusMessage>Insufficient funds to start job</StatusMessage>
        <Requested>
            <WallDuration>43200</WallDuration>
        </Requested>
        <TaskGroup>
            <TaskCount>1</TaskCount>
            <TaskGroupName>Master</TaskGroupName>
            <Executable>/usr/local/bin/stage-coordinator</Executable>
            <Memory>2048</Memory>
            <Resource name="License" type="ESSL2">1</Resource>
            <NodeProperties>
                <Feature>Jumbo-Frame</Feature>
            </NodeProperties>
        </TaskGroup>
        <TaskGroup>
            <TaskGroupName>Slave</TaskGroupName>
            <TaskDistribution type="Rule">RoundRobin</TaskDistribution>
            <Executable>/usr/local/bin/stage-slave</Executable>
            <NodeCount>4</NodeCount>
            <Requested>
                <Processors group="-1">12</Processors>
                <Processors conj="or" group="1">16</Processors>
                <Memory>512</Memory>
                <NodeProperties>
                    <Name op="match">fr15n.*</Name>
                </NodeProperties>
            </Requested>
        </TaskGroup>
    </Job>
</JobGroup>
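
The remark that a component may retain only the information it requires can be sketched in Python as follows. The example works on an abridged copy of the job group above. It uses the limited XPath subset supported by xml.etree.ElementTree, which is an assumption of this sketch and not the modified XPath syntax defined by SSSRMAP. The helper name retain and the choice of fields to keep are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the elaborate example.
GROUP_XML = """
<JobGroup>
    <JobGroupId>fr15n05.1234</JobGroupId>
    <Job>
        <JobId>fr15n05.1234.0</JobId>
        <JobState>Completed</JobState>
        <Charge units="USD">12.75</Charge>
        <TaskGroup>
            <Task><Node>node1</Node><ProcessId>99353</ProcessId></Task>
            <Task><Node>node12</Node><ProcessId>80209</ProcessId></Task>
        </TaskGroup>
    </Job>
    <Job>
        <JobId>fr15n05.1234.1</JobId>
        <JobState>Idle</JobState>
        <Hold>System</Hold>
    </Job>
</JobGroup>
"""


def retain(job, wanted):
    """Remove every direct child of job whose tag is not in wanted."""
    for child in list(job):  # copy the child list, since we mutate job
        if child.tag not in wanted:
            job.remove(child)
    return job


group = ET.fromstring(GROUP_XML)

# Select the ids of jobs whose JobState child has the text "Idle".
idle_ids = [j.findtext("JobId")
            for j in group.findall("Job[JobState='Idle']")]

# Collect the node names of all tasks in all jobs.
nodes = [n.text for n in group.findall("./Job/TaskGroup/Task/Node")]

# A component interested only in job state drops everything else.
for job in group.findall("Job"):
    retain(job, {"JobId", "JobState"})
remaining = sorted({child.tag for job in group.findall("Job") for child in job})

print(idle_ids, nodes, remaining)
```

The queries run before the pruning step because retain modifies the tree in place; a component that needs both views would work on a copy instead.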

 

2.      Conventions used in this document

 

2.1      Keywords

 

The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC2119.

 

2.2      Table Column Interpretations

 

The columns of the property tables in this document have the following meanings:

 

Element Name:                 Name of the XML element (xsd:element); see [DATATYPES]

Type:                         Data type defined by xsd (XML Schema Definition) as:

            String      xsd:string (a finite-length sequence of printable characters)
            Integer     xsd:integer (a signed finite-length sequence of decimal digits)
            Float       xsd:float (single-precision 32-bit floating point)
            Boolean     xsd:boolean (consists of the literals “true” or “false”)
            DateTime    xsd:int (a 32-bit unsigned long in GMT seconds since the EPOCH)
            Duration    xsd:int (a 32-bit unsigned long measured in seconds)

 

Description:                  Brief description of the meani