TORQUE Administrator's Manual - Introduction to TORQUE Resource Manager
Introduction to TORQUE Resource Manager
This section provides a starting point for users and administrators alike. It covers topics from "So I have access to a supercomputer, now what?" to basic job flow. It also defines common terms and describes typical environments.
What is a Resource Manager?
While TORQUE has a built-in scheduler, pbs_sched, it is typically used solely as a resource manager, with a separate scheduler making requests to it. Resource managers provide the low-level functionality to start, hold, cancel, and monitor jobs. Without these capabilities, a scheduler alone cannot control jobs.
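The operations named above correspond to client commands that ship with TORQUE. The sketch below maps each operation to a command; the function name `job_command` and the action names are illustrative assumptions, and `qrun` is used for "start" because it asks the server to run an already queued job.

```shell
#!/bin/sh
# Map a job-control action to the TORQUE client command that requests it.
# The action names are illustrative; only the command names come from TORQUE.
job_command() {
    case "$1" in
        start)   echo "qrun" ;;   # ask the server to run a queued job
        hold)    echo "qhold" ;;  # place a hold on a job
        release) echo "qrls" ;;   # release a held job
        cancel)  echo "qdel" ;;   # delete (cancel) a job
        monitor) echo "qstat" ;;  # query job status
        *)       echo "unknown action: $1" >&2; return 1 ;;
    esac
}

for action in start hold cancel monitor; do
    echo "$action -> $(job_command "$action")"
done
```

A scheduler drives the same operations through requests to pbs_server rather than through these commands; the mapping is only meant to make the four capabilities concrete.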
What are Batch Systems?
While TORQUE is flexible enough to handle scheduling a conference room, it is primarily used in batch systems. A batch system is a collection of computers and other resources (networks, storage systems, license servers, etc.) brought together with the idea that the whole is greater than the sum of the parts. Some batch systems consist of just a handful of machines running single-processor jobs, minimally managed by the users themselves. Other systems have thousands and thousands of machines executing users’ jobs simultaneously while tracking software licenses and access to hardware equipment and storage systems.
Pooling resources in a batch system typically reduces the technical administration of resources and offers a uniform view to users. Once configured properly, batch systems abstract away many of the details involved with running and managing jobs, allowing higher utilization of resources. For example, users typically only need to specify the minimal constraints of a job and do not need to know the individual machine names of each host that they are running on. With this uniform abstracted view, batch systems can execute thousands and thousands of jobs simultaneously.
Batch systems are composed of four different types of components:
* Master Node
* Submit/Interactive Nodes
* Compute Nodes
* Resources
Master Node
A batch system will have a master node where pbs_server is running. Depending on the needs of the system, a master node may be dedicated to this task or may fulfill the roles of other components as well.
Submit/Interactive Nodes
Submit or interactive nodes provide an entry point to the system for users to manage their workload. From these nodes, users are able to submit and track their jobs. Additionally, some sites have one or more nodes reserved for interactive use, such as testing and troubleshooting environment problems. These nodes will have client commands (e.g., qsub, qhold, etc.) available.
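A quick way to see whether a node is set up as a submit node is to check for the client commands on the PATH. The following is a minimal sketch; the function name `check_clients` is hypothetical, and the command list is a common subset rather than an exhaustive one.

```shell
#!/bin/sh
# Report which TORQUE client commands are on this node's PATH.
check_clients() {
    for cmd in qsub qstat qhold qrls qdel; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd: available"
        else
            echo "$cmd: not found"
        fi
    done
}

check_clients
```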
Compute Nodes
Compute nodes are the workhorses of the system. Their role is to execute submitted jobs. On each compute node, pbs_mom runs to start, kill, and manage submitted jobs. It communicates with pbs_server on the master node. Depending on the needs of the system, a compute node may double as the master node (or more).
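The compute nodes known to pbs_server can be listed with `pbsnodes -a`, which prints one block per node including a `state = ...` line. The sketch below summarizes how many nodes are in each state; the helper name `count_node_states` is hypothetical, and the script only reports that `pbsnodes` is missing when run on a machine without the client commands.

```shell
#!/bin/sh
# Count nodes per state from `pbsnodes -a` style output read on stdin.
# Only lines of the form "    state = <value>" are considered.
count_node_states() {
    sed -n 's/^[[:space:]]*state = //p' | sort | uniq -c
}

if command -v pbsnodes >/dev/null 2>&1; then
    pbsnodes -a | count_node_states
else
    echo "pbsnodes not found; run this on a node with the TORQUE client commands"
fi
```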
Resources
Some systems are organized for the express purpose of managing a collection of resources beyond compute nodes. Resources can include high-speed networks, storage systems, license managers, etc. Availability of these resources is limited and needs to be managed intelligently to promote fairness and increase utilization.
Basic Job Flow
The life cycle of a job can be divided into four stages:
* creation
* submission
* execution
* finalization
Creation
Typically, a submit script is written to hold all of the parameters of a job. These parameters could include how long a job should run (i.e., walltime), what resources are necessary to run and what to execute. Below is an example submit file:
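A submit script consistent with the description that follows could look like the sketch below. The `#PBS` directives are derived from that description (10 days written as a walltime of 240:00:00, and `-m ea` requesting mail on exit and on abort). The working directory and the `echo` payload are hypothetical placeholders, since the commands of the original example are not shown here.

```shell
#!/bin/sh
#PBS -N localBlast
#PBS -S /bin/sh
#PBS -l nodes=1:ppn=2,walltime=240:00:00
#PBS -M user@my.organization.com
#PBS -m ea

# Where to execute: TORQUE sets PBS_O_WORKDIR to the directory qsub was
# run from; fall back to the current directory when run by hand.
cd "${PBS_O_WORKDIR:-.}"

# What to execute: hypothetical placeholder for the real workload.
echo "localBlast would run here on $(uname -n)"
```

Because the `#PBS` lines are ordinary shell comments, the script can also be run directly for a quick check before it is submitted.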
This submit script specifies the name of the job (localBlast), what environment to use (/bin/sh), that it needs both processors on a single node (nodes=1:ppn=2), that it will run for at most 10 days, and that TORQUE should email user@my.organization.com when the job exits or aborts. Additionally, the user specified where and what to execute.
Submission
A job is submitted with the command qsub. Once submitted, the policies set by the administration and technical staff of the site dictate the priority of the job and, therefore, when it will start executing.
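As a minimal sketch of submission, the snippet below passes a script file to qsub and keeps the job identifier that qsub prints on standard output. The file name `localBlast.pbs` is an assumption made for illustration, and the guards let the snippet run harmlessly on a machine without TORQUE.

```shell
#!/bin/sh
# Hypothetical file name for the submit script.
script="localBlast.pbs"
job_id=""

if ! command -v qsub >/dev/null 2>&1; then
    echo "qsub not found; run this on a submit node"
elif [ ! -f "$script" ]; then
    echo "$script not found; nothing submitted"
elif job_id=$(qsub "$script"); then
    # qsub prints the identifier of the newly queued job.
    echo "submitted $script as $job_id"
else
    echo "qsub reported an error; nothing submitted"
fi
```

Keeping the identifier in a variable makes it easy to pass to the other client commands later.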
Execution
Jobs often spend most of their life cycle executing. While a job is running, its status can be queried with qstat.
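For example, `qstat -f` prints the full attribute list of a job, including a `job_state` line. The sketch below extracts that field; the helper name `job_state` and the job identifier are hypothetical and should be replaced with the identifier printed by qsub.

```shell
#!/bin/sh
# Extract the job_state field from `qstat -f` output read on stdin.
job_state() {
    sed -n 's/^[[:space:]]*job_state = //p'
}

# Hypothetical job identifier.
job_id="1234.master.example.org"

if command -v qstat >/dev/null 2>&1; then
    state=$(qstat -f "$job_id" 2>/dev/null | job_state)
    echo "job $job_id state: ${state:-unknown}"
else
    echo "qstat not found; run this on a submit node"
fi
```

The state is reported as a single letter, commonly Q for queued, R for running, and C for completed.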
Finalization
When a job has completed, by default, the stdout and stderr files are copied to the directory from which the job was submitted.
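By default these files are named after the job name and the numeric part of the job identifier, with `.o` marking stdout and `.e` marking stderr. The sketch below derives the expected names; the function name `output_files` and the job identifier are hypothetical, and the sketch assumes a plain (non-array) job whose names were not overridden at submission.

```shell
#!/bin/sh
# Print the default stdout and stderr file names for a job.
# $1 = job name, $2 = job identifier as printed by qsub.
output_files() {
    seq_num=${2%%.*}    # keep the sequence number before the first dot
    echo "$1.o$seq_num"
    echo "$1.e$seq_num"
}

output_files localBlast 1234.master.example.org
```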