1.0 What Is a Batch System

A batch system is a set of tools that allows one or more users to share the compute power of a cluster effectively. At a high level, it provides simple remote access to these resources and coordinates actions to prevent conflicts between users, improve cluster efficiency, and handle system failures. Using a batch system, users can learn about the health of the cluster, send new requests (jobs) to the cluster, and query and control previous requests.

1.1 Batch vs Interactive Jobs

Many people are familiar with logging into a computer and directly starting a particular program or application. In many cases, they interact with that application through a user interface, indicating that a certain set of data (a file) should be loaded and processed. By logging into a particular computer, the user selects which compute resources will run the job. Before launching the application, the user may enter a particular directory, locate certain files, or otherwise prepare the environment so the application will run successfully.

With a batch system, you can do exactly the same thing. In TORQUE, this is done using the qsub command. The qsub command allows users to specify which host they want to run on using the '-l nodes=<val>' command line argument; in this example, '-l nodes=node01' told TORQUE to run on node node01. Once the qsub command was entered, TORQUE prompted for the command to run. In this example, the command hostname was entered, followed by the <control-D> key sequence to indicate that command entry was complete. Once <control-D> was entered, qsub reported a job identifier, or jobid. Using this jobid, the user can view the status of the job, cancel it, or modify it. Once the jobid is reported, the qsub command exits and the terminal command prompt returns. Assuming that node01 is not completely full, the scheduler (Moab) will start this job immediately.
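
The submission just described can also be driven from a script. The following is a minimal sketch rather than part of TORQUE itself: the helper names build_qsub_argv and submit are hypothetical, and it assumes that qsub is on the PATH and reads the job's commands from standard input when no script file is named, which is what typing the command and pressing <control-D> does at the terminal.

```python
import subprocess


def build_qsub_argv(nodes):
    """Build the qsub argument list for a '-l nodes=<val>' request.

    `nodes` is the value placed after 'nodes=', for example the
    hostname 'node01' used in the text. Hypothetical helper.
    """
    return ["qsub", "-l", f"nodes={nodes}"]


def submit(command, nodes):
    """Submit `command` as a batch job and return whatever qsub prints.

    Assumes a working TORQUE installation. The command text is sent on
    standard input, standing in for typing it followed by <control-D>.
    qsub is expected to print the job identifier on standard output.
    """
    result = subprocess.run(
        build_qsub_argv(nodes),
        input=command + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

On a machine with TORQUE installed, calling submit("hostname", "node01") would correspond to the example in the text and return the reported jobid as a string. Building the argument list in a separate function keeps the request easy to inspect without contacting the batch system.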

Once the job runs, the output and error messages that the command normally generates are placed in a stdout file and a stderr file, respectively. In this case, they are written to the user's home directory as STDIN.o13443 and STDIN.e13443.

I Want to Interact with My Program

In the example above, the application was specified when the qsub command was run. In some cases, users may not know every command they want to run ahead of time, or may want to interact with their program. These types of jobs are called interactive jobs and are requested in TORQUE using qsub's '-I' option.

What If I Don't Know Which Machine I Want to Run On?

The qsub command can be told exactly which machine to use, or it can select the machine for you. This is done by setting qsub's '-l nodes=' value to a number rather than a hostname. For example, '-l nodes=1' requests that the batch system select a single host for you.

What Happens If My Job Cannot Run Immediately?

So far, we have assumed that a machine (node) in the cluster was always available to start the user's jobs immediately. In many cases, this will not be true. Consequently, a user may submit a job and then have to wait for a node to free up. In these cases, the user can run the showq command to see the status of their jobs. In this example, the showq output indicates that job 47 is now running, meaning that the application (in this case 'sleep 500') is currently executing. We also learn that there are a total of 10 machines (also known as nodes) available in this cluster and that the job is going to be running for almost another hour. By default, Moab will only run one job per processor to prevent applications from interfering with each other. With one job already running and, assuming one processor per node, nine processors still free, we know we will need to submit 10 more jobs before we can have an idle job. If we do this and run the showq command again, we will see that 10 jobs are now running and one job is now listed in the eligible jobs section.
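
The counting behind that statement can be made explicit. This is a small sketch under stated assumptions, not a model of Moab's scheduler: it assumes every node has exactly one processor, every job requests one processor, and at most one job runs per processor, as the text describes for the default policy. The function name split_queue is hypothetical.

```python
def split_queue(total_jobs, processors):
    """Return (running, eligible) job counts for single-processor jobs.

    Assumes at most one job runs per processor and that every job asks
    for exactly one processor. Jobs that do not fit wait as eligible.
    """
    running = min(total_jobs, processors)
    eligible = total_jobs - running
    return running, eligible


# Scenario from the text: a 10-node cluster, first with job 47 alone,
# then with 10 more jobs submitted on top of it.
print(split_queue(total_jobs=1, processors=10))   # (1, 0)
print(split_queue(total_jobs=11, processors=10))  # (10, 1)
```

The second call matches the showq result described above: 10 running jobs and one eligible job.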

Jobs listed in the eligible jobs section of the showq output will run as soon as compute resources free up.

I Want to Cancel a Job

In the above example, we submitted 11 jobs which do nothing but sleep. These jobs are not doing any harm, but they have completely plugged the cluster with worthless jobs. To cancel or delete the jobs, use the qdel command with the identifiers of the jobs to remove. If the showq command is then run, you will see that the jobs have been removed.

I Want to Run a Job Which Uses Multiple Machines

So far, all of our examples have requested only a single node (or machine). However, many clusters are commonly used for jobs which use multiple nodes at the same time. Multiple nodes are requested using qsub's '-l nodes=' value. By specifying 'nodes=10', for example, you ask the scheduler to allocate 10 nodes to your job. It is still the responsibility of the job to take advantage of these nodes. Most frequently, this is accomplished by using the pbsdsh command to start programs on each of the nodes, or by using a parallel library such as MPI together with the hostfile generated by TORQUE.

I Want to Learn More Details about the Nodes in the Cluster

Over time, users will also want to learn details about the cluster, request specific node types, specify the job environment, learn what blocks a job, see job completion times, and learn about cluster queues or policies.
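
For the multi-node case described earlier, a job usually has to discover which hosts it was given before it can start work on them. The following is a minimal sketch, assuming the standard TORQUE behaviour of setting the PBS_NODEFILE environment variable inside a running job to the path of the generated hostfile, with one hostname per line and one line per allocated processor. The function names are hypothetical.

```python
import os
from collections import Counter


def read_allocated_hosts(nodefile_path):
    """Return a Counter mapping each hostname to how often it is listed.

    Assumes the hostfile has one hostname per line, repeated once per
    allocated processor. Blank lines are ignored.
    """
    with open(nodefile_path) as handle:
        hosts = [line.strip() for line in handle if line.strip()]
    return Counter(hosts)


def hosts_for_this_job():
    """Read the hostfile named by $PBS_NODEFILE.

    Only meaningful inside a running TORQUE job, where the variable is
    assumed to be set; raises KeyError otherwise.
    """
    return read_allocated_hosts(os.environ["PBS_NODEFILE"])
```

A job script could use the returned host names to decide where to launch its processes, which is the same information that an MPI launcher reads from the hostfile.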
© 2001-2008 Cluster Resources, Incorporated