Job Holds
Moab Workload Manager®

11.1 Job Holds

11.1.1 Holds and Deferred Jobs

A job hold is a mechanism that places a job in a state where it is not eligible to run. Moab supports job holds applied by users, administrators, and even resource managers. These holds are visible in the output of the showq and checkjob commands. A job with a hold placed on it cannot run until the hold is removed. If a hold is placed on a job through the resource manager, it must be released with the command the resource manager provides for that purpose, for example llhold -r for LoadLeveler or qrls for PBS.

Moab supports two other types of holds. The first is a temporary hold known as a defer. A job is deferred if the scheduler determines that it cannot run. This can happen because the job requests resources that do not currently exist, does not have allocations to run, is rejected by the resource manager, repeatedly fails after startup, and so forth. Each time a job is deferred, it remains unable to run for the period specified by the DEFERTIME parameter. A job that appears in the deferred state has encountered one of these failures; details are available by issuing the checkjob <JOBID> command. Once the time specified by DEFERTIME has elapsed, the job is automatically released and the scheduler again attempts to schedule it. The defer mechanism can be disabled by setting DEFERTIME to zero (0). To release a job from the defer state manually, issue releasehold -a <JOBID>.
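The timing rule described above can be sketched in Python. This is an illustrative model only, not Moab code: the function name next_eligible_time and the one-hour value are assumptions made for the example. Only the DEFERTIME parameter and the fact that zero disables deferral come from the text.

```python
from datetime import datetime, timedelta
from typing import Optional


def next_eligible_time(deferred_at: datetime,
                       defer_time: timedelta) -> Optional[datetime]:
    """Return when a deferred job would be released automatically.

    Hypothetical helper: models DEFERTIME as a timedelta. A zero value
    means the defer mechanism is disabled, so no defer period applies
    and None is returned.
    """
    if defer_time <= timedelta(0):
        return None
    return deferred_at + defer_time


# Assumed DEFERTIME of one hour (illustrative value, not a documented default).
deferred_at = datetime(2024, 1, 1, 12, 0, 0)
print(next_eligible_time(deferred_at, timedelta(hours=1)))
```

A manual releasehold -a <JOBID> would simply make the job eligible before this computed time.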

The second Moab-specific type of hold is known as a batch hold. A batch hold is applied only by the scheduler, and only after a serious or repeated job failure. If a job has been deferred and released DEFERCOUNT times, Moab places it in a batch hold. It remains in this hold until a scheduler administrator examines it and takes appropriate action. As with the defer state, the cause of a batch hold can be determined via checkjob, and the hold can be released via releasehold.

In summary, Moab supports four distinct types of holds: user holds, system holds, batch holds, and defer holds. Each of these holds blocks a job, preventing it from running until the hold is removed.

11.1.2 User Holds

User holds are straightforward. Many, if not most, resource managers provide interfaces by which users can place a hold on their own job that tells the scheduler not to run the job while the hold is in place. Users may use this capability because the job's data is not yet ready, or because they want to be present when the job runs to monitor results. Such user holds are created by, and remain under the control of, a non-privileged user and may be removed at any time by that user. As would be expected, users can place holds only on their own jobs. Jobs with a user hold in place have a Moab state of Hold or UserHold, depending on the resource manager being used.

11.1.3 System Holds

The system hold is put in place by a system administrator either manually or by way of an automated tool. As with all holds, the job is not allowed to run so long as this hold is in place. A batch administrator can place and release system holds on any job regardless of job ownership. However, unlike a user hold, normal users cannot release a system hold even on their own jobs. System holds are often used during system maintenance and to prevent particular jobs from running in accordance with current system needs. Jobs with a system hold in place will have a Moab state of Hold or SystemHold depending on the resource manager being used.
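The release rules for user and system holds can be summarized in a small Python sketch. It is a hypothetical model, not part of Moab: the names HoldType, Actor, and can_release are invented for illustration, and the rule that an administrator may also release a user hold is an assumption, since the text only states this explicitly for system holds.

```python
from dataclasses import dataclass
from enum import Enum


class HoldType(Enum):
    USER = "user"
    SYSTEM = "system"


@dataclass(frozen=True)
class Actor:
    name: str
    is_admin: bool = False


def can_release(actor: Actor, job_owner: str, hold: HoldType) -> bool:
    """Return True if the actor may release this hold on the owner's job."""
    if actor.is_admin:
        # Administrators can release system holds on any job (from the text).
        # Extending this to user holds is an assumption of this sketch.
        return True
    # Non-privileged users may release only user holds on their own jobs.
    return hold is HoldType.USER and actor.name == job_owner


alice = Actor("alice")
print(can_release(alice, "alice", HoldType.USER))    # own user hold
print(can_release(alice, "alice", HoldType.SYSTEM))  # system hold on own job
```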

11.1.4 Batch Holds

Batch holds are placed on a job by the scheduler itself when it determines that a job cannot run. The reasons for this vary but can be displayed by issuing the checkjob <JOBID> command. Possible reasons are included in the following list:

  • No Resources — The job requests resources of a type or amount that do not exist on the system.
  • System Limits — The job is larger or longer than what is allowed by the specified system policies.
  • Bank Failure — The allocations bank is experiencing failures.
  • No Allocations — The job requests use of an account that is out of allocations and no fallback account has been specified.
  • RM Reject — The resource manager refuses to start the job.
  • RM Failure — The resource manager is experiencing failures.
  • Policy Violation — The job violates certain throttling policies preventing it from running now and in the future.
  • No QOS Access — The job does not have access to the QoS level it requests.

Jobs that are placed in a batch hold appear in Moab with the state BatchHold.

11.1.5 Job Defer

In most cases, a job that cannot run for one of the reasons listed above is not placed into a batch hold immediately; rather, it is deferred. The DEFERTIME parameter specifies how long it remains deferred. Once that time has elapsed, the job is allowed back into the idle queue and is again considered for scheduling. If it is still unable to run then, or at any later time, it is deferred again for the period specified by DEFERTIME. A job is deferred and released up to DEFERCOUNT times, at which point the scheduler places a batch hold on the job and waits for a system administrator to determine the correct course of action. Deferred jobs have a Moab state of Deferred. As with jobs in the BatchHold state, the reason the job was deferred can be determined with the checkjob command.
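The escalation from repeated deferral to a batch hold can be modeled as a small state machine. The Python sketch below is an assumption-laden illustration, not Moab's implementation: the Job class and function names are hypothetical, max_defers stands in for DEFERCOUNT, and the exact boundary (the batch hold is applied on the failure that follows the last permitted defer) is one reading of the text.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    state: str = "Idle"
    defer_count: int = 0


def handle_failure(job: Job, max_defers: int) -> None:
    """Apply one scheduling failure; max_defers stands in for DEFERCOUNT."""
    if job.defer_count >= max_defers:
        # Defer budget exhausted: escalate to a batch hold (assumed boundary).
        job.state = "BatchHold"
    else:
        job.defer_count += 1
        job.state = "Deferred"


def release_from_defer(job: Job) -> None:
    """Model DEFERTIME elapsing: a deferred job returns to the idle queue."""
    if job.state == "Deferred":
        job.state = "Idle"


def failures_until_batch_hold(max_defers: int) -> int:
    """Count failures needed before a job that always fails is batch-held."""
    job = Job("example.1")  # hypothetical job ID
    failures = 0
    while job.state != "BatchHold":
        handle_failure(job, max_defers)
        failures += 1
        release_from_defer(job)
    return failures


print(failures_until_batch_hold(3))  # prints 4
```

With an illustrative limit of 3, the job is deferred three times and batch-held on the fourth failure. A batch-held job stays in that state in this model until something outside it, such as an administrator running releasehold, intervenes.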

At any time, a job can be released from any hold or deferred state using the releasehold command. The Moab logs should provide detailed information about the cause of any batch hold or job deferral.
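For sites that script these steps, the two commands named in this section can be wrapped as follows. This is a minimal sketch, assuming checkjob and releasehold are on the PATH of the invoking user; the helper names and the job ID 1234 are hypothetical, and only the command forms checkjob <JOBID> and releasehold -a <JOBID> come from the text.

```python
import shlex
import subprocess
from typing import List


def checkjob_cmd(job_id: str) -> List[str]:
    """Build the argument list for: checkjob <JOBID>."""
    return ["checkjob", job_id]


def releasehold_cmd(job_id: str) -> List[str]:
    """Build the argument list for: releasehold -a <JOBID>."""
    return ["releasehold", "-a", job_id]


def run(cmd: List[str]) -> str:
    """Run a command and return its standard output.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Show the commands without executing them (job ID is illustrative).
print(shlex.join(checkjob_cmd("1234")))     # prints: checkjob 1234
print(shlex.join(releasehold_cmd("1234")))  # prints: releasehold -a 1234
```

Passing an argument list rather than a shell string avoids quoting problems with unusual job IDs. Inspecting the checkjob output before calling releasehold is advisable, since releasing a job whose underlying problem persists will only lead to another defer.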

NOTE: Under Moab, the reason a job is deferred or placed in a batch hold is stored in memory but is not checkpointed. This information is therefore available only until Moab is recycled, at which point the checkjob command no longer displays the reason.

See Also

  • DEFERSTARTCOUNT - number of job start failures allowed before job is deferred