[torqueusers] Dependency jobs get system hold

Ronny T. Lampert telecaadmin at uni.de
Thu Aug 17 04:04:53 MDT 2006


Hi,

Using torque 2.1.2 + maui-3.2.6p16, jobs that have dependencies get a
system hold, which can be confusing for the administrator.
Please consider the following two outputs, from checkjob and qstat.

#> checkjob 350236
checking job 350236

State: Hold
Creds:  user:USER  group:GROUP  class:default  qos:low
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Thu Aug 17 11:33:07
  (Time Queued  Total: 00:16:36  Eligible: 00:00:00)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE PREEMPTEE
Attr:        PREEMPTEE

PE:  1.00  StartPriority:  1
cannot select job 350236 for partition DEFAULT (non-idle state 'Hold')


#> qstat -f 350236

Job Id: 350236.SERVER
    Job_Name = e00018
    job_state = H
    queue = default
    server = SERVER
    Checkpoint = u
    ctime = Thu Aug 17 11:33:07 2006
    depend = afterany:350235.SERVER@SERVER
    [...]
    Hold_Types = s
    [...]


When I look at checkjob, I see that the job is in Hold state and assume
that something is wrong with it.
Then I look at Hold_Types in qstat, see "s" (system hold), and conclude
that something has gone wrong, at least if I overlook the "depend =" line...


Now some questions:

1) Do these jobs follow the usual DEFER routines, with retries and DEFERTIME
checking? Or does maui magically know that this is NOT a deferred job?
*I* would think it is one.

2) I think a USER hold would be much more to the point. Or a new type,
DEPENDENCY-HOLD.

3) Could this somehow be made clearer to the administrator? It would be
great if checkjob just said

"cannot select job 350236 for partition DEFAULT (non-idle state 'Hold') -
5 of 10 job-dependencies not fulfilled" or something similar.

That would save me (and others?) from wondering, and from having to
run both qstat AND checkjob manually.
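
In the meantime, the cross-check could be scripted. The following is only a
minimal sketch: the function names parse_qstat_full and explain_hold are
made up for illustration, it only reads simple "key = value" lines from the
text of "qstat -f <jobid>", and the rule "system hold plus a depend
attribute means a dependency hold" is an assumption based on the output
above, not documented torque or maui behaviour.

```python
import re

# Matches indented "name = value" attribute lines as printed by "qstat -f".
ATTR_LINE = re.compile(r"^\s+([A-Za-z_][\w.]*) = (.*)$")


def parse_qstat_full(text):
    """Turn the text of 'qstat -f <jobid>' into a dict of attributes.

    Only single-line 'name = value' entries are handled; wrapped
    continuation lines and anything else are ignored.
    """
    attrs = {}
    for line in text.splitlines():
        match = ATTR_LINE.match(line)
        if match:
            attrs[match.group(1)] = match.group(2).strip()
    return attrs


def explain_hold(attrs):
    """Return a one-line hint about why a job is held.

    Assumption (not confirmed by the torque docs): a held job with a
    system hold AND a 'depend' attribute is waiting on its dependencies.
    """
    if attrs.get("job_state") != "H":
        return "job is not held"
    hold_types = attrs.get("Hold_Types", "")
    depend = attrs.get("depend")
    if "s" in hold_types and depend:
        return "system hold, probably an unmet dependency: " + depend
    return "held (Hold_Types = %s), no depend attribute found" % hold_types
```

Feeding it the qstat output shown earlier would report the system hold
together with the depend line, so the dependency is harder to overlook.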

Cheers,
Ronny



