[torqueusers] Slot limit issues (still)
Andrus, Brian Contractor
bdandrus at nps.edu
Tue Sep 24 10:44:06 MDT 2013
Yes, they are in a blocked state (batch hold) per section 10-e of the mwm documentation:
In most cases, a job violating these policies is not placed into a batch hold immediately; rather, it is deferred. The parameter DEFERTIME<http://docs.adaptivecomputing.com/mwm/Content/a.fparameters.html#defertime> indicates how long it is deferred. At this time, it is allowed back into the idle queue and again considered for scheduling. If it again is unable to run at that time or at any time in the future, it is again deferred for the timeframe specified by DEFERTIME. A job is released and deferred up to DEFERCOUNT<http://docs.adaptivecomputing.com/mwm/Content/a.fparameters.html#defercount> times at which point the scheduler places a batch hold on the job and waits for a system administrator to determine the correct course of action. Deferred jobs have a Moab state of Deferred. As with jobs in the BatchHold state, the reason the job was deferred can be determined by use of the checkjob command.
At any time, a job can be released from any hold or deferred state using the releasehold<http://docs.adaptivecomputing.com/mwm/Content/commands/releasehold.html> command. The Moab logs should provide detailed information about the cause of any batch hold or job deferral.
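To make the defer/hold cycle described above concrete, here is a toy model in Python. It is only a sketch of my reading of the documentation, not Moab code: the Job class, the function names, and the exact point at which the count tips over into a batch hold are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for a scheduler job record (not a Moab structure)."""
    job_id: str
    state: str = "Idle"
    defer_count: int = 0


def on_policy_violation(job, max_defer_count):
    """Defer the job, or batch-hold it once it has been deferred max_defer_count times.

    max_defer_count plays the role of DEFERCOUNT; the boundary condition is an assumption.
    """
    if job.defer_count < max_defer_count:
        job.defer_count += 1
        job.state = "Deferred"
    else:
        job.state = "BatchHold"
    return job.state


def on_defer_time_elapsed(job):
    """After DEFERTIME passes, a deferred job goes back to the idle queue."""
    if job.state == "Deferred":
        job.state = "Idle"
    return job.state


def simulate(max_defer_count):
    """Model a job that can never start: record its state after each failed attempt."""
    job = Job("example")
    history = []
    while job.state != "BatchHold":
        history.append(on_policy_violation(job, max_defer_count))
        on_defer_time_elapsed(job)
    return history
```

Under these assumptions, simulate(2) walks through two deferrals and then lands in BatchHold, which is the state that needs an administrator (or releasehold) to clear.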
Oddly, in my case, I can get no info from the checkjob command:
NOTE: job cannot run (job has hold in place)
BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)
And doing releasehold doesn't help:
[root at hamming ~]# releasehold -a 20139590
holds not modified for job 20139590 ( hold still in place)
So it seems, somehow, somewhere, torque thinks this user/job has 512 slots already taken...
Naval Postgraduate School
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson
Sent: Tuesday, September 24, 2013 9:14 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Slot limit issues (still)
On Tue, Sep 24, 2013 at 9:30 AM, Andrus, Brian Contractor <bdandrus at nps.edu<mailto:bdandrus at nps.edu>> wrote:
OK, this one is still going on with the same array job.
I have many array jobs (same parent job) that have gone into a 'blocked' status because they couldn't start in a timely manner (DEFERTIME/DEFERCOUNT). Not unusual for a sizeable array job with slot limits (set server max_slot_limit = 512).
So I want to start some of these jobs. The user has NO jobs currently running (there ARE other jobs running, but only 5 of them are array jobs, and those belong to a different user).
I am trying with job 20139590
Here is what I try/get:
[root at cluster ~]# qrls 20139590
[root at cluster ~]# qrun 20139590
qrun: Invalid request MSG=Cannot run job. Array slot limit is 512 and there are already 512 jobs running
[root at cluster ~]# qrerun 20139590
qrerun: Request invalid for state of job MSG=job 20139590.cluster is in a bad state 20139590.cluster
I have tried restarting pbs_server and looked at the output of pbsnodes to see if any pieces of this job are floating around, but there are none. I also checked each node for anything belonging to that job/user; nothing there either.
Any ideas what is going on here and/or how to get these jobs running?
Naval Postgraduate School
I see you are doing a qrls on the job before running the job. So these jobs are on hold before they run. Correct?