[torqueusers] Re: [Mauiusers] suspend / resume

Gerson Galang gerson.sapac at gawab.com
Sun Aug 1 23:54:46 MDT 2004


Hi,

I tried your patch and it worked on our test cluster. However, I have to 
start the next job manually with the qrun command: even after the server 
frees the nodes that hold suspended jobs, the next job in the queue is 
still not started. This only happens when the number of requested nodes 
is more than TOTAL_NUM_OF_COMPUTE_NODES - NODES_WITH_SUSPENDED_JOBS. 
Here is the output of "checkjob <jobid>" for the next job in the queue, 
the one that is not started automatically:

...
Reservation '815' (00:58:44 -> 1:58:44  Duration: 1:00:00)
PE:  6.00  StartPriority:  1
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 6 procs found)
idle procs:   6  feasible procs:   0
Rejection Reasons: [ReserveTime  :    6]
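
To make the condition above concrete, here is a minimal C sketch of the 
arithmetic only. The function name and the example numbers are 
hypothetical; nothing here is taken from the Maui or TORQUE sources.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of Maui or TORQUE): returns 1 when a job
 * requests more nodes than the cluster has nodes without suspended jobs,
 * i.e. requested > total_nodes - nodes_with_suspended_jobs.
 * That is the case in which the queued job was not started automatically.
 */
int exceeds_unsuspended_nodes(int total_nodes,
                              int nodes_with_suspended_jobs,
                              int requested_nodes)
{
    return requested_nodes > total_nodes - nodes_with_suspended_jobs;
}
```

For example, on an assumed 8-node cluster with 2 nodes holding suspended 
jobs, a 7-node request satisfies the condition and would need qrun, while 
a 6-node request does not.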

Does anybody else have a patch to set the state of the processes to idle?

Another thing we have noticed when suspending jobs is that a job's 
remaining walltime keeps decreasing even after the job has been 
suspended. Is there a way to stop the wall clock of a suspended job?

Thanks,
Gerson

Bernward Platz wrote:
> I think this is a problem in req_signal.c, because
> when a job is suspended the nodes allocated to the job are not released.
> I wrote a short patch to solve this problem. The important call in 
> req_signal.c is "free_nodes". 
> The patch is not well tested yet, but I have used it several times 
> without problems.
> 
> Regards
> 
> Bernward
> 
> 
> 
> diff -urN -X exclude torque-1.0.1.org/src/server/req_signal.c torque-1.0.1/src/server/req_signal.c
> --- torque-1.0.1.org/src/server/req_signal.c    2004-02-13 20:01:00.000000000 +0100
> +++ torque-1.0.1/src/server/req_signal.c        2004-03-20 10:01:13.000000000 +0100
> @@ -206,8 +206,10 @@
>                         pjob->ji_qs.ji_svrflags |= JOB_SVFLG_Suspend;
>                         set_statechar(pjob);
>                         job_save(pjob, SAVEJOB_QUICK);
> +                        free_nodes(pjob);
>                 } else if (strcmp(preq->rq_ind.rq_signal.rq_signame,
>                            SIG_RESUME) == 0) {
> +                        set_old_nodes(pjob);
>                         pjob->ji_qs.ji_svrflags &= ~JOB_SVFLG_Suspend;
>                         set_statechar(pjob);
>                         job_save(pjob, SAVEJOB_QUICK);
> 
> 
> 
> On Wednesday 28 July 2004 10:50, Sébastien Georget wrote:
> 
>>Hi,
>>
>>   I am trying to use the maui/torque suspend feature. Right now I can
>>suspend/resume jobs using qsig -s suspend/resume JOBID or mjobctl -s/-r
>>JOBID.
>>The problem is that the nodes where the suspended job runs are still in
>>the state 'job-exclusive' and cannot be used for new jobs. I
>>wonder which of maui or torque is at fault here.
>>Should torque change the state of the node to free when the job is
>>suspended, or should maui do it? Can this be configured somewhere?
>>
>>thx,
>>Sébastien
> 
> 


