[torqueusers] Scheduler efficiency

garrick at speculation.org
Mon Jun 12 13:24:52 MDT 2006


On Fri, Jun 09, 2006 at 12:22:34PM +1000, Franc Carter alleged:
> Hi,
> 
> We are using torque-1.2 with a site specific TCL scheduling algorithm. The
> number of jobs in the queue has grown significantly since we implemented
> (several thousand) and the scheduler takes a long time to make a decision
> and uses lots of CPU time.
> 
> Part of the problem appears to be that on every cycle the scheduler needs to
> completely reread the entire state instead of being able to find out just
> the change that caused the scheduler to be invoked - i.e. job 1234 exited.
> 
> I had a look through the source code and it looks like this information is
> not available in the protocol - but my C is rather rusty.
> 
> Can someone confirm that this information is not available to the scheduler,
> and is it available in the 2.0 version? More importantly, is anyone running
> a scheduler that works 'efficiently' in the 1000's of jobs range?

Unfortunately, that is just how it works.  Each scheduling iteration
must call pbs_statjob() and "download" all job info.

I've been thinking that it would be nice to have a second version of the
pbs_stat*() functions that keep per-connection state inside of pbs_server
and return only what has changed since the previous call (for as long as
the connection is maintained).
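
To make the idea concrete, here is a rough Python sketch of what such a
"changes only" stat call might look like. Nothing like this exists in
TORQUE as far as I know: JobServer, stat_all(), stat_changes() and
disconnect() are all made-up names, and the in-memory dict is just a toy
stand-in for pbs_server's job list.

```python
# Hypothetical sketch only: none of these names exist in the TORQUE API.
class JobServer:
    """Toy stand-in for pbs_server, holding job state in memory."""

    def __init__(self):
        self.jobs = {}        # job id -> attribute dict
        self._snapshots = {}  # connection id -> job state last sent

    def stat_all(self):
        """Full status of every job, like a pbs_statjob() on all jobs."""
        return {jid: dict(attrs) for jid, attrs in self.jobs.items()}

    def stat_changes(self, conn):
        """Return (changed, removed) since this connection last asked."""
        old = self._snapshots.get(conn, {})
        new = self.stat_all()
        # New or modified jobs: anything that differs from what was sent.
        changed = {jid: a for jid, a in new.items() if old.get(jid) != a}
        # Jobs that were reported before but are gone now.
        removed = sorted(jid for jid in old if jid not in new)
        self._snapshots[conn] = new
        return changed, removed

    def disconnect(self, conn):
        """Saved state is dropped when the connection goes away."""
        self._snapshots.pop(conn, None)


server = JobServer()
server.jobs = {"1233": {"state": "R"}, "1234": {"state": "R"}}
first = server.stat_changes("sched")   # first call returns everything
del server.jobs["1234"]                # job 1234 exits
second = server.stat_changes("sched")  # only the removal is reported
```

The first call on a connection still returns everything; later calls
return only the difference. This toy version still compares full
snapshots on the server side, so it only saves transfer and scheduler-side
parsing. A real implementation would more likely record changes as they
happen.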



