[torqueusers] Scalability issues with pbs_sched_cc

Ronny T. Lampert telecaadmin at uni.de
Sat Nov 27 06:42:24 MST 2004


I noticed pbs_sched quitting again today, for the 10th time, because of "too
long" *).
I even set the delay via "-a 400", and now "-a 800", to see if this helps (it
does not).
The pbs_server was instructed to run the scheduler every 480 s, now every 900 s.
(The server and scheduler run on one of the nodes that serve the queue.)

The queue currently holds around 760 jobs.
When tracing pbs_sched with strace, I noticed that it repeats the following cycle:

select() -> read() -> write()

and it seems to do this for one job at a time; each cycle takes around 1 s
(which means we need >= 700 seconds for 700 jobs, right?)
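
As a rough check of that estimate, the arithmetic can be written out. This is
only a sketch: the 1 s per cycle figure is the approximate value seen in
strace, and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate of one scheduling pass.
# Assumption: each job costs one select()/read()/write() cycle of roughly
# 1 second, as observed with strace. The real cost per job may differ.

def estimated_pass_seconds(jobs: int, seconds_per_job: float = 1.0) -> float:
    """Time for one pass if jobs are handled strictly one at a time."""
    return jobs * seconds_per_job

queued_jobs = 760
estimate = estimated_pass_seconds(queued_jobs)

# Values tried so far for "-a" and for the server's scheduling interval.
limits = [
    ("-a 400", 400),
    ("-a 800", 800),
    ("server interval 480 s", 480),
    ("server interval 900 s", 900),
]
for label, limit in limits:
    verdict = "exceeds" if estimate > limit else "fits within"
    print(f"estimated {estimate:.0f} s {verdict} {label}")
```

With about 760 jobs the estimate sits close to the 800 s setting: a cycle of
roughly 1.05 s per job instead of 1 s would already exceed it.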

Could we remedy the problem by transferring a set (100, even 500 or more) of
job descriptions in one burst, letting the scheduler sort it (this shouldn't
really take long), and then sending the job set back to the server in one burst?
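
To make the idea concrete, here is a minimal sketch of such batching. It is
not how pbs_sched works today; the Job type, its priority field and the
function names are hypothetical stand-ins, and a "round trip" is simply
counted rather than performed.

```python
# Sketch of the proposed batching, NOT the current pbs_sched behaviour.
# Job, its priority field and all function names are hypothetical.
from dataclasses import dataclass
from typing import Iterator, List, Tuple

@dataclass
class Job:
    job_id: int
    priority: int

def batches(jobs: List[Job], size: int) -> Iterator[List[Job]]:
    """Yield consecutive slices of at most `size` jobs."""
    for start in range(0, len(jobs), size):
        yield jobs[start:start + size]

def schedule_in_batches(jobs: List[Job], size: int) -> Tuple[List[Job], int]:
    """Sort each batch locally and count the server round trips needed."""
    round_trips = 0
    ordered: List[Job] = []
    for batch in batches(jobs, size):
        # One transfer to fetch the batch, one to send the result back.
        round_trips += 2
        ordered.extend(sorted(batch, key=lambda j: j.priority, reverse=True))
    return ordered, round_trips

queue = [Job(job_id=i, priority=i % 7) for i in range(760)]
_, trips = schedule_in_batches(queue, size=100)
print(trips)  # 8 batches, 2 transfers each
```

Note that this sorts within each batch only; sorting the whole queue at once
would need a batch size at least as large as the queue.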

Does anybody else have these problems?
If you need more info, I will happily supply it.

Kind regards,

*) Because I have set up a private dir (/usr/local/torque-1.1.0) in which the
whole installation is isolated, the scheduler couldn't restart itself, which
is how I noticed at all.
It resets the environment (including $PATH) to the contents of "pbs_environment".
The pbs_server and _sched are started via

PATH="...:/usr/local/torque-1.1.0/sbin" pbs_sched

and so they can't execv argv[0], because it is not a full path.
This is not a problem, as I patched it in 5 minutes.
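
For illustration only, one general way to avoid this kind of problem is to
resolve argv[0] to a full path at startup, before the environment is reset.
The sketch below is in Python rather than the daemon's own code, is not the
patch mentioned above, and resolve_self is a made-up name.

```python
# Sketch: turn argv[0] into an absolute path at startup, while the original
# $PATH is still intact, so that a later execv() gets a full path.
# Illustration only; this is not the actual patch.
import os
import shutil
import sys
from typing import Optional

def resolve_self(argv0: str, search_path: Optional[str]) -> Optional[str]:
    """Return an absolute path for argv0, or None if it cannot be found."""
    if os.sep in argv0:
        # Already a relative or absolute path: no $PATH lookup is needed.
        return os.path.abspath(argv0)
    found = shutil.which(argv0, path=search_path)
    return os.path.abspath(found) if found else None

if __name__ == "__main__":
    # Remember the full path before the environment is reset.
    full_path = resolve_self(sys.argv[0], os.environ.get("PATH"))
    print(full_path)
    # A restart could then use:
    #   os.execv(full_path, [full_path] + sys.argv[1:])
```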
