[torqueusers] Scalability issues with pbs_sched_cc
Ronny T. Lampert
telecaadmin at uni.de
Sat Nov 27 06:42:24 MST 2004
Hi,
I noticed pbs_sched quitting again today, for the 10th time, because of "too
long" *).
I even set the delay via "-a 400", and now "-a 800", to see whether this
helps (it does not).
The pbs_server was instructed to run the scheduler every 480s, now every 900s.
(The server and the scheduler run on one of the nodes serving the queue.)
The queue currently holds around 760 jobs.
When tracing pbs_sched with strace, I noticed that it repeats the following
cycle:
select() -> read() -> write()
and it seems to do this for one job at a time; each cycle takes around 1s
(which means we need >= 700 seconds for 700 jobs, right?)
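
As a rough back-of-the-envelope check of that estimate, here is a small Python
sketch. It assumes the per-job cost is a constant 1s and that the "-a" value
bounds one whole pass over the queue; both are simplifications for
illustration, not measured or documented behaviour. The function name is
hypothetical.

```python
def estimated_pass_seconds(queued_jobs, seconds_per_job=1.0):
    """Lower bound on one scheduling pass when jobs are handled one at a time."""
    return queued_jobs * seconds_per_job


QUEUED_JOBS = 760  # queue length reported in this post

for limit in (400, 800):  # the two values tried with -a
    estimate = estimated_pass_seconds(QUEUED_JOBS)
    print(f"limit={limit}s estimate={estimate:.0f}s over_limit={estimate > limit}")
```

Under these assumptions a pass needs about 760s, which is well over a 400s
limit and within about 5% of an 800s one, so a little extra per-job overhead
would be enough to exceed 800s as well.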
Could we remedy the problem by sending the scheduler a burst of job
descriptions (100, even 500 or more), letting the scheduler sort that set
(this shouldn't really take long), and then sending the whole job set back to
the server in one burst?
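
To make the batching idea concrete, here is a toy Python model. It is not
TORQUE code and uses no TORQUE API: the job dictionaries, the "priority"
field, and the assumption that a batch costs one exchange to fetch and one to
send back are all hypothetical.

```python
import math


def per_job_exchanges(num_jobs):
    # One select/read/write exchange per job, as seen in the strace output.
    return num_jobs


def batched_exchanges(num_jobs, batch_size):
    # Assumed cost model: one exchange to fetch a batch, one to send it back.
    return 2 * math.ceil(num_jobs / batch_size)


def batches(jobs, batch_size):
    # Yield consecutive slices of at most batch_size jobs.
    for start in range(0, len(jobs), batch_size):
        yield jobs[start:start + batch_size]


# Hypothetical job descriptions; "priority" stands in for whatever the
# scheduler actually sorts on.
jobs = [{"id": i, "priority": (i * 37) % 100} for i in range(760)]

ordered = []
for batch in batches(jobs, 100):
    # Sort each batch locally, highest priority first.
    ordered.extend(sorted(batch, key=lambda job: job["priority"], reverse=True))

print(per_job_exchanges(len(jobs)), batched_exchanges(len(jobs), 100))
```

In this model 760 jobs in batches of 100 need 16 exchanges instead of 760.
Whether the real server/scheduler protocol can be batched this way is exactly
the open question.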
Does anybody else have these problems?
If you need more info, I will happily supply it.
Kind regards,
Ronny
*) Because I have set up a private directory (/usr/local/torque-1.1.0) in
which the whole installation is isolated, the scheduler couldn't restart
itself, which is how I noticed in the first place.
The scheduler resets the environment (including $PATH) to the contents of
"pbs_environment".
The pbs_server and pbs_sched are started via
PATH="...:/usr/local/torque-1.1.0/sbin" pbs_sched
and as such the scheduler can't execv argv[0], because argv[0] is not a full
path.
This is not a problem, as I patched it in 5 minutes.
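
The patch itself is not shown here. One possible approach, sketched in Python
rather than the scheduler's C, is to resolve argv[0] to an absolute path at
startup, while the original $PATH is still in effect, and to re-exec that
stored path later. execv() does not search $PATH, so a bare name only works if
the binary happens to be in the current directory. The function name below is
hypothetical.

```python
import os
import shutil
import sys


def resolve_program(argv0, search_path):
    """Return an absolute path for argv0, or None if it cannot be found."""
    if os.sep in argv0:
        # Already has a directory part: make it absolute against the cwd.
        return os.path.abspath(argv0)
    # Bare name: look it up in the given PATH-style string.
    return shutil.which(argv0, path=search_path)


if __name__ == "__main__":
    # Resolve once at startup, before the environment is reset.
    startup_program = resolve_program(
        sys.argv[0], os.environ.get("PATH", os.defpath)
    )
    print("would re-exec:", startup_program)
    # A later restart could then call, with the saved absolute path:
    # os.execv(startup_program, [startup_program] + sys.argv[1:])
```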