[torqueusers] Scalability issues with pbs_sched_cc

Eric D. Blom ebx at cypress.com
Sat Nov 27 15:46:47 MST 2004


Ronny,
We have also had problems when submitting large numbers of small jobs
to our system. We have about 35 nodes, and when we submit, say, 10,000
jobs averaging 5 minutes each, the system struggles to keep all nodes
busy. I haven't had time to investigate, though.
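
For a rough sense of scale, here is a back-of-the-envelope sketch of the
dispatch rate those figures imply. It assumes one job per node at a time,
which is not stated above, so treat the numbers as illustrative only.

```python
# Dispatch rate needed to keep every node busy.
# Figures come from the paragraph above; one job per node is an assumption.
nodes = 35
jobs = 10_000
avg_job_seconds = 5 * 60

# With every node busy, a job finishes (and must be replaced) this often:
seconds_between_starts = avg_job_seconds / nodes
starts_per_minute = 60 / seconds_between_starts

# Best-case wall time for the whole batch if no node ever sits idle.
ideal_hours = jobs * avg_job_seconds / nodes / 3600

print(f"one job start every {seconds_between_starts:.1f} s")
print(f"about {starts_per_minute:.0f} job starts per minute")
print(f"ideal makespan: {ideal_hours:.1f} h")
```

Under those assumptions, a scheduler that only places jobs every few
minutes, or that needs about a second per queued job, would plausibly
leave nodes idle between passes.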

Eric



On Nov 27, 2004, at 5:42 AM, Ronny T. Lampert wrote:

> Hi,
>
> I noticed pbs_sched quitting again today, for the 10th time, because of
> "too long" *).
> I even set the delay via "-a 400", and now "-a 800", to see whether
> this helps (it does not).
> The pbs_server was instructed to run the scheduler every 480s, now 900s.
> (The server/sched is on a node for the queue)
>
> The queue currently holds around 760 jobs.
> When tracing pbs_sched with strace, I noticed that it repeats the
> following cycle:
>
> select() -> read() -> write()
>
> It seems to do this for one job at a time, at around 1s per cycle
> (which means we have >= 700 seconds for 700 jobs, right?)
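
As a sanity check of that estimate, the arithmetic can be written out
with the figures quoted in this message. The sketch assumes the "-a"
value acts as a time limit on one scheduling pass; the message calls it
a "delay", so that reading is an assumption.

```python
# Estimate one full scheduling pass from the figures quoted above.
queued_jobs = 760        # "around 760 jobs"
seconds_per_job = 1.0    # observed select/read/write cycle, roughly

cycle_seconds = queued_jobs * seconds_per_job

# Values tried with "-a" (assumed here to limit the length of one pass).
for limit in (400, 800):
    verdict = "exceeds" if cycle_seconds > limit else "fits within"
    print(f"{cycle_seconds:.0f} s {verdict} a {limit} s limit")
```

At 800 the margin is only about 40 seconds, so a per-job time slightly
above 1s, or a slightly longer queue, would still overrun it.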
>
> Could we remedy the problem by bursting a set (100, even 500 or more)
> of job descriptions to the scheduler, letting it sort them (this
> shouldn't really take long), and then bursting the job set back to the
> server?
>
> Does anybody else have these problems?
> If you need more info, I will happily supply it.
>
> Kind regards,
> Ronny
>
>
>
> *) Because I have set up a private dir (/usr/local/torque-1.1.0) where
> the whole installation is isolated, the scheduler couldn't restart
> itself, which is how I noticed.
> It resets the environment (including $PATH) to the contents of
> "pbs_environment".
> The pbs_server and pbs_sched are started via
>
> PATH="...:/usr/local/torque-1.1.0/sbin" pbs_sched
>
> and so the scheduler can't execv argv[0], because it is not the full
> path.
> This is not a problem, as I patched it in 5 minutes.
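
The patch itself is not shown, so the following is only a sketch of one
common way to avoid this class of problem, written in Python rather than
the scheduler's own code: resolve argv[0] to an absolute path at startup,
while the original PATH is still in effect, and re-exec using that saved
path. The function names are made up for the example, and it assumes a
POSIX system.

```python
import os
import shutil
import sys


def resolve_executable(argv0, search_path=None):
    """Return an absolute path for argv0, or None if it cannot be found."""
    if os.sep in argv0:
        # Already a relative or absolute path: just make it absolute.
        return os.path.abspath(argv0)
    # Bare command name: look it up on PATH now, before PATH gets reset.
    found = shutil.which(argv0, path=search_path)
    return os.path.abspath(found) if found else None


def restart_self(saved_path):
    """Re-exec this process using the absolute path saved at startup."""
    os.execv(saved_path, [saved_path] + sys.argv[1:])


if __name__ == "__main__":
    # Resolve once at startup; restart_self(saved) could be called later,
    # even after the environment has been replaced.
    saved = resolve_executable(sys.argv[0])
    print("would restart via:", saved)
```

The point of the design is that execv does not search PATH, so the
lookup has to happen before the environment is reset.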
>
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Eric D. Blom            Toll Free: 800-669-0557
Senior Staff Design Engineer  Tel: 425-787-4825
Cypress MicroSystems          Fax: 425-787-4641
ebx at cypress.com            www.cypressmicro.com
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


