[torqueusers] Recommendation for production clusters
csamuel at vpac.org
Fri Jun 27 07:34:40 MDT 2008
----- "Bu Khamsin, Ahmed M" <ahmed.bukhamsin at aramco.com> wrote:
> Hi all,
> In our data center we are running different versions of Torque in our
> clusters with different configurations, some with Maui and some with
> pbs_sched. We are facing a lot of problems: sometimes jobs do not
> run, and sometimes jobs die.
Hmm, that's not much fun at all!
Some more information would be useful to get a better idea
of what's going on:
0) Which versions of Torque and which versions of Maui are you running?
1) When a job doesn't run, what do qstat -f and checkjob -v say about why it's not running?
2) Can you include the output of tracejob for the ID of a job that has just failed?
3) Can you go into a bit more detail about what happens when a job fails?
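
As a rough sketch of how the output asked for in 1) and 2) could be
gathered in one go: the helper below runs qstat -f, checkjob -v and
tracejob against a single job ID and labels each section. Only those
three commands come from the questions above; the function name, the
section headers and the example job ID are made up for illustration.
It assumes the tools are on the PATH, and tracejob will generally need
to run somewhere that can read the Torque logs.

```shell
#!/bin/sh
# Hypothetical helper: collect per-job diagnostics for a mailing-list report.
# The function name and output layout are illustrative, not part of Torque or Maui.
collect_job_diagnostics() {
    jobid="$1"
    if [ -z "$jobid" ]; then
        echo "usage: collect_job_diagnostics <jobid>" >&2
        return 1
    fi
    for cmd in "qstat -f" "checkjob -v" "tracejob"; do
        tool=${cmd%% *}    # first word of the command, e.g. "qstat"
        echo "===== $cmd $jobid ====="
        if command -v "$tool" >/dev/null 2>&1; then
            # $cmd is left unquoted on purpose so "qstat -f" splits into
            # the command and its flag.
            $cmd "$jobid" 2>&1 || echo "($tool exited with status $?)"
        else
            # Keep going so the report still shows which tools were missing.
            echo "($tool not found on this host)"
        fi
    done
    return 0
}
```

For example, with a made-up job ID, something like
collect_job_diagnostics 1234.headnode > job-1234.txt
would leave a single file that can be attached to a reply.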
> Would you please help us to find the best stable version of
> torque so we can install it in all our production clusters
> to eliminate the problems that could appear because of
> unstable versions?
For Torque, I'd say that the current stable version is still
2.1.10, though hopefully 2.3.1 should be reasonably good too
when it appears in the near future!
The most recent snapshot of Maui was released on the 4th June
and (from my brief scan of the Mauiusers mailing list) included
quite a few contributed patches (not surprising given it was
almost a year since the previous snapshot!).
We don't run Maui, I'm afraid, but you may find that for an
organisation of your size investing in Moab might be better
as you'll get access to a heap of diagnostic tools plus you
can log support calls with the team.
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency