[torqueusers] Recommendation for production clusters

Bu Khamsin, Ahmed M ahmed.bukhamsin at aramco.com
Tue Jul 1 22:28:14 MDT 2008


Hi all,
Could you please help me with this problem?
I upgraded one cluster to Torque 2.1.10 and Maui 3.2.6p19.
Now we have a problem with qrun: whenever we force a job to run, it only gets one node.
Is there a fix for this problem?

Regards,
Ahmed Bukhamsin
ECC, ENOD, CSYS, HPC team
+966 3 873-0852
________________________________
From: Bu Khamsin, Ahmed M
Sent: Sunday, June 29, 2008 11:27 AM
To: 'Chris Samuel'
Cc: torqueusers at supercluster.org
Subject: RE: [torqueusers] Recommendation for production clusters


Thank you for your reply,

Let me start with this problem.

I upgraded one cluster to Torque 2.1.10 and Maui 3.2.6p19. We have a problem with qrun: whenever we force a job to run, it only gets one node. Here are some logs from server_log:



06/29/2008 11:01:10;0040;PBS_Server;Req;set_nodes;allocating nodes for job 993.plcb with node expression 'plcb020+plcb512+plcb511+plcb510+plcb509+plcb508+plcb507+plcb506+plcb505+plcb504'
06/29/2008 11:01:10;0040;PBS_Server;Req;set_nodes;job 993.plcb allocated 10 nodes (nodelist=plcb020/0+plcb512/0+plcb511/0+plcb510/0+plcb509/0+plcb508/0+plcb507/0+plcb506/0+plcb505/0+plcb504/0)
06/29/2008 11:01:10;0040;PBS_Server;Req;free_nodes;freeing nodes for job 993.plcb
06/29/2008 11:01:10;0040;PBS_Server;Req;free_nodes;freeing node plcb020/0 from job 993.plcb (nsnfree=1)
06/29/2008 11:01:10;0040;PBS_Server;Req;free_nodes;freeing node plcb504/0 from job 993.plcb (nsnfree=1)
06/29/2008 11:01:10;0040;PBS_Server;Req;free_nodes;freeing node plcb507/0 from job 993.plcb (nsnfree=1)
06/29/2008 11:01:10;0040;PBS_Server;Req;free_nodes;freeing node plcb508/0 from job 993.plcb (nsnfree=1)
06/29/2008 11:01:10;0040;PBS_Server;Req;free_nodes;freeing node plcb511/0 from job 993.plcb (nsnfree=1)
06/29/2008 11:01:10;0040;PBS_Server;Req;free_nodes;freeing node plcb505/0 from job 993.plcb (nsnfree=1)
06/29/2008 11:01:10;0040;PBS_Server;Req;free_nodes;freeing node plcb506/0 from job 993.plcb (nsnfree=1)
06/29/2008 11:01:10;0040;PBS_Server;Req;free_nodes;freeing node plcb509/0 from job 993.plcb (nsnfree=1)
06/29/2008 11:01:10;0040;PBS_Server;Req;free_nodes;freeing node plcb510/0 from job 993.plcb (nsnfree=1)
06/29/2008 11:01:10;0040;PBS_Server;Req;free_nodes;freeing node plcb512/0 from job 993.plcb (nsnfree=1)
06/29/2008 11:01:11;0040;PBS_Server;Req;set_nodes;allocating nodes for job 993.plcb with node expression 'plcb020+plcb512+plcb511+plcb510+plcb509+plcb508+plcb507+plcb506+plcb505+plcb504'



The above messages repeated 10 times, then I got this:



06/29/2008 11:01:20;0040;PBS_Server;Req;set_nodes;allocating nodes for job 993.plcb with node expression '10'
06/29/2008 11:01:20;0040;PBS_Server;Req;set_nodes;job 993.plcb allocated 1 nodes (nodelist=plcb006/0)
06/29/2008 11:01:49;0040;PBS_Server;Req;free_nodes;freeing nodes for job 993.plcb
06/29/2008 11:01:49;0040;PBS_Server;Req;free_nodes;freeing node plcb006/0 from job 993.plcb (nsnfree=1)
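
As a rough aid for reading logs like the ones above, here is a minimal Python sketch that pulls the "allocated N nodes" records out of server_log lines, so the drop from 10 nodes to 1 is easy to spot. The field layout (six ';'-separated fields) is inferred only from the lines quoted in this message and may not hold for other Torque versions. The names Allocation, ALLOC_RE and parse_allocations are hypothetical and are not part of Torque.

```python
import re
from collections import namedtuple

# Hypothetical record type for one "allocated N nodes" log entry.
Allocation = namedtuple("Allocation", "timestamp job_id count nodes")

# Matches the message text seen in the quoted log, e.g.
# "job 993.plcb allocated 10 nodes (nodelist=plcb020/0+plcb512/0)".
ALLOC_RE = re.compile(r"job (\S+) allocated (\d+) nodes \(nodelist=([^)]*)\)")


def parse_allocations(lines):
    """Yield one Allocation per set_nodes 'allocated' record in the given lines."""
    for line in lines:
        # Assumed layout: timestamp;code;daemon;class;function;message
        fields = line.strip().split(";", 5)
        if len(fields) != 6 or fields[4] != "set_nodes":
            continue
        match = ALLOC_RE.search(fields[5])
        if match is None:
            continue  # e.g. the "allocating nodes for job ..." lines
        job_id, count, nodelist = match.groups()
        # Each nodelist entry looks like "plcb020/0"; keep only the host name.
        nodes = [entry.split("/")[0] for entry in nodelist.split("+") if entry]
        yield Allocation(fields[0], job_id, int(count), nodes)


# Sample lines copied from the log excerpt in this message.
SAMPLE = [
    "06/29/2008 11:01:10;0040;PBS_Server;Req;set_nodes;job 993.plcb allocated 10 nodes "
    "(nodelist=plcb020/0+plcb512/0+plcb511/0+plcb510/0+plcb509/0+plcb508/0"
    "+plcb507/0+plcb506/0+plcb505/0+plcb504/0)",
    "06/29/2008 11:01:20;0040;PBS_Server;Req;set_nodes;allocating nodes for job 993.plcb "
    "with node expression '10'",
    "06/29/2008 11:01:20;0040;PBS_Server;Req;set_nodes;job 993.plcb allocated 1 nodes "
    "(nodelist=plcb006/0)",
]

for alloc in parse_allocations(SAMPLE):
    print(alloc.timestamp, alloc.job_id, alloc.count, ",".join(alloc.nodes))
```

Run against the sample, this reports a 10-node allocation at 11:01:10 and a 1-node allocation (plcb006) at 11:01:20 for job 993.plcb. To scan a real log, pass an open file object to parse_allocations instead of SAMPLE.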



Regards,
Ahmed Bukhamsin
ECC, ENOD, CSYS, HPC team
+966 3 873-0852



-----Original Message-----
From: Chris Samuel [mailto:csamuel at vpac.org]
Sent: Friday, June 27, 2008 4:35 PM
To: Bu Khamsin, Ahmed M
Cc: Linux System Support; Raqa, Hussain A; Shaikh, Raed A; torqueusers at supercluster.org
Subject: Re: [torqueusers] Recommendation for production clusters





----- "Bu Khamsin, Ahmed M" <ahmed.bukhamsin at aramco.com> wrote:



> Hi all,



Hello Ahmed,



> In our data center we are running different versions of torque in our
> clusters with different configurations some with maui and some with
> pbs_sched. We are facing a lot of problems some times jobs are not
> running and some times jobs die.



Hmm, that's not much fun at all!



Some more information would be useful to get a better idea
of what's going on:



0) Which versions of Torque and which versions of Maui are you running?

1) When a job doesn't run, what do qstat -f and checkjob -v say about why it's not running?

2) Can you include the output of tracejob for the job ID that just failed?

3) Can you go into a bit more detail about what happens when a job fails?
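
The output asked for in questions 1 and 2 can be gathered in one pass. This is a minimal Python sketch, assuming qstat, checkjob and tracejob are on the PATH of the host where it runs, and that all three accept the same job ID string. Both are assumptions: Maui may expect the ID in a different form, and tracejob normally needs read access to the server logs. The helper names diagnostic_commands and collect are hypothetical.

```python
import shutil
import subprocess


def diagnostic_commands(job_id):
    """Return the argv lists for the three checks named in the list above."""
    return [
        ["qstat", "-f", job_id],     # full job status from Torque
        ["checkjob", "-v", job_id],  # verbose scheduler view from Maui
        ["tracejob", job_id],        # log records for the job
    ]


def collect(job_id):
    """Run each command if it is installed and return {command line: output}."""
    report = {}
    for argv in diagnostic_commands(job_id):
        label = " ".join(argv)
        if shutil.which(argv[0]) is None:
            report[label] = "(command not found on this host)"
            continue
        result = subprocess.run(argv, capture_output=True, text=True)
        report[label] = result.stdout + result.stderr
    return report


if __name__ == "__main__":
    # The job ID is taken from the log excerpt earlier in the thread.
    for command, output in collect("993.plcb").items():
        print("$ " + command)
        print(output)
```

Capturing stderr together with stdout keeps any error text the tools print, which is often the useful part when a job refuses to run.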



> Would you please help us to find the best stable version of
> torque so we can install it in all our production clusters
> to eliminate the problems that could appears because of
> unstable versions?



For Torque, I'd say that the current stable version is still
2.1.10, though hopefully 2.3.1 should be reasonably good too
when it appears in the near future!



The most recent snapshot of Maui was released on the 4th June
and (from my brief scan of the Mauiusers mailing list) included
quite a few contributed patches (not surprising given it was
almost a year since the previous snapshot!).



http://www.clusterresources.com/downloads/maui/temp/



We don't run Maui, I'm afraid, but you may find that for an
organisation of your size investing in Moab might be better
as you'll get access to a heap of diagnostic tools plus you
can log support calls with the team.



cheers!
Chris

--
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency