From yury at shurup.com Mon Oct 1 11:11:32 2012 From: yury at shurup.com (Yury V. Zaytsev) Date: Mon, 01 Oct 2012 19:11:32 +0200 Subject: [Mauiusers] Maui 3.3.1 segfaults on MPBSNodeUpdate In-Reply-To: References: Message-ID: <1349111492.6035.28.camel@newpride> On Mon, 2012-08-13 at 21:17 +0200, Marco Perosa wrote: > > > I think some size limit of one of the values involved is responsible, > but I'm not sure what would be the right way to avoid this problem. Hi Marco, I have reported exactly the same problem back in May: http://www.supercluster.org/pipermail/mauiusers/2012-May/004913.html Unfortunately, there seemed to be a problem with Maui user list. Only today I started getting e-mail from it that apparently has been queued up for ages. Have you found a solution yet? It's still segfaulting for me :-/ Interestingly, I observed that after I restart TORQUE and then restart Maui again, the segfaults happen much less frequently, but I couldn't identify the reason for it so far... -- Sincerely yours, Yury V. Zaytsev From basv at sara.nl Tue Oct 2 01:26:26 2012 From: basv at sara.nl (Bas van der Vlies) Date: Tue, 2 Oct 2012 09:26:26 +0200 Subject: [Mauiusers] Maui/Torque with node properties In-Reply-To: <00DF894B-F5A2-4F73-BD12-1BFC0C6A562A@grs-sim.de> References: <00DF894B-F5A2-4F73-BD12-1BFC0C6A562A@grs-sim.de> Message-ID: <506A9722.20106@sara.nl> On 09/10/2012 08:40 PM, Suraj Prabhakaran wrote: > Dear all, > > I have 4 nodes with the following properties > > node1 fast > node2 fast > node3 slow > node4 slow > > Traditionally, torque allows to request nodes with different properties by > > qsub -l nodes=1:fast+1:slow > > The above should allocate one fast node and one slow node and this works perfectly fine when pbs_sched is used. > > But when I use maui as my scheduler, I never get the nodes assigned and end up waiting infinitely. > > Is this feature supported in maui? Until now, I haven't read anywhere that this feature is not supported in maui. 
> Or, am I just missing something here? > Suraj, Just received this email today. Did you set this in maui.cfg: {{{ ENABLEMULTIREQJOBS TRUE }}} -- ******************************************************************** * Bas van der Vlies e-mail: basv at sara.nl * * SARA - Academic Computing Services Amsterdam, The Netherlands * ******************************************************************** -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3264 bytes Desc: S/MIME Cryptographic Signature Url : http://www.supercluster.org/pipermail/mauiusers/attachments/20121002/7403ba16/attachment.bin From basv at sara.nl Tue Oct 2 01:29:47 2012 From: basv at sara.nl (Bas van der Vlies) Date: Tue, 2 Oct 2012 09:29:47 +0200 Subject: [Mauiusers] Torque Maui Communication during job submission In-Reply-To: References: Message-ID: <506A97EB.6090905@sara.nl> On 08/16/2012 03:56 PM, Suraj Prabhakaran wrote: > Hello, > > I have been looking into torque and maui communication for some days. I have a question regarding job submission. > During a qsub command, does Maui get the information about the qsub only from torque or does it also get directly from the client? > > Again, any pointers to torque-maui documentation with more descriptions could be very helpful! > Maui gets its information from pbs_server and parses the qstat -f output, see: * src/moab/MPBSI.c -- ******************************************************************** * Bas van der Vlies e-mail: basv at sara.nl * * SARA - Academic Computing Services Amsterdam, The Netherlands * ******************************************************************** -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 3264 bytes Desc: S/MIME Cryptographic Signature Url : http://www.supercluster.org/pipermail/mauiusers/attachments/20121002/bafc6a4e/attachment.bin From basv at sara.nl Tue Oct 2 01:32:25 2012 From: basv at sara.nl (Bas van der Vlies) Date: Tue, 2 Oct 2012 09:32:25 +0200 Subject: [Mauiusers] -l nodes=X isn't honored In-Reply-To: <5012B435.8010709@lgc.com> References: <5012B435.8010709@lgc.com> Message-ID: <506A9889.1050909@sara.nl> On 07/27/2012 05:31 PM, Steve Angelovich wrote: > We are having a problem when users request to run a job on multiple > nodes using a syntax such as > > -l nodes=4:rr:ppn=1 > > Our nodes have 8 processors so the scheduler is running the job on a > single node instead of 4. > > I've enable the option below in maui.cfg file; > > ## specifies whether or not the scheduler will allow jobs to specify > ## multiple independent resource requests > ## (i.e., pbs jobs with resource specifications such as '-l > nodes=3:fast+1:io') > ENABLEMULTIREQJOBS TRUE > > Are there any other settings I need to change to enable this behavior? > > thanks for any help. > > Steve > Steve, Did you set: * JOBNODEMATCHPOLICY EXACTNODE -- ******************************************************************** * Bas van der Vlies e-mail: basv at sara.nl * * SARA - Academic Computing Services Amsterdam, The Netherlands * ******************************************************************** -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3264 bytes Desc: S/MIME Cryptographic Signature Url : http://www.supercluster.org/pipermail/mauiusers/attachments/20121002/93c0a3c0/attachment.bin From basv at sara.nl Tue Oct 2 02:12:10 2012 From: basv at sara.nl (Bas van der Vlies) Date: Tue, 2 Oct 2012 10:12:10 +0200 Subject: [Mauiusers] policy not working as expected? 
In-Reply-To: References: Message-ID: <506AA1DA.4090801@sara.nl> On 03/05/2012 07:48 PM, Brandon Sawyers wrote: > JOBNODEMATCHPOLICY EXACTNODE > NODEACCESSPOLICY SINGLEJOB Brandon, We have the same settings and everything works as expected. I have this line in my maui.cfg: {{{ # # Do not remove this setting else the scheduling goes wrong, # despite what the MAUI manual says BvdV # NODESETPRIORITYTYPE BESTRESOURCE }}} -- ******************************************************************** * Bas van der Vlies e-mail: basv at sara.nl * * SARA - Academic Computing Services Amsterdam, The Netherlands * ******************************************************************** From jpeltier at sfu.ca Tue Oct 2 02:24:34 2012 From: jpeltier at sfu.ca (James A. Peltier) Date: Tue, 2 Oct 2012 01:24:34 -0700 (PDT) Subject: [Mauiusers] Maui/Torque with node properties In-Reply-To: <00DF894B-F5A2-4F73-BD12-1BFC0C6A562A@grs-sim.de> Message-ID: <1699892418.28376862.1349166274496.JavaMail.root@jaguar10.sfu.ca> check out ENABLEMULTIREQJOBS TRUE and JOBNODEMATCHPOLICY EXACTNODE ----- Original Message ----- | Dear all, | | I have 4 nodes with the following properties | | node1 fast | node2 fast | node3 slow | node4 slow | | Traditionally, torque allows to request nodes with different | properties by | | qsub -l nodes=1:fast+1:slow | | The above should allocate one fast node and one slow node and this | works perfectly fine when pbs_sched is used. | | But when I use maui as my scheduler, I never get the nodes assigned | and end up waiting infinitely. | | Is this feature supported in maui? Until now, I haven't read anywhere | that this feature is not supported in maui. | Or, am I just missing something here? 
| | Best, | Suraj | _______________________________________________ | mauiusers mailing list | mauiusers at supercluster.org | http://www.supercluster.org/mailman/listinfo/mauiusers | -- James A. Peltier Manager, IT Services - Research Computing Group Simon Fraser University - Burnaby Campus Phone : 778-782-6573 Fax : 778-782-3045 E-Mail : jpeltier at sfu.ca Website : http://www.sfu.ca/itservices http://blogs.sfu.ca/people/jpeltier Success is to be measured not so much by the position that one has reached in life but as by the obstacles they have overcome. - Booker T. Washington From bdandrus at nps.edu Wed Oct 3 12:47:26 2012 From: bdandrus at nps.edu (Andrus, Brian Contractor) Date: Wed, 3 Oct 2012 18:47:26 +0000 Subject: [Mauiusers] using GPU count in PRIORITYF Message-ID: All, I am currently using: NODEALLOCATIONPOLICY PRIORITY NODECFG[DEFAULT] PRIORITYF='SPEED + .01 * AMEM - 10 * JOBCOUNT' But what I would like to add is something like: NODECFG[DEFAULT] PRIORITYF='SPEED + .01 * AMEM - 10 * JOBCOUNT - 10* GPUS' Only there doesn't seem to be any variable that will tell how many GPUs are on a node that can be used in this context. Or is there? Anyone know of one or a workaround with the same net result? Thanks in advance, Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 From s.prabhakaran at grs-sim.de Wed Oct 3 17:50:16 2012 From: s.prabhakaran at grs-sim.de (Suraj Prabhakaran) Date: Thu, 04 Oct 2012 01:50:16 +0200 Subject: [Mauiusers] Maui/Torque with node properties In-Reply-To: <1699892418.28376862.1349166274496.JavaMail.root@jaguar10.sfu.ca> References: <1699892418.28376862.1349166274496.JavaMail.root@jaguar10.sfu.ca> Message-ID: <43A24DFF-7A40-4C12-95B8-5118C0502A80@grs-sim.de> Thank you! It works! On Oct 2, 2012, at 10:24 AM, James A. 
Peltier wrote: > check out ENABLEMULTIREQJOBS TRUE and JOBNODEMATCHPOLICY EXACTNODE > > ----- Original Message ----- > | Dear all, > | > | I have 4 nodes with the following properties > | > | node1 fast > | node2 fast > | node3 slow > | node4 slow > | > | Traditionally, torque allows to request nodes with different > | properties by > | > | qsub -l nodes=1:fast+1:slow > | > | The above should allocate one fast node and one slow node and this > | works perfectly fine when pbs_sched is used. > | > | But when I use maui as my scheduler, I never get the nodes assigned > | and end up waiting infinitely. > | > | Is this feature supported in maui? Until now, I haven't read anywhere > | that this feature is not supported in maui. > | Or, am I just missing something here? > | > | Best, > | Suraj > | _______________________________________________ > | mauiusers mailing list > | mauiusers at supercluster.org > | http://www.supercluster.org/mailman/listinfo/mauiusers > | > > -- > James A. Peltier > Manager, IT Services - Research Computing Group > Simon Fraser University - Burnaby Campus > Phone : 778-782-6573 > Fax : 778-782-3045 > E-Mail : jpeltier at sfu.ca > Website : http://www.sfu.ca/itservices > http://blogs.sfu.ca/people/jpeltier > > Success is to be measured not so much by the position that one has reached > in life but as by the obstacles they have overcome. - Booker T. Washington -------------------------- Suraj Prabhakaran German Research School for Simulation Sciences GmbH Laboratory for Parallel Programming 52062 Aachen | Germany Tel +49 241 80 99743 Fax +49 241 80 92742 EMail s.prabhakaran at grs-sim.de Web www.grs-sim.de Members: Forschungszentrum Jülich GmbH | RWTH Aachen University Registered in the commercial register of the local court of Düren (Amtsgericht Düren) under registration number HRB 5268 Registered office: Jülich Executive board: Prof. Marek Behr Ph.D. | Dr. 
Norbert Drewes -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121004/df51de1c/attachment.html From basv at sara.nl Thu Oct 4 02:40:27 2012 From: basv at sara.nl (Bas van der Vlies) Date: Thu, 4 Oct 2012 10:40:27 +0200 Subject: [Mauiusers] [BUG] showstats segfaults In-Reply-To: <20120512063220.8c8117b7.bircoph@gmail.com> References: <20120511185204.cb043e72.bircoph@gmail.com> <20120512063220.8c8117b7.bircoph@gmail.com> Message-ID: <506D4B7B.1010100@sara.nl> Andrew, I have applied your patch against the trunk version: checkout: * svn://opensvn.adaptivecomputing.com/maui/trunk On 05/12/2012 04:32 AM, Andrew Savchenko wrote: > Hello, > > On Fri, 11 May 2012 18:52:04 +0400 Andrew Savchenko wrote: >> I use maui-3.3.1. showstats with no arguments or with -s/-v argument >> segfaults, in both cases gdb backtrace is the same: >> >> Program received signal SIGSEGV, Segmentation fault. >> 0x00007ffff7535206 in __rawmemchr_sse2 () from /lib64/libc.so.6 >> (gdb) bt >> #0 0x00007ffff7535206 in __rawmemchr_sse2 () from /lib64/libc.so.6 >> #1 0x00007ffff751f570 in _IO_str_init_static_internal () from /lib64/libc.so.6 >> #2 0x00007ffff750e0f5 in __isoc99_vsscanf () from /lib64/libc.so.6 >> #3 0x00007ffff750e088 in __isoc99_sscanf () from /lib64/libc.so.6 >> #4 0x0000000000406e35 in MCShowSchedulerStatistics ( >> Buffer=0x22fd15f "1336747509 0 101 0 0 0 16 124 514144 16 124 514144 40 40 16946.078789 5.612000 917.944445 252951046507.442596 0.000000 0 0 248.240124 0.112964 0.000000 0 30 0.000000 6 0.016667 0.000000 0 1.075000 2 3"...) at omclient.c:3773 >> #5 0x000000000040fea8 in main (argc=, argv=) at mclient.c:510 >> (gdb) >> >> Maui was compiled with CFLAGS="-march=core2 -O2 -ggdb". > > 1) This bug occurs only when compiled with any non-zero optimization > level: with -O0 it works, with -O1 and higher it fails as above. 
This > is a good hint of some memory misalignment or misuse in the code, > because -O1 optimization level is stable and safe. My compiler is > gcc-4.5.3 and my system is Gentoo entirely built with this compile > with even more aggressive options without any trouble. > > 2) The problem was in bad server reply, which was mishandled by > client's parser. > > In case of normal reply server returns two string in the ARG field > for a CMD=showstat request: > > "CK=1007dd0424073223 TS=1336781650 AUTH=root DT=SC=1 ARG=1336781623\n1336781343 1 27970 0 0 0 16 124 514144 16 124 514144 40 40 18071.720111 5.612000 917.944445 269753236437.359100 0.000000 0 0 2"... > > In a corrupted reply first string was omitted, e.g.: > > "CK=1007dd0424073223 TS=1336781650 AUTH=root DT=SC=1 ARG=1336781343 1 27970 0 0 0 16 124 514144 16 124 514144 40 40 18071.720111 5.612000 917.944445 269753236437.359100 0.000000 0 0 2"... > > I found that problem lies in the moab/MSched.c in the function > MSchedStatToString(): sprintf on line 4104 uses Buf string as both > destination and an argument. This is wrong and must be avoided, > because man sprintf says: > > However, the standards explicitly note that the results are undefined > if source and destination buffers overlap when calling sprintf() > > Attached patch fixes this issue by joining two sprintf calls into a > single one without buffer overlaps. > > Best regards, > Andrew Savchenko > > > > _______________________________________________ > mauiusers mailing list > mauiusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/mauiusers > -- ******************************************************************** * Bas van der Vlies e-mail: basv at sara.nl * * SARA - Academic Computing Services Amsterdam, The Netherlands * ******************************************************************** -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 3264 bytes Desc: S/MIME Cryptographic Signature Url : http://www.supercluster.org/pipermail/mauiusers/attachments/20121004/3411643b/attachment.bin From hibbitts at gmail.com Mon Oct 1 22:45:35 2012 From: hibbitts at gmail.com (David Hibbitts) Date: Tue, 2 Oct 2012 00:45:35 -0400 Subject: [Mauiusers] Multi-Dimensional Throttling Policies not working Message-ID: Hi Everyone, maui client version 3.3.1 and here's a relevant portion of my maui.cfg file: FSPOLICY DEDICATEDPS FSDEPTH 4 FSINTERVAL 0:01:00 FSDECAY 0.01 FSWEIGHT 100000 FSUSERWEIGHT 1 USERCFG[DEFAULT] FSTARGET=90 USERCFG[loveless] FSTARGET=80 USERCFG[ecd4bd] FSTARGET=92 FSCLASSWEIGHT 100 CLASSCFG[long] FSTARGET=100 CLASSCFG[short] FSTARGET=0 # Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html CLASSCFG[long] MAXPROC[USER]=192 Problem is, that last line doesn't appear to be having an effect. For instance, one user is at 300 processors at the moment (all in the long queue) and none of his idle jobs are blocked. Any ideas? Thanks, David Hibbitts -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121002/578d0f47/attachment-0001.html From s.prabhakaran at grs-sim.de Sat Oct 6 17:28:32 2012 From: s.prabhakaran at grs-sim.de (Suraj Prabhakaran) Date: Sun, 07 Oct 2012 01:28:32 +0200 Subject: [Mauiusers] Restricting allocation of certain node types Message-ID: <865CFF4F-9945-4F76-BEA6-29DAF5F3F982@grs-sim.de> Dear all, Is there a way to tell maui not to allocate certain "type" of nodes unless and until it has been asked for? For example, I have four nodes node1 np=4 slow node2 np=4 slow node3 np=4 fast node4 np=4 fast Here, I would like to have maui allocate only the "slow" nodes by default. 
If the slow nodes are not available and a new job with a simple request "-l nodes=1" comes up, it should be queued rather than having a fast node allocated for it. However, of course if the "fast" job is explicitly asked for, then it can be scheduled. That is "-l nodes=1:fast" should be accepted and allocated one of the free fast nodes. Is there a way to do this? Thanks, Suraj From bunk at physik.hu-berlin.de Mon Oct 8 03:03:27 2012 From: bunk at physik.hu-berlin.de (Burkhard Bunk) Date: Mon, 8 Oct 2012 11:03:27 +0200 (CEST) Subject: [Mauiusers] [torqueusers] Restricting allocation of certain node types In-Reply-To: <865CFF4F-9945-4F76-BEA6-29DAF5F3F982@grs-sim.de> References: <865CFF4F-9945-4F76-BEA6-29DAF5F3F982@grs-sim.de> Message-ID: Hi, as a simple solution, you may try configuring the default settings in torque (not maui) with something like resources_default.nodes = slow or resources_default.nodes = 1:slow either at queue level or even for the server as a whole. These can be overridden by qsub options (in contrast to settings of "neednodes", which are mandatory for the users). I haven't tried this with node properties so far, but I know that "resources_default.nodes = 1:ppn=4" provides a default allocation which can be changed by explicit user commands. Regards, Burkhard Bunk. ---------------------------------------------------------------------- bunk at physik.hu-berlin.de Physics Institute, Humboldt University fax: ++49-30 2093 7628 Newtonstr. 15 phone: ++49-30 2093 7980 12489 Berlin, Germany ---------------------------------------------------------------------- On Sun, 7 Oct 2012, Suraj Prabhakaran wrote: > Dear all, > > Is there a way to tell maui not to allocate certain "type" of nodes unless and until it has been asked for? > > For example, I have four nodes > > node1 np=4 slow > node2 np=4 slow > node3 np=4 fast > node4 np=4 fast > > Here, I would like to have maui allocate only the "slow" nodes by default. 
If the slow nodes are not available and a new job with a simple request "-l nodes=1" comes up, it should be queued rather than having a fast node allocated for it. However, of course if the "fast" job is explicitly asked for, then it can be scheduled. That is "-l nodes=1:fast" should be accepted and allocated one of the free fast nodes. > > Is there a way to do this? > > Thanks, > Suraj > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From jbristow at adaptivecomputing.com Mon Oct 8 12:41:40 2012 From: jbristow at adaptivecomputing.com (Jared Bristow) Date: Mon, 8 Oct 2012 12:41:40 -0600 Subject: [Mauiusers] mailinglists are working again Message-ID: Jared Bristow | IT Manager : Adaptive Computing www.adaptivecomputing.com Direct Line: 801-717-3718 | Fax: 801-717-3738 1712 S. East Bay Blvd. Suite #300 Provo, UT 84606 All, We recently moved the network on one of our servers which broke the mail configuration for employees on all the mailing lists. I believe emails were still going out to all list members except for anyone with an adaptivecomputing.com or clusterresources.com email address. This also included list administrators not being notified about posts that needed to be moderated. As of this morning, the issue has been fixed. I also went through and approved any messages that were still awaiting moderator approval. I also updated the "list run by" address on each of the list info pages, and the contact info on the main list overview page: http://www.supercluster.org/mailman/listinfo Sorry for any inconvenience. If you have any more trouble in the future, please contact me at mailinglists at adaptivecomputing.com Jared Bristow | IT Manager : Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121008/bcf29d15/attachment.html From basv at sara.nl Thu Oct 11 00:31:11 2012 From: basv at sara.nl (Bas van der Vlies) Date: Thu, 11 Oct 2012 06:31:11 +0000 Subject: [Mauiusers] [torqueusers] Maui is not submitting jobs to torque In-Reply-To: <5075CCE2.1090300@cern.ch> References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> Message-ID: <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl> On 10 okt. 2012, at 21:30, Alessandra Forti > wrote: node is blocked by reservation sft.0.0 in INFINITY This message informs that there is a reservation on the node. what is the output of showres? -- Bas van der Vlies basv at sara.nl From listsarnau at gmail.com Thu Oct 11 03:17:18 2012 From: listsarnau at gmail.com (Arnau Bria) Date: Thu, 11 Oct 2012 11:17:18 +0200 Subject: [Mauiusers] NODEMEMOVERCOMMITFACTOR and new mem values Message-ID: <20121011111718.5db4f204@amarrosa.pic.es> Hi all, we've specified NODEMEMOVERCOMMITFACTOR 2.0 From Moab doc (I did not find it in maui doc): "The parameter overcommits available and configured memory and swap on a node by the specified factor (for example: mem/swap * factor)." in one of our nodes: # free -m total used free shared buffers cached Mem: 32150 32040 110 0 34 11738 -/+ buffers/cache: 20267 11882 Swap: 32149 57 32092 checknode td713.pic.es checking node td713.pic.es State: Busy (in current state for 00:22:58) Configured Resources: PROCS: 17 MEM: 62G SWAP: 125G DISK: 440G MEM seems ok (32GB*2=64GB more or less those 62) but SWAP? why 125? is it mem+swap*2? Anyone who is using that parameter could confirm if they see same behaviour as me? 
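The reported figures can be checked with a short calculation. The sketch below is only an arithmetic check, not Maui's actual code. It assumes that the resource manager reports swap as total virtual memory (physical memory plus swap space) and that the overcommit factor is applied to that sum; this is a guess that fits the numbers in this message and is not confirmed anywhere in the thread. The helper `to_gib` and the variable names are hypothetical.

```python
# Values copied from the `free -m` output in this message (MiB).
MEM_MB = 32150
SWAP_MB = 32149
FACTOR = 2.0  # NODEMEMOVERCOMMITFACTOR from maui.cfg


def to_gib(mib: float) -> int:
    """Convert MiB to whole GiB, truncating as a coarse report might (assumption)."""
    return int(mib // 1024)


# Physical memory scaled by the factor.
mem_scaled = to_gib(MEM_MB * FACTOR)

# Swap alone scaled by the factor (what one might expect to see).
swap_only = to_gib(SWAP_MB * FACTOR)

# Hypothesis: "swap" is really mem + swap, and the factor applies to the sum.
mem_plus_swap = to_gib((MEM_MB + SWAP_MB) * FACTOR)

print(f"MEM  scaled:          {mem_scaled}G")
print(f"SWAP scaled alone:    {swap_only}G")
print(f"(MEM + SWAP) scaled:  {mem_plus_swap}G")
```

Under that assumption the checknode output is consistent with a factor of 2.0: scaled memory comes to 62G, while 125G appears only when memory and swap are summed before scaling (swap alone would give 62G).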
# rpm -qa|grep maui maui-client-3.3-4.el6.x86_64 maui-devel-3.3-4.el6.x86_64 maui-server-3.3-4.el6.x86_64 TIA, Arnau From basv at sara.nl Thu Oct 11 07:33:24 2012 From: basv at sara.nl (Bas van der Vlies) Date: Thu, 11 Oct 2012 15:33:24 +0200 Subject: [Mauiusers] [torqueusers] Maui is not submitting jobs to torque In-Reply-To: References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> Message-ID: <5076CAA4.6040506@sara.nl> On 10/11/2012 01:36 PM, Jonathan Barber wrote: > On 10 October 2012 20:30, Alessandra Forti wrote: >> Hi, > > [snip] > >> 10/10 13:37:39 MRMCheckEvents() >> 10/10 13:37:39 INFO: no PBS sched socket connections ready >> 10/10 13:37:39 MSUAcceptClient(6,ClientSD,HostName,TCP) >> 10/10 13:37:39 INFO: accept call failed, errno: 11 (Resource temporarily >> unavailable) >> 10/10 13:37:39 INFO: all clients connected. servicing requests >> >> which leaves me perplexed since in other places with a different log level >> it sees the jobs waiting on the server so somehow some communication happens >> and other doesn't > > I see these same messages from maui 3.3.1. It is probably not a > problem for you, but I believe it is a small bug in Maui. > > The problem is that the socket has the flag O_NONBLOCK set. However, > when MSUAcceptClient() calls accept() to see if a client is > connecting, it doesn't take this into account. > > I've attached a patch for the attention of the maintainers. It > quietens the output and works for the 3.3.1 branch and applies cleanly > against the trunk (so it also works there). > > Regards > > Patch applied to maui trunk -- ******************************************************************** * Bas van der Vlies e-mail: basv at sara.nl * * SARA - Academic Computing Services Amsterdam, The Netherlands * ******************************************************************** -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 3264 bytes Desc: S/MIME Cryptographic Signature Url : http://www.supercluster.org/pipermail/mauiusers/attachments/20121011/f8bd2b09/attachment.bin From ianm at uchicago.edu Sat Oct 13 11:55:01 2012 From: ianm at uchicago.edu (Ian Miller) Date: Sat, 13 Oct 2012 17:55:01 +0000 Subject: [Mauiusers] Maui not scheduling jobs correctly Message-ID: <843FE493E7B6CA42A6C4682D63AE2D9502021F91@XM-MBX-02-PROD.ad.uchicago.edu> Maui Users, I have a 36 Node systems with 8 to 12 cores in each node and have users submit up to 1000 jobs at a time. The issue is that the cluster will only run about 90 of them at a time with 85 out of the 344 processors active and only 12 of the nodes being used. The jobs are all going to the default queue and there are no node or cpu/memory parameters added to the submissions. Each job runs for about 20 min. Any ideas on how to get maui to schedule more jobs to the nodes? Torque 2.5.7 Maui 3.3.1 Ian Miller Research Computing Administrator ianm at uchicago.edu (312) 402-6170 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121013/566c84ae/attachment.html From jpeltier at sfu.ca Sat Oct 13 11:57:54 2012 From: jpeltier at sfu.ca (James A. Peltier) Date: Sat, 13 Oct 2012 10:57:54 -0700 (PDT) Subject: [Mauiusers] Maui not scheduling jobs correctly In-Reply-To: <843FE493E7B6CA42A6C4682D63AE2D9502021F91@XM-MBX-02-PROD.ad.uchicago.edu> Message-ID: <2131578995.49001623.1350151074843.JavaMail.root@jaguar10.sfu.ca> ----- Original Message ----- | Maui Users, | I have a 36 Node systems with 8 to 12 cores in each node and have | users submit up to 1000 jobs at a time. The issue is that the | cluster will only run about 90 of them at a time with 85 out of the | 344 processors active and only 12 of the nodes being used. 
The jobs | are all going to the default queue and there are no node or | cpu/memory parameters added to the submissions. | Each job runs for about 20 min. | Any ideas on how to get maui to schedule more jobs to the nodes? | Torque 2.5.7 | Maui 3.3.1 | Ian Miller | Research Computing Administrator | ianm at uchicago.edu | (312) 402-6170 It may help to post your maui.cfg file -- James A. Peltier Manager, IT Services - Research Computing Group Simon Fraser University - Burnaby Campus Phone : 778-782-6573 Fax : 778-782-3045 E-Mail : jpeltier at sfu.ca Website : http://www.sfu.ca/itservices http://blogs.sfu.ca/people/jpeltier Success is to be measured not so much by the position that one has reached in life but as by the obstacles they have overcome. - Booker T. Washington -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121013/99f6686f/attachment-0001.html From cwebber at ucr.edu Sat Oct 13 15:12:20 2012 From: cwebber at ucr.edu (Christopher Webber) Date: Sat, 13 Oct 2012 14:12:20 -0700 Subject: [Mauiusers] Maui not scheduling jobs correctly In-Reply-To: <2131578995.49001623.1350151074843.JavaMail.root@jaguar10.sfu.ca> References: <2131578995.49001623.1350151074843.JavaMail.root@jaguar10.sfu.ca> Message-ID: <711973BB-5A0C-442E-A457-10200E0B7541@ucr.edu> It may also be useful to see what the checkjob command says on the jobs that are not running. Also, you will want to look at your torque config. It is possible there are limits at the torque level. -- cwebber Christopher Webber - Systems Administrator Bioinformatics - University of California, Riverside Twitter: @cwebber Tel: 951.867.7108 http://cwebber.ucr.edu On Oct 13, 2012, at 10:57 AM, "James A. Peltier" wrote: > > Maui Users, > I have a 36 Node systems with 8 to 12 cores in each node and have users submit up to 1000 jobs at a time. 
The issue is that the cluster will only run about 90 of them at a time with 85 out of the 344 processors active and only 12 of the nodes being used. The jobs are all going to the default queue and there are no node or cpu/memory parameters added to the submissions. > Each job runs for about 20 min. > > Any ideas on how to get maui to schedule more jobs to the nodes? > Torque 2.5.7 > Maui 3.3.1 > > > Ian Miller > Research Computing Administrator > ianm at uchicago.edu > (312) 402-6170 > > It may help to post your maui.cfg file > -- > James A. Peltier > Manager, IT Services - Research Computing Group > Simon Fraser University - Burnaby Campus > Phone : 778-782-6573 > Fax : 778-782-3045 > E-Mail : jpeltier at sfu.ca > Website : http://www.sfu.ca/itservices > http://blogs.sfu.ca/people/jpeltier > > Success is to be measured not so much by the position that one has reached > in life but as by the obstacles they have overcome. - Booker T. Washington > _______________________________________________ > mauiusers mailing list > mauiusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/mauiusers -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121013/0f194858/attachment.html From nt_mahmood at yahoo.com Sun Oct 14 02:22:25 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Sun, 14 Oct 2012 01:22:25 -0700 (PDT) Subject: [Mauiusers] [torqueusers] problem with torque 4.1 (Cannot connect to default server) In-Reply-To: <1350068077.15105.YahooMailNeo@web111708.mail.gq1.yahoo.com> References: <1350068077.15105.YahooMailNeo@web111708.mail.gq1.yahoo.com> Message-ID: <1350202945.63678.YahooMailNeo@web111719.mail.gq1.yahoo.com> We stuck at this point. Any tip is welcomed. ? 
Regards, Mahmood ----- Original Message ----- From: Mahmood Naderan To: torque cluster Cc: Sent: Friday, October 12, 2012 8:54 PM Subject: [torqueusers] problem with torque 4.1 (Cannot connect to default server) Dear all, Below is our procedure to configure torque 4.1. However at the end we got an error. 1- compile and install torque 2- put the server hostname (archie) in the /var/spool/torque/server_name 3- put the server hostname in the /var/spool/torque/server_priv/nodes file. This is a shared memory machine so the server and client are the same machine 4- run the command "pbs_server -t create" to set up the pbs_server 5- run the command "qmgr -c 'p s'" and the following error is reported Error communicating with archie(192.160.1.100) Cannot connect to default server host 'archie' - check pbs_server daemon and/or trqauthd. qmgr: cannot connect to server (errno=111) Connection refused 6- starting trqauthd does not change the state and the error is reported again on submitting qmgr -c 'p s' Any comment is appreciated. Regards, Mahmood _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From ianm at uchicago.edu Sun Oct 14 11:29:48 2012 From: ianm at uchicago.edu (Ian Miller) Date: Sun, 14 Oct 2012 17:29:48 +0000 Subject: [Mauiusers] Maui not scheduling jobs correctly In-Reply-To: <2131578995.49001623.1350151074843.JavaMail.root@jaguar10.sfu.ca> Message-ID: <843FE493E7B6CA42A6C4682D63AE2D9502025E86@XM-MBX-02-PROD.ad.uchicago.edu> From: "James A. Peltier" > Date: Saturday, October 13, 2012 12:57 PM To: "mauiusers at supercluster.org" > Subject: Re: [Mauiusers] Maui not scheduling jobs correctly ________________________________ Maui Users, I have a 36 Node systems with 8 to 12 cores in each node and have users submit up to 1000 jobs at a time. 
The issue is that the cluster will only run about 90 of them at a time with 85 out of the 344 processors active and only 12 of the nodes being used. The jobs are all going to the default queue and there are no node or cpu/memory parameters added to the submissions.
Each job runs for about 20 min.

Any ideas on how to get maui to schedule more jobs to the nodes?
Torque 2.5.7
Maui 3.3.1

Ian Miller
Research Computing Administrator
ianm at uchicago.edu
(312) 402-6170

It may help to post your maui.cfg file
--
James A. Peltier
Manager, IT Services - Research Computing Group
Simon Fraser University - Burnaby Campus
Phone : 778-782-6573
Fax : 778-782-3045
E-Mail : jpeltier at sfu.ca
Website : http://www.sfu.ca/itservices
http://blogs.sfu.ca/people/jpeltier

As requested.

Maui.cfg

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF='0.01*AMEM - 2*LOAD'
NODEAVAILABILITYPOLICY COMBINED:MEM

SRCFG[Reinitz] HOSTLIST=minion1[2-9]
SRCFG[Reinitz] GROUPLIST=Reinitz

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test] 17:00:00
# SRDAYS[test] MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test] 0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT] FSTARGET=25.0
# USERCFG[john] PRIORITY=100 FSTARGET=10.0-
# GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch] FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

- Ian Miller
Ianm at uchicago.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121014/1ec58304/attachment.html

From ianm at uchicago.edu Sun Oct 14 14:30:20 2012
From: ianm at uchicago.edu (Ian Miller)
Date: Sun, 14 Oct 2012 20:30:20 +0000
Subject: [Mauiusers] Maui not scheduling jobs correctly
In-Reply-To: <843FE493E7B6CA42A6C4682D63AE2D9502025E86@XM-MBX-02-PROD.ad.uchicago.edu>
Message-ID: <843FE493E7B6CA42A6C4682D63AE2D9502026FA5@XM-MBX-02-PROD.ad.uchicago.edu>

To the list members,
Apologies to the list, my trackpad acted up and I fired off the email before I had put the whole file in.
-I

# maui.cfg 3.3.1

SERVERHOST beast

# primary admin must be first in list

ADMIN1 root

# Resource Manager Definition

RMCFG[BEAST] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank] TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL 00:00:30

SERVERPORT 42559
SERVERMODE NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html

LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT 1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY PSDEDICATED
#FSDEPTH 7
#FSINTERVAL 86400
#FSDECAY 0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF='0.01*AMEM - 2*LOAD'
NODEAVAILABILITYPOLICY COMBINED:MEM

SRCFG[Reinitz] HOSTLIST=minion1[2-9]
SRCFG[Reinitz] GROUPLIST=Reinitz

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations:
http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test] 17:00:00
# SRDAYS[test] MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test] 0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT] FSTARGET=25.0
# USERCFG[john] PRIORITY=100 FSTARGET=10.0-
# GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch] FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

Ian Miller
Research Computing Administrator
ianm at uchicago.edu
(312) 402-6170
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121014/5e266f3d/attachment-0001.html

From wytsang at clustertech.com Mon Oct 15 01:31:33 2012
From: wytsang at clustertech.com (Clotho Tsang)
Date: Mon, 15 Oct 2012 15:31:33 +0800
Subject: [Mauiusers] [torqueusers] problem with torque 4.1 (Cannot connect to default server)
In-Reply-To: <1350202945.63678.YahooMailNeo@web111719.mail.gq1.yahoo.com>
References: <1350068077.15105.YahooMailNeo@web111708.mail.gq1.yahoo.com> <1350202945.63678.YahooMailNeo@web111719.mail.gq1.yahoo.com>
Message-ID:

You need to kill pbs_server with "kill -9". It does not shut down correctly.

On 14 October 2012 16:22, Mahmood Naderan wrote:
> We stuck at this point. Any tip is welcomed.
>
> Regards,
> Mahmood

--
Clotho Tsang
Senior Software Engineer
Cluster Technology Limited
Email: clotho at clustertech.com
Tel: (852) 2655-6129
Fax: (852) 2994-2101
Website: www.clustertech.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121015/005b8acf/attachment.html

From ianm at uchicago.edu Wed Oct 17 09:02:27 2012
From: ianm at uchicago.edu (Ian Miller)
Date: Wed, 17 Oct 2012 15:02:27 +0000
Subject: [Mauiusers] performance issues with maui & torque
Message-ID: <843FE493E7B6CA42A6C4682D63AE2D950202917C@XM-MBX-02-PROD.ad.uchicago.edu>

Hi
I have maui version 3.3.1 and torque version 2.5.7
and I seem to have a few nodes sitting idle that should be running jobs.
They have been able to run jobs in the past but the cluster has never run at 80-90%
The output of showq is as follows (I omitted the jobs lists)

119 Active Jobs 130 of 344 Processors Active (37.79%)
15 of 35 Nodes Active (42.86%)

Total Jobs: 467 Active Jobs: 119 Idle Jobs: 0 Blocked Jobs: 348

When I try to force run a job, I get:

root@beast$ qrun 209054
qrun: Execution server rejected request MSG=cannot send job to mom, state=PRERUN 209054.beast-net

30 out of the 34 worker nodes are in one queue (batch) with 2 out of the 30 shared between another queue. Currently 33 of the total jobs (467) are in a different queue (short) and are running fine, the rest are in the default (batch). My question is how can I get the idle nodes to run these jobs?

What might be the problem?

Qmgr: print queue batch
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_running = 200
set queue batch resources_default.neednodes = batch
set queue batch resources_default.nodes = 1
set queue batch max_user_run = 150
set queue batch keep_completed = 300
set queue batch enabled = True
set queue batch started = True

# maui.cfg 3.3.1

SERVERHOST beast

# primary admin must be first in list

ADMIN1 root

# Resource Manager Definition

RMCFG[BEAST] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank] TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL 00:00:30

SERVERPORT 42559
SERVERMODE NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html

LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT 1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY PSDEDICATED
#FSDEPTH 7
#FSINTERVAL 86400
#FSDECAY 0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF='0.01*AMEM - 2*LOAD'
NODEAVAILABILITYPOLICY COMBINED:MEM

SRCFG[Reinitz] HOSTLIST=minion1[2-9]
SRCFG[Reinitz] GROUPLIST=Reinitz

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test] 17:00:00
# SRDAYS[test] MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test] 0:30:00

# Creds:
http://supercluster.org/mauidocs/6.1fairnessoverview.html

USERCFG[DEFAULT] MAXIJOB=2000
# USERCFG[DEFAULT] FSTARGET=25.0
# USERCFG[john] PRIORITY=100 FSTARGET=10.0-
# GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch] FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

Ian Miller
Research Computing Administrator
ianm at uchicago.edu
(312) 402-6170
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121017/5f19300f/attachment-0001.html

From denismpa at gmail.com Wed Oct 17 12:26:27 2012
From: denismpa at gmail.com (Denis)
Date: Wed, 17 Oct 2012 15:26:27 -0300
Subject: [Mauiusers] performance issues with maui & torque
In-Reply-To: <843FE493E7B6CA42A6C4682D63AE2D950202917C@XM-MBX-02-PROD.ad.uchicago.edu>
References: <843FE493E7B6CA42A6C4682D63AE2D950202917C@XM-MBX-02-PROD.ad.uchicago.edu>
Message-ID:

2012/10/17 Ian Miller :
> Hi
> I have maui version 3.3.1 and torque version 2.5.7
> and I seem to have a few nodes sitting idle that should be running jobs.
> They have been able to run jobs in the past but the cluster has never run at
> 80-90%
> The output of showq is as follows (I omitted the jobs lists)
>
> 119 Active Jobs 130 of 344 Processors Active (37.79%)
>
> 15 of 35 Nodes Active (42.86%)
>
> Total Jobs: 467 Active Jobs: 119 Idle Jobs: 0 Blocked Jobs: 348
>
> When I try to force run a job, I get:
>
> root@beast$ qrun 209054
>
> qrun: Execution server rejected request MSG=cannot send job to mom,
> state=PRERUN 209054.beast-net
>
> 30 out of the 34 worker nodes are in one queue (batch) with 2 out of the 30
> shared between another queue. Currently 33 of the total jobs (467) are in
> a different queue (short) and are running fine, the rest are in the
> default (batch). My question is how can I get the idle nodes to run these
> jobs?
>
> What might be the problem?
>
Try restarting the mom services at the empty nodes.

--
Denis Anjos,
www.versatushpc.com.br

From denismpa at gmail.com Wed Oct 17 15:54:18 2012
From: denismpa at gmail.com (Denis)
Date: Wed, 17 Oct 2012 18:54:18 -0300
Subject: [Mauiusers] performance issues with maui & torque
In-Reply-To: <843FE493E7B6CA42A6C4682D63AE2D950202930B@XM-MBX-02-PROD.ad.uchicago.edu>
References: <843FE493E7B6CA42A6C4682D63AE2D950202930B@XM-MBX-02-PROD.ad.uchicago.edu>
Message-ID:

2012/10/17 Ian Miller :
> Thx
> That was the fix.
>
> Ian Miller
> Research Computing Administrator
> ianm at uchicago.edu
> (312) 402-6170
>

You're very welcome.

D.

--
Denis Anjos,
www.versatushpc.com.br

From rf at q-leap.de Thu Oct 25 09:07:16 2012
From: rf at q-leap.de (rf at q-leap.de)
Date: Thu, 25 Oct 2012 17:07:16 +0200
Subject: [Mauiusers] Maui Redistribution
Message-ID:
<20617.21924.521127.90384@gargle.gargle.HOWL>

Hi,

we would like to redistribute Maui as part of our Ubuntu/Debian-based HPC distribution Qlustar (www.qlustar.com). How can I get the needed permission?

Thanks,

Roland
Q-Leap Networks

From Alessandra.Forti at cern.ch Tue Oct 9 11:04:26 2012
From: Alessandra.Forti at cern.ch (Alessandra Forti)
Date: Tue, 09 Oct 2012 17:04:26 -0000
Subject: [Mauiusers] Maui problem
Message-ID: <50745914.3050601@cern.ch>

Hi,

I have installed a mini test cluster with torque and maui. We have used maui/torque for years on our grid cluster and now we are upgrading to torque 2.5.7 and maui 3.3-4. Unfortunately with this new combination maui doesn't seem to work correctly. When I submit jobs it behaves as if there weren't any free resources. Even when I tried to install only torque and maui with a bare minimum configuration I got the same behaviour, i.e.

1) When I submit, the jobs just remain queued

[root@ maui]# qstat -an1

:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
10.                  aforti   long     pbs-vm3.sh           --    --  --     --    -- Q    --   --
11.s                 aforti   long     pbs-vm3.sh           --    --  --     --    -- Q    --   --

2) If I run qrun the job runs so I assume the problem is not between torque server and torque mom.
3) With showq, the old versions displayed the WCLimit of the default queue; now it displays 0 at first and then changes it by itself to 100 days

[root@ maui]# showq
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME

0 Active Jobs 0 of 16 Processors Active (0.00%)
0 of 1 Nodes Active (0.00%)

IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME

10 aforti Idle 1 99:23:59:59 Tue Oct 9 15:32:13
11 aforti Idle 1 99:23:59:59 Tue Oct 9 16:39:09

2 Idle Jobs

BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME

Total Jobs: 2 Active Jobs: 0 Idle Jobs: 2 Blocked Jobs: 0

4) Checkjob just tells me the job cannot be run in the default partition without any particular reason

[.....]
PE: 1.00 StartPriority: 120
cannot select job 10 for partition DEFAULT (Class)

5) Checknode can see the node free if it wasn't clear from other commands

[root@ maui]# !checkno
checknode

checking node

State: Idle (in current state for 00:55:10)
Configured Resources: PROCS: 16 MEM: 23G SWAP: 31G DISK: 1M
Utilized Resources: SWAP: 202M
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.000
Network: [DEFAULT]
Features: [lcgpro]
Attributes: [Batch]
Classes: [DEFAULT 1:1]

Total Time: 3:06:35 Up: 3:06:24 (99.90%) Active: 00:00:10 (0.09%)

Reservations:
NOTE: no reservations on node

6) When I use showbf -v though it says my nodes are blocked by reservations despite checknode clearly telling me there are no reservations on that node.
In our local maui.cfg there is a reservation for 1 proc I'm not sure why it blocks the whole node

[root@ server_logs]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct 9 17:08:59

3 procs available with no timelimit

node is blocked by reservation sft.0.0 in INFINITY

But to be sure I removed it and even when I remove the reservation and reduce the maui.cfg to the default version without anything in it it tells me the node is blocked by "reservation NONE in INFINITY"

[root@ maui]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct 9 17:37:58

16 procs available with no timelimit

node is blocked by reservation NONE in INFINITY

I'm not sure how to proceed because the log files don't tell me anything and all the references I have found to a similar problem have remained unanswered.

Thanks for any help here are the rpms I used

maui-3.3-4.el5
maui-client-3.3-4.el5
maui-server-3.3-4.el5
torque-2.5.7-7.el5
torque-client-2.5.7-7.el5
torque-server-2.5.7-7.el5
libtorque-2.5.7-7.el5

the maui.cfg

#
# MAUI configuration example
# @(#)maui.cfg David Groep 20031015.1
# for MAUI version 3.2.5
#
SERVERHOST
ADMIN1 root
ADMINHOST
RMTYPE[0] PBS
RMHOST[0]
RMSERVER[0]

SERVERPORT 40559
SERVERMODE NORMAL

# Set PBS server polling interval. Since we have many short jobs
# and want fast turn-around, set this to 10 seconds (default: 2 minutes)
RMPOLLINTERVAL 00:00:10

# a max.
10 MByte log file in a logical location
LOGFILE /var/log/maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

and Torque config

create queue long
set queue long queue_type = Execution
set queue long acl_hosts = localhost
set queue long acl_hosts +=
set queue long resources_max.cput = 48:00:00
set queue long resources_max.walltime = 72:00:00
set queue long acl_group_enable = True
set queue long acl_groups = aforti
set queue long enabled = True
set queue long started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts =
set server acl_hosts += localhost
set server default_queue = long
set server log_events = 511
set server mail_from = adm
set server next_job_number = 12

--
Facts aren't facts if they come from the wrong people. (Paul Krugman)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121009/de788aae/attachment-0001.html

From Alessandra.Forti at cern.ch Wed Oct 10 06:57:40 2012
From: Alessandra.Forti at cern.ch (Alessandra Forti)
Date: Wed, 10 Oct 2012 12:57:40 -0000
Subject: [Mauiusers] Maui problem
In-Reply-To: <50745914.3050601@cern.ch>
References: <50745914.3050601@cern.ch>
Message-ID: <507570BF.6030506@cern.ch>

Further information: if I increase the maui loglevel to 9 I get hundreds of these messages

10/10 13:37:39 MRMCheckEvents()
10/10 13:37:39 INFO: no PBS sched socket connections ready
10/10 13:37:39 MSUAcceptClient(6,ClientSD,HostName,TCP)
10/10 13:37:39 INFO: accept call failed, errno: 11 (Resource temporarily unavailable)
10/10 13:37:39 INFO: all clients connected.
servicing requests

if I reduce it to 8 I get on top of other stuff

10/10 13:55:11 MJobCheckLimits(2,HARD,P,8,Message)
10/10 13:55:11 INFO: job 2 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/10 13:55:11 INFO: total jobs selected in partition DEFAULT: 0/1 [Class: 1]

cheers
alessandra

--
Facts aren't facts if they come from the wrong people. (Paul Krugman)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121010/951d65b0/attachment-0001.html

From Alessandra.Forti at cern.ch Wed Oct 10 13:30:49 2012
From: Alessandra.Forti at cern.ch (Alessandra Forti)
Date: Wed, 10 Oct 2012 19:30:49 -0000
Subject: [Mauiusers] Maui is not submitting jobs to torque
In-Reply-To: <50745914.3050601@cern.ch>
References: <50745914.3050601@cern.ch>
Message-ID: <5075CCE2.1090300@cern.ch>

Hi,

I have installed a mini test cluster with torque and maui. We have used maui/torque for years on our grid cluster and now we are upgrading to torque 2.5.7 and maui 3.3-4. Unfortunately with this new combination maui doesn't seem to work correctly. When I submit jobs it behaves as if there weren't any free resources. Even when I tried to install only torque and maui with a bare minimum configuration I got the same behaviour, i.e.

1) When I submit, the jobs just remain queued

[root@ maui]# qstat -an1

:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
10.                  aforti   long     pbs-vm3.sh           --    --  --     --    -- Q    --  --
11.s                 aforti   long     pbs-vm3.sh           --    --  --     --    -- Q    --  --

2) If I run qrun the job runs, so I assume the problem is not between torque server and torque mom.

3) showq on the old versions displayed the WCLimit of the default queue; now it displays 0 at first and then changes it by itself to 100 days

showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME


     0 Active Jobs       0 of   16 Processors Active (0.00%)
                         0 of    1 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

2                    aforti       Idle     1 99:23:59:59  Wed Oct 10 13:36:34
3                    aforti       Idle     1 99:23:59:59  Wed Oct 10 14:01:43
4                    aforti       Idle     1 99:23:59:59  Wed Oct 10 18:50:14
5                    aforti       Idle     1    00:00:00  Wed Oct 10 20:29:27

4 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 4   Active Jobs: 0   Idle Jobs: 4   Blocked Jobs: 0


Total Jobs: 2   Active Jobs: 0   Idle Jobs: 2   Blocked Jobs: 0

4) Checkjob just tells me the job cannot be run in the default partition, without giving any particular reason

[.....]
PE: 1.00 StartPriority: 120
cannot select job 10 for partition DEFAULT (Class)

5) Checknode sees the node as free, in case that wasn't clear from the other commands

[root@ maui]# !checkno
checknode

checking node

State: Idle (in current state for 00:55:10)
Configured Resources: PROCS: 16 MEM: 23G SWAP: 31G DISK: 1M
Utilized Resources: SWAP: 202M
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.000
Network: [DEFAULT]
Features: [lcgpro]
Attributes: [Batch]
Classes: [DEFAULT 1:1]

Total Time: 3:06:35 Up: 3:06:24 (99.90%) Active: 00:00:10 (0.09%)

Reservations:
NOTE: no reservations on node

6) When I use showbf -v, though, it says my nodes are blocked by reservations, despite checknode clearly telling me there are no reservations on that node. In our local maui.cfg there is a reservation for 1 proc; I'm not sure why it blocks the whole node

[root@ server_logs]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct 9 17:08:59

3 procs available with no timelimit

node is blocked by reservation sft.0.0 in INFINITY

But to be sure I removed it, and even when I remove the reservation and reduce the maui.cfg to the default version without anything in it, it tells me the node is blocked by "reservation NONE in INFINITY"

[root@ maui]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct 9 17:37:58

16 procs available with no timelimit

node is blocked by reservation NONE in INFINITY

If I increase the maui loglevel to 9 I get hundreds of these messages

10/10 13:37:39 MRMCheckEvents()
10/10 13:37:39 INFO: no PBS sched socket connections ready
10/10 13:37:39 MSUAcceptClient(6,ClientSD,HostName,TCP)
10/10 13:37:39 INFO: accept call failed, errno: 11 (Resource temporarily unavailable)
10/10 13:37:39 INFO: all clients connected. servicing requests

which leaves me perplexed, since in other places with a different log level it sees the jobs waiting on the server, so somehow some communication happens and some doesn't

10/10 20:27:24 INFO: job '2' Priority: 410
10/10 20:27:24 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 410(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
10/10 20:27:24 INFO: job '2' priority: 410.30
10/10 20:27:24 MJobGetStartPriority(3,0,Priority,NULL)
10/10 20:27:24 INFO: job '3' Priority: 385
10/10 20:27:24 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 385(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
10/10 20:27:24 INFO: job '3' priority: 385.30
10/10 20:27:24 MJobGetStartPriority(4,0,Priority,NULL)
10/10 20:27:24 INFO: job '4' Priority: 97
10/10 20:27:24 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 97(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
10/10 20:27:24 INFO: job '4' priority: 97.17

Thanks for any help. Here are the rpms I used

maui-3.3-4.el5
maui-client-3.3-4.el5
maui-server-3.3-4.el5
torque-2.5.7-7.el5
torque-client-2.5.7-7.el5
torque-server-2.5.7-7.el5
libtorque-2.5.7-7.el5

the maui.cfg

#
# MAUI configuration example
# @(#)maui.cfg David Groep 20031015.1
# for MAUI version 3.2.5
#
SERVERHOST
ADMIN1 root
ADMINHOST
RMTYPE[0] PBS
RMHOST[0]
RMSERVER[0]

SERVERPORT 40559
SERVERMODE NORMAL

# Set PBS server polling interval. Since we have many short jobs
# and want fast turn-around, set this to 10 seconds (default: 2 minutes)
RMPOLLINTERVAL 00:00:10

# a max. 10 MByte log file in a logical location
LOGFILE /var/log/maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

and Torque config

create queue long
set queue long queue_type = Execution
set queue long acl_hosts = localhost
set queue long acl_hosts +=
set queue long resources_max.cput = 48:00:00
set queue long resources_max.walltime = 72:00:00
set queue long acl_group_enable = True
set queue long acl_groups = aforti
set queue long enabled = True
set queue long started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts =
set server acl_hosts += localhost
set server default_queue = long
set server log_events = 511
set server mail_from = adm
set server next_job_number = 12

--
Facts aren't facts if they come from the wrong people. (Paul Krugman)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121010/8e5bc7cc/attachment-0001.html

From Alessandra.Forti at cern.ch Thu Oct 11 14:22:09 2012
From: Alessandra.Forti at cern.ch (Alessandra Forti)
Date: Thu, 11 Oct 2012 20:22:09 -0000
Subject: [Mauiusers] [torqueusers] Maui is not submitting jobs to torque
In-Reply-To: <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl>
References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl>
Message-ID: <50772A66.2090503@cern.ch>

Hi,

yes, in one case there is a reservation and here is the output of that test installation

showres
Reservations

ReservationID       Type S       Start         End    Duration    N/P    StartTime

sft.0.0             User - -1:12:07:14    INFINITY    INFINITY    1/1    Wed Oct 10 09:01:02

1 reservation located

but in the second installation, where I removed everything non-essential, there is no reservation and it still doesn't submit.

showres
Reservations

ReservationID       Type S       Start         End    Duration    N/P    StartTime


0 reservations located

I've now done the installation 3 times, progressively simplifying, and it always gives the same result. Here is another interesting snapshot of the log files where it tells me that the classes are not supported. It looks like a miscommunication with the pbs server, but all the ports are open and some information passes through.

10/11 21:02:04 MLocalCheckFairnessPolicy(26,1349985724,Message)
10/11 21:02:04 INFO: job '26' added to queue at slot 2
10/11 21:02:04 INFO: total jobs selected in partition ALL: 3/3
10/11 21:02:04 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
10/11 21:02:04 MLocalCheckFairnessPolicy(NULL,1349985724,Message)
10/11 21:02:04 INFO: checking job[0] '24'
10/11 21:02:04 MJobCheckLimits(24,SOFT,P,8,Message)
10/11 21:02:04 INFO: job 24 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/11 21:02:04 INFO: checking job[1] '25'
10/11 21:02:04 MJobCheckLimits(25,SOFT,P,8,Message)
10/11 21:02:04 INFO: job 25 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/11 21:02:04 INFO: checking job[2] '26'
10/11 21:02:04 MJobCheckLimits(26,SOFT,P,8,Message)
10/11 21:02:04 INFO: job 26 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/11 21:02:04 INFO: total jobs selected in partition DEFAULT: 0/3 [Class: 3]
10/11 21:02:04 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
10/11 21:02:04 MLocalCheckFairnessPolicy(NULL,1349985724,Message)
10/11 21:02:04 INFO: checking job[0] '24'
10/11 21:02:04 MJobCheckLimits(24,SOFT,P,8,Message)
10/11 21:02:04 INFO: job 24 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/11 21:02:04 INFO: checking job[1] '25'
10/11 21:02:04 MJobCheckLimits(25,SOFT,P,8,Message)
10/11 21:02:04 INFO: job 25 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/11 21:02:04 INFO: checking job[2] '26'
10/11 21:02:04 MJobCheckLimits(26,SOFT,P,8,Message)
10/11 21:02:04 INFO: job 26 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/11 21:02:04 INFO: total jobs selected in partition DEFAULT: 0/3 [Class: 3]

thanks for any help.

cheers
alessandra

PS I know there is an elephant in the room, I just can't see it. :(

On 11/10/2012 07:31, Bas van der Vlies wrote:
> On 10 Oct 2012, at 21:30, Alessandra Forti wrote:
> > node is blocked by reservation sft.0.0 in INFINITY
>
> This message says that there is a reservation on the node. What is the output of showres?
>
> --
> Bas van der Vlies
> basv at sara.nl
>
>
>
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers

--
Facts aren't facts if they come from the wrong people. (Paul Krugman)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/mauiusers/attachments/20121011/9ba9387f/attachment.html

From delphine.ramalingom at univ-reunion.fr Thu Oct 25 22:08:34 2012
From: delphine.ramalingom at univ-reunion.fr (Delphine Ramalingom)
Date: Fri, 26 Oct 2012 08:08:34 +0400
Subject: [Mauiusers] cannot accept client on PBS sched socket
Message-ID: <508A0CC2.8000505@univ-reunion.fr>

Hello,

My question is about a message in the file maui.log:

ALERT: cannot accept client on PBS sched socket

Do you know why I get this message and how to get rid of it?
My version of maui is 3.3.1

Thanks a lot,
Delphine

From mario.kadastik at cern.ch Wed Oct 31 12:28:47 2012
From: mario.kadastik at cern.ch (Mario Kadastik)
Date: Wed, 31 Oct 2012 20:28:47 +0200
Subject: [Mauiusers] maui not scheduling on all nodes
Message-ID: <66BD0F75-6C55-4128-8873-1BC507C91CB0@cern.ch>

Hi,

I'm having trouble with the maui from the EMI-1 repository. Namely, it tends to schedule only up to a certain number of jobs and then doesn't schedule more, even though there are free slots. The maui log shows that it tries to schedule jobs, but fails to make reservations:

10/31 19:49:45 INFO: 162 PBS resources detected on RM base
10/31 19:49:45 INFO: resources detected: 162
10/31 19:49:45 MPBSWorkloadQuery(base,JCount,SC)
10/31 19:50:06 INFO: processing node request line '1'
10/31 19:50:06 INFO: job '1046246' loaded: 1 cms225 cms 259200 Idle 0 1351705749 [NONE] [NONE] [NONE] >= 0 >= 0 [longqueue] 1351705785
10/31 19:50:06 INFO: processing node request line '1'
10/31 19:50:06 INFO: job '1046247' loaded: 1 cms225 cms 259200 Idle 0 1351705750 [NONE] [NONE] [NONE] >= 0 >= 0 [longqueue] 1351705785
10/31 19:50:06 INFO: processing node request line '1'
10/31 19:50:06 INFO: job '1046248' loaded: 1 cms225 cms 259200 Idle 0 1351705752 [NONE] [NONE] [NONE] >= 0 >= 0 [longqueue] 1351705785
10/31 19:50:06 INFO: processing node request line '1'
10/31 19:50:06 INFO: job '1046249' loaded: 1 cms225 cms 259200 Idle 0 1351705756 [NONE] [NONE] [NONE] >= 0 >= 0 [longqueue] 1351705785
10/31 19:50:06 INFO: processing node request line '1'
10/31 19:50:06 INFO: job '1046250' loaded: 1 cms225 cms 259200 Idle 0 1351705770 [NONE] [NONE] [NONE] >= 0 >= 0 [longqueue] 1351705785
10/31 19:50:06 INFO: active PBS job 1041018 has been removed from the queue. assuming successful completion
10/31 19:50:06 INFO: active PBS job 1041187 has been removed from the queue. assuming successful completion
10/31 19:50:06 INFO: active PBS job 1044863 has been removed from the queue. assuming successful completion
10/31 19:50:06 INFO: active PBS job 1044890 has been removed from the queue. assuming successful completion
10/31 19:50:06 INFO: active PBS job 1044916 has been removed from the queue. assuming successful completion
10/31 19:50:06 INFO: active PBS job 1045212 has been removed from the queue. assuming successful completion
10/31 19:50:06 INFO: 4982 PBS jobs detected on RM base
10/31 19:50:06 INFO: jobs detected: 4982
10/31 19:50:07 INFO: total jobs selected (ALL): 848/4982 [State: 4134]
10/31 19:50:07 INFO: total jobs selected (ALL): 848/4982 [State: 4134]
10/31 19:50:07 INFO: total jobs selected in partition ALL: 848/848
10/31 19:50:07 INFO: total jobs selected in partition ALL: 848/848
10/31 19:50:07 INFO: total jobs selected in partition DEFAULT: 848/848
10/31 19:50:07 MRMJobStart(1045241,Msg,SC)
10/31 19:50:07 MPBSJobStart(1045241,base,Msg,SC)
10/31 19:50:07 MPBSJobModify(1045241,Resource_List,Resource,wn-v-4196.local)
10/31 19:50:07 MPBSJobModify(1045241,Resource_List,Resource,1)
10/31 19:50:07 INFO: job '1045241' successfully started
10/31 19:50:07 MRMJobStart(1045242,Msg,SC)
10/31 19:50:07 MPBSJobStart(1045242,base,Msg,SC)
10/31 19:50:07 MPBSJobModify(1045242,Resource_List,Resource,wn-v-6068.local)
10/31 19:50:07 MPBSJobModify(1045242,Resource_List,Resource,1)
10/31 19:50:07 INFO: job '1045242' successfully started
10/31 19:50:07 ERROR: cannot create reservation for job '1045242'
10/31 19:50:07 ERROR: cannot start job '1045242' in partition DEFAULT
10/31 19:50:07 MJobPReserve(1045242,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045243,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045244,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045245,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045247,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045246,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve

The queues show this:

[root@torque-v-1 log]# qstat -q

server: torque-v-1.local

Queue            Memory CPU Time Walltime Node   Run   Que Lm  State
---------------- ------ -------- -------- ---- ----- ----- --  -----
test               --   01:00:00 02:00:00   --     0     0 --   E R
long               --   48:00:00 72:00:00   --  4101   974 --   E R
short              --   01:00:00 02:00:00   --     2     0 --   E R
                                               ----- -----
                                                4103   974
[root@torque-v-1 log]#

There are free slots however:

[root@torque-v-1 log]# diagnose -t DEFAULT [test 5427:5427]

All slots are configured for the short and long queues (why they don't show up in diagnose -t is beyond me, but ...).

Ideas are welcome. I've seen scheduling get stuck at around 3500-3700 running jobs; now, after a maintenance downtime where the job count reached 0, this number seems to be around 4100-4300 jobs. I saw 4930 running jobs a while ago, but that hasn't been possible recently.

The maui is:

[root@torque-v-1 log]# rpm -qa|grep maui
maui-3.2.6p21-snap.1234905291.5.el5
maui-client-3.2.6p21-snap.1234905291.5.el5
maui-server-3.2.6p21-snap.1234905291.5.el5

PS! If you received this twice, sorry ... I wasn't sure my original mail got through...

Thanks in advance,

Mario Kadastik, PhD
Researcher

---
"Physics is like sex, sure it may have practical reasons, but that's not why we do it"
-- Richard P. Feynman
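
With thousands of jobs in the queue, it can help to tally the ERROR and ALERT lines in maui.log before reading them one by one. The following is only a rough Python sketch under stated assumptions: the `tally` helper and the regular expression are made up for this illustration (they are not part of Maui), the set of level tags (INFO, ERROR, ALERT, WARNING) is assumed from the excerpts in this thread, and the sample lines are copied from the log quoted above.

```python
import re
from collections import Counter

# Assumed line shape, based on the excerpts in this thread:
#   10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (INFO|ERROR|ALERT|WARNING):\s+(.*)$"
)


def tally(lines):
    """Count non-INFO (level, message) pairs, masking quoted job ids."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            # Function-trace lines such as MJobPReserve(...) carry no level tag.
            continue
        level, msg = m.group(3), m.group(4)
        if level == "INFO":
            continue
        # Replace quoted numeric job ids so identical failures group together.
        msg = re.sub(r"'\d+'", "'<job>'", msg)
        counts[(level, msg)] += 1
    return counts


# Sample lines taken from the log excerpt in this thread.
sample = """\
10/31 19:50:07 INFO: job '1045242' successfully started
10/31 19:50:07 ERROR: cannot create reservation for job '1045242'
10/31 19:50:07 ERROR: cannot start job '1045242' in partition DEFAULT
10/31 19:50:07 MJobPReserve(1045242,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045243,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
""".splitlines()

for (level, msg), n in tally(sample).most_common():
    print(f"{n:5d}  {level:7s}  {msg}")
```

To run it against a real log, pass an open file instead of `sample`, for example `tally(open("/var/log/maui.log"))`; that path is the LOGFILE value from the maui.cfg quoted earlier in the thread and may differ on other installations.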