From basv at sara.nl Mon Oct 1 03:43:09 2012
From: basv at sara.nl (Bas van der Vlies)
Date: Mon, 1 Oct 2012 11:43:09 +0200
Subject: [torqueusers] svn server unreachable
Message-ID: <506965AD.3090605@sara.nl>

To whom it may concern:

{{{
root# svn ls svn://opensvn.adaptivecomputing.com/torque
svn: Can't connect to host 'opensvn.adaptivecomputing.com': Network is unreachable
}}}

Another problem is that I cannot use the search functionality. This server is also down:
 * http://www.clusterresources.com/pipermail/torquedev/2012-April/004044.html

regards

--
********************************************************************
* Bas van der Vlies                     e-mail: basv at sara.nl      *
* SARA - Academic Computing Services    Amsterdam, The Netherlands *
********************************************************************

From basv at sara.nl Mon Oct 1 07:56:32 2012
From: basv at sara.nl (Bas van der Vlies)
Date: Mon, 1 Oct 2012 15:56:32 +0200
Subject: [torqueusers] Torque 4.X status
Message-ID: <5069A110.60202@sara.nl>

I am in the process of installing Torque 4.X on our test cluster, and I am following the trunk branch. I know it is bleeding edge.

What I want to know is how reliable the server communication is in Torque 4.X. I have now installed the trunk version (revision 6869) and cannot connect to the server with the Torque commands. I tried a lot of options and even recreated the server database, but no luck. I had the same problem a couple of weeks ago, and after some svn updates the problem went away. So I suppose that there are a lot of changes in the server/client communication area.

error message from trqauth:
{{{
Can not send close message to pbs_server!! (socket #5)
Conn to gb-r7n1 port 15001 Fail.
Conn 56961 not authorized (dm = 8, Err Num 15033)
Error (9-Bad file descriptor) writing 15 bytes to socket (write_socket) data [+2+22+591+4root]
Can not send close message to pbs_server!! (socket #5)
}}}

error message from pbsnodes:
{{{
15:55 gb-r7n1.irc.sara.nl:/root
root# pbsnodes
Error code - 15033 : message [Batch protocol error]
parse_daemon_response error
Error communicating with gb-r7n1.irc.sara.nl(192.168.145.17)
Communication failure.
pbsnodes: cannot connect to server gb-r7n1.irc.sara.nl, error=15096 (Error getting connection to socket)
}}}

--
********************************************************************
* Bas van der Vlies                     e-mail: basv at sara.nl      *
* SARA - Academic Computing Services    Amsterdam, The Netherlands *
********************************************************************

From listsarnau at gmail.com Mon Oct 1 08:24:45 2012
From: listsarnau at gmail.com (Arnau Bria)
Date: Mon, 1 Oct 2012 16:24:45 +0200
Subject: [torqueusers] health check
In-Reply-To: <20120827205411.GN30193@lbl.gov>
References: <20120827205411.GN30193@lbl.gov>
Message-ID: <20121001162445.7a43ca24@amarrosa.pic.es>

On Mon, 27 Aug 2012 13:54:13 -0700
Michael Jennings wrote:

Hi Michael,

[...]
> And for those who are already using it, I know I've been quiet, but
> the new release will be out very soon with some great new features!
> :-)

Do you have a release date for the new NHC? We'd like to start using a health check and your project looks great, but if a new release is coming soon, we could wait...
> HTH,
> Michael

TIA,
Arnau

From mej at lbl.gov Mon Oct 1 11:26:33 2012
From: mej at lbl.gov (Michael Jennings)
Date: Mon, 1 Oct 2012 10:26:33 -0700
Subject: [torqueusers] health check
In-Reply-To: <20121001162445.7a43ca24@amarrosa.pic.es>
References: <20120827205411.GN30193@lbl.gov> <20121001162445.7a43ca24@amarrosa.pic.es>
Message-ID: <20121001172632.GK8827@lbl.gov>

On Monday, 01 October 2012, at 16:24:45 (+0200),
Arnau Bria wrote:

> Do you have a release date for new NHC ? We'd like to start using a
> health check and your projects looks great, but if a new release is
> coming soon, we could wait ...

I haven't gotten any reports of bugs or problems with the beta, and it's working fine on our systems, so my plan is to update the documentation and release it ASAP (hopefully in the next 2-3 days).

But unless something changes, it will be exactly the same as the beta, so feel free to give it a try if you'd like. :-)

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From samuel at unimelb.edu.au Mon Oct 1 17:16:01 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 02 Oct 2012 09:16:01 +1000
Subject: [torqueusers] svn server unreachable
In-Reply-To: <506965AD.3090605@sara.nl>
References: <506965AD.3090605@sara.nl>
Message-ID: <506A2431.6090009@unimelb.edu.au>

On 01/10/12 19:43, Bas van der Vlies wrote:

> svn: Can't connect to host 'opensvn.adaptivecomputing.com': Network
> is unreachable

Seems to be up now, my "git svn rebase" just worked and pulled changes OK.
--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci

From samuel at unimelb.edu.au Tue Oct 2 00:08:46 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 02 Oct 2012 16:08:46 +1000
Subject: [torqueusers] changing torque variables (PBS_O_*) BEFORE job is executed but AFTER submission
In-Reply-To: <5056130E.5030406@cern.ch>
References: <5056130E.5030406@cern.ch>
Message-ID: <506A84EE.2030804@unimelb.edu.au>

On 17/09/12 03:57, Adrian Sevcenco wrote:

> Hi! I want to change the PBS_O_WORKDIR BEFORE a job is executed
> but after submission by GRID middleware ... is it possible and
> how?

How about a submit filter?

www.clusterresources.com/torquedocs21/a.jqsubwrapper.shtml

Don't write it as a shell script as their example does; that will be a world of pain. Use Perl or Python instead.
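
As an illustration of the submit filter idea, here is a minimal sketch in Python. It assumes the behaviour described in the qsub wrapper documentation linked above: the filter receives the job script on standard input, whatever it writes to standard output is what gets submitted, and a non-zero exit status makes qsub reject the job. The directive string, the path in it, and the function names are hypothetical; which `#PBS` option actually influences PBS_O_WORKDIR should be checked against the qsub man page for your version.

```python
import sys


def add_directive(lines, directive):
    """Return the script lines with `directive` placed after an optional shebang.

    `directive` is a hypothetical example such as "#PBS -w /some/dir";
    it is an assumption for illustration, not something TORQUE mandates.
    """
    lines = list(lines)
    if directive in (line.rstrip("\n") for line in lines):
        return lines  # already present, leave the script untouched
    insert_at = 1 if lines and lines[0].startswith("#!") else 0
    return lines[:insert_at] + [directive + "\n"] + lines[insert_at:]


def filter_stream(src, dst, directive):
    """Copy a job script from `src` to `dst`, injecting one extra directive."""
    dst.writelines(add_directive(src.readlines(), directive))
    return 0  # exit status the filter should report on success
```

A real filter script would end with something like `sys.exit(filter_stream(sys.stdin, sys.stdout, "#PBS -w /some/dir"))`. Keeping the rewriting logic in a plain function makes it easy to test without going through qsub.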
cheers,
Chris

--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci

From samuel at unimelb.edu.au Tue Oct 2 00:10:40 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 02 Oct 2012 16:10:40 +1000
Subject: [torqueusers] spreading load over nodes
In-Reply-To: <74C3EAAEAFC2E746A6BDC0F215CABBD704AD2F@wbm-mail.bmt-wbm.local>
References: <74C3EAAEAFC2E746A6BDC0F215CABBD704AD2F@wbm-mail.bmt-wbm.local>
Message-ID: <506A8560.1050700@unimelb.edu.au>

On 13/09/12 08:57, Rob Holmes wrote:

> Hi all,

Hi Rob,

> We have a small cluster (running torque & pbs_sched) that is only
> occasionally fully utilized. Most of the time there are not enough
> jobs to fill all the nodes.
>
> When a job is submitted it always goes to the first available node
> in the node list, resulting in "node01" getting significantly more
> work than "node14". I'm keen to spread the workload more evenly
> across each node. Is there a way to get torque to pick a free node
> at random, rather than the first free node on the list?

I'd suggest looking at using Maui instead of pbs_sched. The one you're using is a pretty simple one that (as you've seen) tends to just deal with the nodes in order.

Maui is much more flexible. The downside is that Maui is much more flexible.. :-)

Good luck!
Chris

--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci

From samuel at unimelb.edu.au Tue Oct 2 00:16:43 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 02 Oct 2012 16:16:43 +1000
Subject: [torqueusers] inspecting running jobs
In-Reply-To: <1346860210.93300.YahooMailNeo@web111701.mail.gq1.yahoo.com>
References: <1346860210.93300.YahooMailNeo@web111701.mail.gq1.yahoo.com>
Message-ID: <506A86CB.9030102@unimelb.edu.au>

Hi Mahmood,

On 06/09/12 01:50, Mahmood Naderan wrote:

> Assume I have 20 running jobs. When I use "top" command, I see
> that there is one process (which is one of my jobs) that uses a lot
> of memory. How can I find which job number it is?

One option would be to set up the system to use cpusets; then you could see from /proc/$PID/cpuset what the cpuset is, and that will tell you the job ID for it.

Hope this helps!
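
The cpuset lookup can be sketched in a few lines of Python. This is only an illustration under an assumption: that the MOM places each job in a cpuset named after the job ID under a `/torque` root, so that /proc/$PID/cpuset reads something like `/torque/1234.server`. The root name and the function names are hypothetical, so check what your own nodes actually report.

```python
def job_id_from_cpuset(cpuset_path, root="torque"):
    """Extract the job ID from the contents of /proc/<pid>/cpuset.

    Assumes a layout of /<root>/<jobid>[/...]; returns None when the
    path is not below the assumed root.
    """
    parts = cpuset_path.strip().split("/")
    # "/torque/1234.server" splits into ["", "torque", "1234.server"]
    if len(parts) < 3 or parts[1] != root or not parts[2]:
        return None
    return parts[2]


def job_id_for_pid(pid):
    """Read /proc/<pid>/cpuset (Linux only) and map it to a job ID."""
    with open("/proc/%d/cpuset" % pid) as handle:
        return job_id_from_cpuset(handle.read())
```

With this in place, the PID shown by top can be passed to `job_id_for_pid` on the compute node where the process runs.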
Chris

--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci

From Rob.Holmes at bmtwbm.com.au Tue Oct 2 00:32:20 2012
From: Rob.Holmes at bmtwbm.com.au (Rob Holmes)
Date: Tue, 2 Oct 2012 06:32:20 +0000
Subject: [torqueusers] spreading load over nodes
In-Reply-To: <506A8560.1050700@unimelb.edu.au>
References: <74C3EAAEAFC2E746A6BDC0F215CABBD704AD2F@wbm-mail.bmt-wbm.local> <506A8560.1050700@unimelb.edu.au>
Message-ID: <74C3EAAEAFC2E746A6BDC0F215CABBD704D2D3@wbm-mail.bmt-wbm.local>

Yes, I assumed from the lack of replies until now that there was no way to do it in pbs_sched. Sounds like it's time to bite the bullet and have a go at getting Maui up and running.

Thanks Chris

BMT WBM Pty Ltd
Level 8, 200 Creek Street
Brisbane QLD 4000 Australia
P: +61 7 3831 6744
W: www.bmtwbm.com.au
From jpeltier at sfu.ca Tue Oct 2 02:24:34 2012
From: jpeltier at sfu.ca (James A.
Peltier)
Date: Tue, 2 Oct 2012 01:24:34 -0700 (PDT)
Subject: [torqueusers] [Mauiusers] Maui/Torque with node properties
In-Reply-To: <00DF894B-F5A2-4F73-BD12-1BFC0C6A562A@grs-sim.de>
Message-ID: <1699892418.28376862.1349166274496.JavaMail.root@jaguar10.sfu.ca>

check out

ENABLEMULTIREQJOBS TRUE

and

JOBNODEMATCHPOLICY EXACTNODE

----- Original Message -----
| Dear all,
|
| I have 4 nodes with the following properties
|
| node1 fast
| node2 fast
| node3 slow
| node4 slow
|
| Traditionally, torque allows to request nodes with different
| properties by
|
| qsub -l nodes=1:fast+1:slow
|
| The above should allocate one fast node and one slow node and this
| works perfectly fine when pbs_sched is used.
|
| But when I use maui as my scheduler, I never get the nodes assigned
| and end up waiting infinitely.
|
| Is this feature supported in maui? Until now, I haven't read anywhere
| that this feature is not supported in maui.
| Or, am I just missing something here?
|
| Best,
| Suraj
| _______________________________________________
| mauiusers mailing list
| mauiusers at supercluster.org
| http://www.supercluster.org/mailman/listinfo/mauiusers
|

--
James A. Peltier
Manager, IT Services - Research Computing Group
Simon Fraser University - Burnaby Campus
Phone   : 778-782-6573
Fax     : 778-782-3045
E-Mail  : jpeltier at sfu.ca
Website : http://www.sfu.ca/itservices
          http://blogs.sfu.ca/people/jpeltier

Success is to be measured not so much by the position that one has reached
in life but as by the obstacles they have overcome. - Booker T.
Washington

From listsarnau at gmail.com Tue Oct 2 03:01:04 2012
From: listsarnau at gmail.com (Arnau Bria)
Date: Tue, 2 Oct 2012 11:01:04 +0200
Subject: [torqueusers] health check
In-Reply-To: <20121001172632.GK8827@lbl.gov>
References: <20120827205411.GN30193@lbl.gov> <20121001162445.7a43ca24@amarrosa.pic.es> <20121001172632.GK8827@lbl.gov>
Message-ID: <20121002110104.69061598@amarrosa.pic.es>

On Mon, 1 Oct 2012 10:26:33 -0700
Michael Jennings wrote:

[...]
> But unless something changes, it will be exactly the same as the beta,
> so feel free to give it a try if you'd like. :-)

Will do. Could you please give me the link?

http://warewulf.lbl.gov/downloads/releases/rhel5/
I've only found 1.1.4-1

> Michael

Arnau

From basv at sara.nl Tue Oct 2 03:25:10 2012
From: basv at sara.nl (Bas van der Vlies)
Date: Tue, 2 Oct 2012 11:25:10 +0200
Subject: [torqueusers] Torque 4.X status
In-Reply-To: <5069A110.60202@sara.nl>
References: <5069A110.60202@sara.nl>
Message-ID: <506AB2F6.1030205@sara.nl>

On 10/01/2012 03:56 PM, Bas van der Vlies wrote:
> I am in the process installing torque 4.X on our test cluster and i am following the trunk branch. I know it is bleeding edge.
>
> What i want to know is how reliable is the server communication for the Torque 4.X version. I know installed the trunk version
> (revision 6869) and can not connect to the server with torque commands. I tried a lot of options and even recreate the server
> database but no luck. I had the same problem a couple of weeks ago and after some svn updates the problem went away. So i suppose
> that there are a lot of changes in the server/client communication area.
>
> error message from trqauth:
> {{{
> Can not send close message to pbs_server!! (socket #5)
> Conn to gb-r7n1 port 15001 Fail. Conn 56961 not authorized (dm = 8, Err Num 15033)
> Error (9-Bad file descriptor) writing 15 bytes to socket (write_socket) data [+2+22+591+4root]
> Can not send close message to pbs_server!!
(socket #5)
> }}}
>
> error message pbsnodes:
> {{{
> 15:55 gb-r7n1.irc.sara.nl:/root
> root# pbsnodes
> Error code - 15033 : message [Batch protocol error]
> parse_daemon_response error
> Error communicating with gb-r7n1.irc.sara.nl(192.168.145.17)
> Communication failure.
> pbsnodes: cannot connect to server gb-r7n1.irc.sara.nl, error=15096 (Error getting connection to socket)
> }}}
>
>

Just updated the trunk to revision 6873 and everything works as expected ;-)

--
********************************************************************
* Bas van der Vlies                     e-mail: basv at sara.nl      *
* SARA - Academic Computing Services    Amsterdam, The Netherlands *
********************************************************************

From knielson at adaptivecomputing.com Tue Oct 2 08:56:33 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Tue, 2 Oct 2012 08:56:33 -0600
Subject: [torqueusers] Just a test
Message-ID:

Just a test to see if we are up.

From mej at lbl.gov Tue Oct 2 12:09:20 2012
From: mej at lbl.gov (Michael Jennings)
Date: Tue, 2 Oct 2012 11:09:20 -0700
Subject: [torqueusers] health check
In-Reply-To: <20121002110104.69061598@amarrosa.pic.es>
References: <20120827205411.GN30193@lbl.gov> <20121001162445.7a43ca24@amarrosa.pic.es> <20121001172632.GK8827@lbl.gov> <20121002110104.69061598@amarrosa.pic.es>
Message-ID: <20121002180912.GR8827@lbl.gov>

On Tuesday, 02 October 2012, at 11:01:04 (+0200),
Arnau Bria wrote:

> Will do. Could you please give me the link?
>
> http://warewulf.lbl.gov/downloads/releases/rhel5/
> I've only found 1.1.4-1

The tarball for the beta is available here:
http://warewulf.lbl.gov/downloads/beta/warewulf-nhc-1.2beta1.tar.gz

I didn't post any RPMs for the beta, but you can create them from the tarball using this command:

    rpmbuild -ta warewulf-nhc-1.2beta1.tar.gz

If you'd prefer to wait, I'll have "official" RPMs in the next few days as part of the release. :-)

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From bdandrus at nps.edu Wed Oct 3 12:47:26 2012
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Wed, 3 Oct 2012 18:47:26 +0000
Subject: [torqueusers] using GPU count in PRIORITYF
Message-ID:

All,

I am currently using:

NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF='SPEED + .01 * AMEM - 10 * JOBCOUNT'

But what I would like to add is something like:

NODECFG[DEFAULT] PRIORITYF='SPEED + .01 * AMEM - 10 * JOBCOUNT - 10 * GPUS'

Only there doesn't seem to be any variable that will tell how many GPUs are on a node that can be used in this context. Or is there? Anyone know of one or a workaround with the same net result?

Thanks in advance,

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

From mej at lbl.gov Wed Oct 3 17:14:03 2012
From: mej at lbl.gov (Michael Jennings)
Date: Wed, 3 Oct 2012 16:14:03 -0700
Subject: [torqueusers] ANNOUNCE: Warewulf Node Health Check 1.2 Released
Message-ID: <20121003231402.GK8827@lbl.gov>

Just wanted to let everyone know that NHC 1.2 is now officially out. No changes from the beta. We're running it in production with TORQUE 4.1.x with excellent results.
:-)

Release announcement:
https://groups.google.com/a/lbl.gov/forum/?fromgroups=#!topic/warewulf/ZQcld9bHdaQ

Download site:
http://warewulf.lbl.gov/downloads/releases/warewulf-nhc/

Updated documentation:
http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check

YUM Repository Files:
http://warewulf.lbl.gov/downloads/repo/warewulf-rhel5.repo
http://warewulf.lbl.gov/downloads/repo/warewulf-rhel6.repo

Let me know if you have any questions/concerns/problems!

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From s.prabhakaran at grs-sim.de Wed Oct 3 17:50:16 2012
From: s.prabhakaran at grs-sim.de (Suraj Prabhakaran)
Date: Thu, 04 Oct 2012 01:50:16 +0200
Subject: [torqueusers] [Mauiusers] Maui/Torque with node properties
In-Reply-To: <1699892418.28376862.1349166274496.JavaMail.root@jaguar10.sfu.ca>
References: <1699892418.28376862.1349166274496.JavaMail.root@jaguar10.sfu.ca>
Message-ID: <43A24DFF-7A40-4C12-95B8-5118C0502A80@grs-sim.de>

Thank you! It works!

On Oct 2, 2012, at 10:24 AM, James A. Peltier wrote:

> check out ENABLEMULTIREQJOBS TRUE and JOBNODEMATCHPOLICY EXACTNODE
>
> ----- Original Message -----
> | Dear all,
> |
> | I have 4 nodes with the following properties
> |
> | node1 fast
> | node2 fast
> | node3 slow
> | node4 slow
> |
> | Traditionally, torque allows to request nodes with different
> | properties by
> |
> | qsub -l nodes=1:fast+1:slow
> |
> | The above should allocate one fast node and one slow node and this
> | works perfectly fine when pbs_sched is used.
> |
> | But when I use maui as my scheduler, I never get the nodes assigned
> | and end up waiting infinitely.
> |
> | Is this feature supported in maui? Until now, I haven't read anywhere
> | that this feature is not supported in maui.
> | Or, am I just missing something here?
> |
> | Best,
> | Suraj
> | _______________________________________________
> | mauiusers mailing list
> | mauiusers at supercluster.org
> | http://www.supercluster.org/mailman/listinfo/mauiusers
> |
>
> --
> James A. Peltier
> Manager, IT Services - Research Computing Group
> Simon Fraser University - Burnaby Campus
> Phone : 778-782-6573
> Fax : 778-782-3045
> E-Mail : jpeltier at sfu.ca
> Website : http://www.sfu.ca/itservices
> http://blogs.sfu.ca/people/jpeltier
>
> Success is to be measured not so much by the position that one has reached
> in life but as by the obstacles they have overcome. - Booker T. Washington

--------------------------
Suraj Prabhakaran

German Research School for Simulation Sciences GmbH
Laboratory for Parallel Programming
52062 Aachen | Germany

Tel +49 241 80 99743
Fax +49 241 80 92742
EMail s.prabhakaran at grs-sim.de
Web www.grs-sim.de

Members: Forschungszentrum Jülich GmbH | RWTH Aachen University
Registered in the commercial register of the local court of Düren (Amtsgericht Düren) under registration number HRB 5268
Registered office: Jülich
Executive board: Prof. Marek Behr Ph.D. | Dr. Norbert Drewes

From s.prabhakaran at grs-sim.de Sat Oct 6 17:28:32 2012
From: s.prabhakaran at grs-sim.de (Suraj Prabhakaran)
Date: Sun, 07 Oct 2012 01:28:32 +0200
Subject: [torqueusers] Restricting allocation of certain node types
Message-ID: <865CFF4F-9945-4F76-BEA6-29DAF5F3F982@grs-sim.de>

Dear all,

Is there a way to tell Maui not to allocate a certain "type" of node unless it has been explicitly asked for? For example, I have four nodes

node1 np=4 slow
node2 np=4 slow
node3 np=4 fast
node4 np=4 fast

Here, I would like to have Maui allocate only the "slow" nodes by default.
If the slow nodes are not available and a new job with a simple request "-l nodes=1" comes up, it should be queued rather than having a fast node allocated for it. However, if a "fast" node is explicitly asked for, then the job can be scheduled. That is, "-l nodes=1:fast" should be accepted and allocated one of the free fast nodes.

Is there a way to do this?

Thanks,
Suraj

From samuel at unimelb.edu.au Sun Oct 7 20:52:07 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 08 Oct 2012 13:52:07 +1100
Subject: [torqueusers] Restricting allocation of certain node types
In-Reply-To: <865CFF4F-9945-4F76-BEA6-29DAF5F3F982@grs-sim.de>
References: <865CFF4F-9945-4F76-BEA6-29DAF5F3F982@grs-sim.de>
Message-ID: <50723FD7.4010402@unimelb.edu.au>

On 07/10/12 10:28, Suraj Prabhakaran wrote:

> Here, I would like to have maui allocate only the "slow" nodes by
> default. If the slow nodes are not available and a new job with a
> simple request "-l nodes=1" comes up, it should be queued rather
> than having a fast node allocated for it. However, of course if the
> "fast" job is explicitly asked for, then it can be scheduled. That
> is "-l nodes=1:fast" should be accepted and allocated one of the
> free fast nodes.
>
> Is there a way to do this?

I suspect that you would need to do that with a standing reservation (SRCFG), setting ACLs such that only jobs that are in a special queue or request a special resource are granted access.

cheers!
Chris

--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci

From d-ulrick at comcast.net Mon Oct 8 14:26:31 2012
From: d-ulrick at comcast.net (Dave Ulrick)
Date: Mon, 8 Oct 2012 15:26:31 -0500 (CDT)
Subject: [torqueusers] Cleaning up stray processes from defunct jobs
In-Reply-To: <1348782004.15740.165.camel@browncoat.jics.utk.edu>
References: <1348782004.15740.165.camel@browncoat.jics.utk.edu>
Message-ID:

On Thu, 27 Sep 2012, Troy Baer wrote:

> On Thu, 2012-09-27 at 16:27 -0500, Dave Ulrick wrote:
>> On occasion I see a user run an MPI job via TORQUE that doesn't shut down
>> cleanly and as a result leaves running processes behind to interfere with
>> subsequent jobs that are assigned to its nodes. Any suggestions on how I
>> might go about simplifying the task of finding and killing these
>> processes?
>
> I would recommend running something like reaver [1] in your
> epilogue.parallel on each node.
>
> [1] http://svn.nics.tennessee.edu/repos/pbstools/trunk/sbin/reaver
>
> --Troy

I've deployed reaver to my compute nodes and have run some test jobs. It appears that TORQUE runs 'epilogue' on the job head node and 'epilogue.parallel' on the sister nodes, so I've got both scripts set up to run reaver. I don't have a job at hand that will create stray processes, so I'll just wait and see what reaver does the next time such a job runs.
Thanks,
Dave

--
Dave Ulrick
d-ulrick at comcast.net

From tbaer at utk.edu Mon Oct 8 14:32:45 2012
From: tbaer at utk.edu (Troy Baer)
Date: Mon, 8 Oct 2012 16:32:45 -0400
Subject: [torqueusers] Cleaning up stray processes from defunct jobs
In-Reply-To:
References: <1348782004.15740.165.camel@browncoat.jics.utk.edu>
Message-ID: <1349728365.15740.485.camel@browncoat.jics.utk.edu>

On Mon, 2012-10-08 at 15:26 -0500, Dave Ulrick wrote:
> On Thu, 27 Sep 2012, Troy Baer wrote:
> > On Thu, 2012-09-27 at 16:27 -0500, Dave Ulrick wrote:
> >> On occasion I see a user run an MPI job via TORQUE that doesn't shut down
> >> cleanly and as a result leaves running processes behind to interfere with
> >> subsequent jobs that are assigned to its nodes. Any suggestions on how I
> >> might go about simplifying the task of finding and killing these
> >> processes?
> >
> > I would recommend running something like reaver [1] in your
> > epilogue.parallel on each node.
> >
> > [1] http://svn.nics.tennessee.edu/repos/pbstools/trunk/sbin/reaver
> >
> > --Troy
>
> I've deployed reaver to my compute nodes and have run some test jobs. It
> appears that TORQUE runs 'epilogue' on the job head node and
> 'epilogue.parallel' on the sister nodes so I've got both scripts set up to
> run reaver. I don't have a job at hand that will create stray processes so
> I'll just wait and see what reaver does the next time such a job runs.

Be aware that reaver doesn't kill processes unless you specifically tell it to do so with the -k option. I would recommend running in the default identification-only mode for a while until you're sure that it's consistently identifying processes that need to be killed.
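
To make the identification-only idea concrete, here is a small Python sketch. It is not reaver's actual logic, just one plausible heuristic: report processes owned by non-system users who have no job on the node. The `min_uid` threshold, the function names, and the way `users_with_jobs` is obtained are all assumptions for illustration.

```python
import os
import pwd


def find_strays(processes, users_with_jobs, min_uid=1000):
    """Return (pid, user) pairs for processes that look like strays.

    `processes` is an iterable of (pid, uid, username) tuples.
    `users_with_jobs` is the set of users with a job on this node; how it is
    obtained is site specific and deliberately left out of this sketch.
    """
    return [
        (pid, user)
        for pid, uid, user in processes
        if uid >= min_uid and user not in users_with_jobs
    ]


def list_processes(proc_root="/proc"):
    """Yield (pid, uid, username) for each process visible under /proc (Linux)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            # The owner of /proc/<pid> is normally the user running the process.
            uid = os.stat(os.path.join(proc_root, entry)).st_uid
        except OSError:
            continue  # the process exited while we were scanning
        try:
            user = pwd.getpwuid(uid).pw_name
        except KeyError:
            user = str(uid)  # no passwd entry, fall back to the numeric uid
        yield int(entry), uid, user
```

Nothing here kills anything; `find_strays(list_processes(), users_with_jobs)` only returns candidates, which matches the advice to watch the reports for a while before enabling any kill option.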
--Troy

--
Troy Baer, Senior HPC System Administrator
National Institute for Computational Sciences, University of Tennessee
http://www.nics.tennessee.edu/
Phone: 865-241-4233

From jonathan.barber at gmail.com Wed Oct 3 09:18:45 2012
From: jonathan.barber at gmail.com (Jonathan Barber)
Date: Wed, 3 Oct 2012 16:18:45 +0100
Subject: [torqueusers] Fwd: Torque 4.1.2: MUNGE vs trqauthd
In-Reply-To:
References:
Message-ID:

I'm resending this because it didn't appear to hit the mailing list first time round. Apologies if I'm mistaken and it did.

---------- Forwarded message ----------
From: Jonathan Barber
Date: 11 September 2012 22:48
Subject: Torque 4.1.2: MUNGE vs trqauthd
To: torqueusers at supercluster.org

I'm looking at torque 4.1.2 and trying to work out what the difference is between MUNGE and trqauthd. According to the fine documentation here:
http://www.adaptivecomputing.com/resources/docs/torque/4-0/Content/topics/1-installConfig/configuringTrqauthdForClientCom.htm
and here:
http://www.adaptivecomputing.com/resources/docs/torque/4-0/Content/topics/1-installConfig/serverConfig.htm#usingMUNGEAuth
they both seem to have the purpose of confirming the user's identity to the Torque server, and are exclusive options (according to the function pbs_original_connect in pbsD_connect.c).

Am I right in thinking this? If so, which one should I be using and why?

Regards
--
Jonathan Barber

--
Jonathan Barber

From Paul.Marshall at Colorado.EDU Thu Oct 4 09:39:55 2012
From: Paul.Marshall at Colorado.EDU (Paul D Marshall)
Date: Thu, 4 Oct 2012 09:39:55 -0600
Subject: [torqueusers] problems with server and mom communication
Message-ID: <453B3867-1FFE-43FB-86F2-37383B3F4B49@colorado.edu>

Hello,

Torque is having trouble marking jobs that complete as done on the server (they simply stick in the running state). I can submit the jobs fine, they appear to run successfully and the pbs_mom notes their termination and then attempts to notify the server.
However, at that point pbs_mom hits this error:

pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting,

I've run into this problem with Torque 2.5.12, 3.0.6, 4.0.2, and 4.1.2 and somewhat basic/default configurations. I've tried increasing the various timeouts as well as mom_job_sync, none of which seems to help. I did a bit more digging in 2.5.12 and it appears to fail on the bind in client_to_srv in src/lib/Libnet/net_client.c, claiming that the address is already in use, despite the fact that it should try different ports (from what I can tell).

Has anyone else run into this and/or have suggestions? I believe I have hostname resolution set up appropriately, but it's possible something is off slightly (hostname resolution issues are the most I've been able to gather from the internets as to what might be at the root of pbs_server/mom communication problems). In this setup the pbs_server and pbs_mom are on different networks; latency is on the order of tens of ms instead of sub-ms.

Thanks,
Paul

From giuseppe.grieco at gmail.com Wed Oct 3 06:32:59 2012
From: giuseppe.grieco at gmail.com (Giuseppe Grieco)
Date: Wed, 3 Oct 2012 14:32:59 +0200
Subject: [torqueusers] my node is down
Message-ID:

Hi all,

I installed torque 4.1.0. The installation went fine, but I cannot launch any job. When I run pbs_server and then pbs_mom, pbsnodes -a gives the following output:

applied_spectroscopy
     state = down
     np = 6
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

The machine where I installed torque is not a cluster.
It is equipped with 6 CPU and the files server_priv/nodes and mom_priv/config are the following server_priv/nodes applied_spectroscopy np=6 mom_priv/config $pbsserver applied_spectroscopy $logevent 255 In log file server_log/20121003 I have the following error message 10/03/2012 14:23:22;0006;PBS_Server.17957;Svr;PBS_Server;Server applied_spectroscopy started, initialization type = 1 10/03/2012 14:23:22;0002;PBS_Server.17957;Svr;get_default_threads;Defaulting min_threads to 25 threads 10/03/2012 14:23:22;0002;PBS_Server.17957;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20121003 opened 10/03/2012 14:23:22;0040;PBS_Server.17957;Req;setup_nodes;setup_nodes() 10/03/2012 14:23:22;0086;PBS_Server.17957;Svr;PBS_Server;Recovered queue batch 10/03/2012 14:23:22;0002;PBS_Server.17957;Svr;PBS_Server;Expected 1, recovered 1 queues 10/03/2012 14:23:22;0080;PBS_Server.17957;Svr;PBS_Server;2 total files read from disk 10/03/2012 14:23:22;0002;PBS_Server.17957;Svr;PBS_Server;handle_job_recovery:3 10/03/2012 14:23:22;0006;PBS_Server.17957;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'applied_spectroscopy') 10/03/2012 14:23:22;0002;PBS_Server.17957;Svr;PBS_Server;Server Ready, pid = 17957, loglevel=0 10/03/2012 14:23:22;0001;PBS_Server.17960;Svr;PBS_Server;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 127.0.1.1:15003] 10/03/2012 14:23:22;0001;PBS_Server.17960;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host applied_spectroscopy:15003 10/03/2012 14:23:37;0002;PBS_Server.17961;Svr;PBS_Server;Torque Server Version = 4.1.2, loglevel = 0 10/03/2012 14:23:42;0001;PBS_Server.17960;Svr;PBS_Server;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 127.0.1.1:15003] 10/03/2012 
14:23:42;0001;PBS_Server.17960;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host applied_spectroscopy:15003 10/03/2012 14:24:02;0001;PBS_Server.17960;Svr;PBS_Server;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 127.0.1.1:15003] 10/03/2012 14:24:02;0001;PBS_Server.17960;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host applied_spectroscopy:15003 10/03/2012 14:24:22;0001;PBS_Server.17960;Svr;PBS_Server;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 127.0.1.1:15003] 10/03/2012 14:24:22;0001;PBS_Server.17960;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host applied_spectroscopy:15003 10/03/2012 14:28:37;0002;PBS_Server.17961;Svr;PBS_Server;Torque Server Version = 4.1.2, loglevel = 0 It seems I have some problems in the configuration process but I cannot understand what. Can anyone help me? Thanks in advance, Giuseppe -- Dr. Giuseppe Grieco Post Doc School of Engineering University of Basilicata Tel. 00390971205158 From Edric.Ellis at mathworks.co.uk Wed Oct 3 09:08:10 2012 From: Edric.Ellis at mathworks.co.uk (Edric Ellis) Date: Wed, 3 Oct 2012 15:08:10 +0000 Subject: [torqueusers] Torque versions and the "-v" argument Message-ID: <9F83CE06F097B84881BCB0F9C477CF44466BB893@exmb-01-uk.ad.mathworks.com> Hi there torqueusers, I'm trying to write some scripts that can be used with multiple versions of Torque, and I'm having some trouble getting uniform support for "-v". What I have that works nicely with Torque 2.5.5 is to write a qsub line like this: qsub -v VARNAME1,VARNAME2 ... We've now got a Torque 4.1.2 cluster here, and that syntax doesn't appear to work (even though it is documented to work in the qsub man page). 
For example: $ echo env | FOO1=foo1 FOO2=foo2 qsub -v FOO1,FOO2 # time passes $ grep FOO STDIN.o401 # no matches Whereas this syntax does work: $ echo env | FOO1=foo1 FOO2=foo2 qsub -v FOO1 -v FOO2 # time passes $ grep FOO STDIN.o402 FOO1=foo1 FOO2=foo2 Unfortunately, that second syntax doesn't work with Torque 2.5.5. The only syntax that appears to work correctly in both versions is this one: qsub -v FOO1=foo1,FOO2=foo2 However I'd rather not do that if possible since that means firstly that the command line can get very long, and also that I'd have to worry about the contents of the environment variables (in case they need quoting etc.). Is the behaviour of Torque 4.1.2 in the "-v VARNAME1,VARNAME2" variant a bug? Cheers, Edric. From bunk at physik.hu-berlin.de Mon Oct 8 03:03:27 2012 From: bunk at physik.hu-berlin.de (Burkhard Bunk) Date: Mon, 8 Oct 2012 11:03:27 +0200 (CEST) Subject: [torqueusers] Restricting allocation of certain node types In-Reply-To: <865CFF4F-9945-4F76-BEA6-29DAF5F3F982@grs-sim.de> References: <865CFF4F-9945-4F76-BEA6-29DAF5F3F982@grs-sim.de> Message-ID: Hi, as a simple solution, you may try configure the default settings in torque (not maui) with something like resources_default.nodes = slow or resources_default.nodes = 1:slow either at queue level or even for the server as a whole. These can be overriden by qsub options (in contrast to settings of "neednodes", which are mandatory for the users). I haven't tried this with node properties so far, but I know that "resources_default.nodes = 1:ppn=4" provides a default allocation which can be changed by explicit user commands. Regards, Burkhard Bunk. ---------------------------------------------------------------------- bunk at physik.hu-berlin.de Physics Institute, Humboldt University fax: ++49-30 2093 7628 Newtonstr. 
15
phone: ++49-30 2093 7980               12489 Berlin, Germany
----------------------------------------------------------------------

On Sun, 7 Oct 2012, Suraj Prabhakaran wrote:

> Dear all,
>
> Is there a way to tell maui not to allocate certain "type" of nodes unless and until it has been asked for?
>
> For example, I have four nodes
>
> node1 np=4 slow
> node2 np=4 slow
> node3 np=4 fast
> node4 np=4 fast
>
> Here, I would like to have maui allocate only the "slow" nodes by default. If the slow nodes are not available and a new job with a simple request "-l nodes=1" comes up, it should be queued rather than having a fast node allocated for it. However, of course if the "fast" job is explicitly asked for, then it can be scheduled. That is "-l nodes=1:fast" should be accepted and allocated one of the free fast nodes.
>
> Is there a way to do this?
>
> Thanks,
> Suraj
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From ablenzo at hotmail.com Tue Oct 9 08:19:47 2012
From: ablenzo at hotmail.com (Ablen)
Date: Tue, 9 Oct 2012 14:19:47 +0000 (UTC)
Subject: [torqueusers] Torque Jobs Stay Queued
Message-ID:

Hello friends,

I am working to install Torque on a FC16 Linux Cluster. So far I have added only what I need for it to run on the master node - and I think I have done everything correctly. When I submit a job, however, it shows up as being in the queued state - and won't run. I think there must be a minor step I've missed. Below are the steps I've taken to set up torque as well as the sample job I am trying. Could someone please let me know what I may still need to do in order for this job to run? All comments appreciated.

Many thanks.
ablen

1 - Log into server as root
2 - Edit the /etc/hosts file and change the first line so it looks like this:

    127.0.0.1 mysrv localhost.localdomain localhost

3 - yum install openssl-devel
3 - yum install libxml2-devel
4 - yum -y install 'torque*'
5 - pbs_server -t create
6 - systemctl start pbs_{mom,server,sched}.service
7 - systemctl enable pbs_{mom,server,sched}.service
8 - /usr/local/sbin/trqauthd start
9 - pbs_server
10 - vi /var/spool/torque/server_name and also vi /etc/torque/server_name

    change server name to mysrv if needed

11 - vi /var/spool/torque/mom_priv/config and vi /etc/torque/mom/config
    add/modify this line:

    $pbsserver mysrv

12 - vi /var/spool/torque/server_priv/nodes (create this file) and add all nodes in the cluster like this (np for number of processors - VERIFY THAT THESE ARE 4 processors ea).

    mysrv np=4
    node2 np=4
    node3 np=4
    ...

13 - vi /etc/sysconfig/network (and make sure that HOSTNAME is set as follows):

    HOSTNAME=mysrv

14 - Append these lines to the /etc/profile file (for bash)
    PBS_DEFAULT=mysrv
    export PBS_DEFAULT

    Append these lines to the /etc/bashrc file (also for bash)
    PBS_DEFAULT=mysrv
    export PBS_DEFAULT

15 - execute all of the following commands:

    qmgr -c "set server operators += root@mysrv"
    qmgr -c "set server managers += root@mysrv"
    qmgr -c 'create queue batch'
    qmgr -c 'set queue batch queue_type = execution'
    qmgr -c 'set queue batch started = true'
    qmgr -c 'set queue batch enabled = true'
    qmgr -c 'set queue batch resources_default.walltime = 480:00:00'
    qmgr -c 'set queue batch resources_default.nodes = 1'
    qmgr -c 'set queue batch max_running = 1000'
    qmgr -c 'set server default_queue = batch'

16 - Log into a non-root account and run these commands as a preliminary test:

    qmgr -c "list server"
    qmgr -c "list queue batch"

17 - Submit a test job from the nonroot account, then view it using qstat:

    echo "sleep 30" | qsub
    qstat

Results look like this:

[mine@mysrv ~]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
0.mysrv                   STDIN            mine                   0 Q batch
1.mysrv                   STDIN            mine                   0 Q batch
[mine@mysrv ~]$

From zacharybs at ornl.gov Tue Oct 9 13:13:42 2012
From: zacharybs at ornl.gov (Zachary, Brian S.)
Date: Tue, 9 Oct 2012 15:13:42 -0400 Subject: [torqueusers] Hardcoded CPU time limit? In-Reply-To: <60D410E9-382A-4C42-B09E-0AE10BDBDB95@ornl.gov> References: <60D410E9-382A-4C42-B09E-0AE10BDBDB95@ornl.gov> Message-ID: Looks like a bunch of moderated messages were just let loose on the list? we've since figured out the problem, that our resources_default.cput was set to 10,000 hours and our users weren't specifying a cpu time constraint in their submission scripts. Thanks, Brian On Aug 20, 2012, at 2:40 PM, Zachary, Brian S. wrote: > Hello, > We are running torque 3.0.2, and are seeing an issue where jobs that run longer than 10,000 hours are killed with a message similar to "PBS: job killed: cput job total 36010171 secs exceeded limit 36000000 secs". This despite that the queue we are seeing the problem on is configured with "resources_max.cput = 24000:00:00". > > Does anyone know how to get around this limit, either through source modification, configuration changes, use a different version of torque, etc.? > > Thanks, > Brian Zachary > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2942 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20121009/792e4cdb/attachment-0001.bin From gus at ldeo.columbia.edu Tue Oct 9 13:53:09 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 09 Oct 2012 15:53:09 -0400 Subject: [torqueusers] Torque Jobs Stay Queued In-Reply-To: References: Message-ID: <507480A5.4060302@ldeo.columbia.edu> Is this missing, perhaps? 

    qmgr -c 'set server scheduling = True'

It may help if you send the output of

    qmgr -c 'print server'

Gus Correa

On 10/09/2012 10:19 AM, Ablen wrote:
> Hello friends,
>
> I am working to install Torque on a FC16 Linux Cluster. So far I have added
> only what I need for it to run on the master node - and I think I have done
> everything correctly. When I submit a job, however, it shows up as being in the
> queued state - and won't run. I think there must be a minor step I've missed.
> Below are the steps I've taken to set up torque as well as the sample job I am
> trying. Could someone please let me know what I may still need to do in order
> for this job to run? All comments appreciated.
>
> Many thanks.
> ablen
>
> 1 - Log into server as root
> 2 - Edit the /etc/hosts file and change the first line so it looks like this:
>
> 127.0.0.1 mysrv localhost.localdomain localhost
>
> 3 - yum install openssl-devel
> 3 - yum install libxml2-devel
> 4 - yum -y install 'torque*'
> 5 - pbs_server -t create
> 6 - systemctl start pbs_{mom,server,sched}.service
> 7 - systemctl enable pbs_{mom,server,sched}.service
> 8 - /usr/local/sbin/trqauthd start
> 9 - pbs_server
> 10 - vi /var/spool/torque/server_name and also vi /etc/torque/server_name
>
> change server name to mysrv if needed
>
> 11 - vi /var/spool/torque/mom_priv/config and vi /etc/torque/mom/config
> add/modify this line:
>
> $pbsserver mysrv
>
> 12 - vi /var/spool/torque/server_priv/nodes (create this file) and add all
> nodes in the cluster like this (np for number of processors - VERIFY THAT THESE
> ARE 4 processors ea).
>
> mysrv np=4
> node2 np=4
> node3 np=4
> ...
>
> 13 - vi /etc/sysconfig/network (and make sure that HOSTNAME is set as follows):
>
> HOSTNAME=mysrv
>
> 14 - Append these lines to the /etc/profile file (for bash)
> PBS_DEFAULT=mysrv
> export PBS_DEFAULT
>
> Append these lines to the /etc/bashrc file (also for bash)
> PBS_DEFAULT=mysrv
> export PBS_DEFAULT
>
> 15 - execute all of the following commands:
>
> qmgr -c "set server operators += root@mysrv"
> qmgr -c "set server managers += root@mysrv"
> qmgr -c 'create queue batch'
> qmgr -c 'set queue batch queue_type = execution'
> qmgr -c 'set queue batch started = true'
> qmgr -c 'set queue batch enabled = true'
> qmgr -c 'set queue batch resources_default.walltime = 480:00:00'
> qmgr -c 'set queue batch resources_default.nodes = 1'
> qmgr -c 'set queue batch max_running = 1000'
> qmgr -c 'set server default_queue = batch'
>
> 16 - Log into a non-root account and run these commands as a preliminary test:
>
> qmgr -c "list server"
> qmgr -c "list queue batch"
>
> 17 - Submit a test job from the nonroot account, then view it using qstat:
>
> echo "sleep 30" | qsub
> qstat
>
> Results look like this:
>
> [mine@mysrv ~]$ qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 0.mysrv                   STDIN            mine                   0 Q batch
>
> 1.mysrv                   STDIN            mine                   0 Q batch
>
> [mine@mysrv ~]$
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From nt_mahmood at yahoo.com Wed Oct 10 01:13:22 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Wed, 10 Oct 2012 00:13:22 -0700 (PDT)
Subject: [torqueusers] problem receiving emails
Message-ID: <1349853202.56188.YahooMailNeo@web111717.mail.gq1.yahoo.com>

Dear moderators,

Is there any problem with the mail server? I receive no emails for a week, and then I receive 100 emails at once.

I am sending this email on 10/10/2012 at 7:12 AM UTC/GMT. If you are going to reply to this email, please specify the date and hour.

Regards,
Mahmood
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121010/83c242e0/attachment.html From d-ulrick at comcast.net Wed Oct 10 09:48:46 2012 From: d-ulrick at comcast.net (Dave Ulrick) Date: Wed, 10 Oct 2012 10:48:46 -0500 (CDT) Subject: [torqueusers] Cleaning up stray processes from defunct jobs In-Reply-To: <1349728365.15740.485.camel@browncoat.jics.utk.edu> References: <1348782004.15740.165.camel@browncoat.jics.utk.edu> <1349728365.15740.485.camel@browncoat.jics.utk.edu> Message-ID: On Mon, 8 Oct 2012, Troy Baer wrote: > On Mon, 2012-10-08 at 15:26 -0500, Dave Ulrick wrote: >> On Thu, 27 Sep 2012, Troy Baer wrote: >>> On Thu, 2012-09-27 at 16:27 -0500, Dave Ulrick wrote: >>>> On occasion I see a user run an MPI job via TORQUE that doesn't shut down >>>> cleanly and as a result leaves running processes behind to interfere with >>>> subsequent jobs that are assigned to its nodes. Any suggestions on how I >>>> might go about simplifying the task of finding and killing these >>>> processes? >>> >>> I would recommend running something like reaver [1] in your >>> epilogue.parallel on each node. >>> >>> [1] http://svn.nics.tennessee.edu/repos/pbstools/trunk/sbin/reaver >>> >>> --Troy >> >> I've deployed reaver to my compute nodes and have run some test jobs. It >> appears that TORQUE runs 'epilogue' on the job head node and >> 'epilogue.parallel' on the sister nodes so I've got both scripts set up to >> run reaver. I don't have a job at hand that will create stray processes so >> I'll just wait and see what reaver does the next time such a job runs. > > Be aware that reaver doesn't kill processes unless you specifically tell > it to do so with the -k option. I would recommend running in the > default identification-only mode for a while until you're sure that it's > consistently identifying processes that need killed. I've been running reaver for a few days now. I've identified a situation where a job left behind stray processes that reaver didn't remove. 
This apparently happened because the job's /var/spool/torque/mom_priv/jobs/foo.JB file wasn't removed when the job ended.

I've got my epilogue and epilogue.parallel scripts writing to a log file whenever they run. For some reason only the 'epilogue' script ran when the job with stray processes ended. 'epilogue.parallel' wasn't run on any nodes. For many other jobs, none left stray processes, and all appear to have run both 'epilogue' and 'epilogue.parallel' scripts.

Any idea of what went wrong and how to fix it?

Thanks,
Dave
--
Dave Ulrick
d-ulrick at comcast.net

From ablenzo at hotmail.com Wed Oct 10 12:48:01 2012
From: ablenzo at hotmail.com (Antonino Lenzo)
Date: Wed, 10 Oct 2012 14:48:01 -0400
Subject: [torqueusers] qsub produces Undefined attribute error
In-Reply-To: <507480A5.4060302@ldeo.columbia.edu>
References: , <507480A5.4060302@ldeo.columbia.edu>
Message-ID:

Hello friends. And thank you, Gus, for answering my prior question.

I uninstalled torque and started fresh using the documentation as an example (/usr/share/doc/torque-3.0.3/README.Fedora). I am running FC16 as well as the latest version of torque. Although this is a cluster (3 machines), I am only working to get the master node working at this time.

When I submit a job to torque I get the following error:

    qsub: submit error (Undefined attribute MSG=detected presence of an unknown attribute)

I am not sure why. Below is the list of steps I took to complete the installation. Please have a look and let me know what I might have missed.

Thanks all,
ablen

1 - Log into server as root
2 - yum install openssl-devel
3 - yum install libxml2-devel
4 - yum -y install 'torque*'
5 - vi /etc/torque/server_name so that it has only these contents:

    mysrv

6 - vi /etc/torque/mom/config so that it has only these contents:

    $pbs_server mysrv

7 - /usr/sbin/pbs_server -D -t create
    might have to ctrl-c out of this - if so, that seems to cause no problems.
8 - service pbs_server start
9 - /usr/local/sbin/trqauthd start
10 - Configure torque with these commands:

    qmgr -c "s s scheduling=true"
    qmgr -c "c q batch queue_type=execution"
    qmgr -c "s q batch started=true"
    qmgr -c "s q batch enabled=true"
    qmgr -c "s q batch resources_default.nodes=1"
    qmgr -c "s q batch resources_default.walltime=3600"
    qmgr -c "s s default_queue=batch"

11 - Add one batch worker to your pbs_server: (I actually have no idea what this does)

    qmgr -c "c n mysrv"

12 - Start pbs_mom and pbs_sched daemons.

    service pbs_mom start
    service pbs_sched start

13 - Use chkconfig to start these services at boot time:

    /sbin/chkconfig pbs_mom on
    /sbin/chkconfig pbs_server on
    /sbin/chkconfig pbs_sched on

14 - Use a non-root account to submit a test job and check the settings like so:

    pbsnodes -l free
    qsub <<EOF
    > hostname
    > echo "Hi I am a batch job running in torque"
    > EOF
    qsub: submit error (Undefined attribute MSG=detected presence of an unknown attribute)
    [myaccount@mysrv ~]$

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121010/4dc8decf/attachment-0001.html

From chenry at ittc.ku.edu Wed Oct 10 13:30:27 2012
From: chenry at ittc.ku.edu (Charles Henry)
Date: Wed, 10 Oct 2012 14:30:27 -0500 (CDT)
Subject: [torqueusers] login shells not run in torque 4 and bash -l
In-Reply-To: <615566970.673728.1349896791986.JavaMail.root@ittc.ku.edu>
Message-ID: <135357755.673805.1349897427580.JavaMail.root@ittc.ku.edu>

Hi list,

I have been following the torque 4 development, and I'm currently using torque 4.1.2 on RHEL6.2. I have found that I cannot get cluster jobs to run correctly without using "#!/bin/bash -l" in each script. A few sites (academic and government) are listing this workaround in their cluster FAQs. Our site uses mpi-selector and needs to source /etc/profile for every cluster job (interactive or not).

I have looked for settings and even gone so far as reading the source code. The relevant settings are defined globally inside src/resmom/mom_main.c

...
(line 205)
int src_login_batch = TRUE;
int src_login_interactive = TRUE;
...

and in src/resmom/start_exec.c

...
(line 3736)
void source_login_shells_or_not(
...
  if (((TJE->is_interactive == TRUE) &&
       (src_login_interactive == FALSE)) ||
      ((TJE->is_interactive != TRUE) &&
       (src_login_batch == FALSE)))
...

There, those values are declared as "extern int", so the values from mom_main.c are accessible once the binaries are linked. There's no error message from the source_login_shells_or_not function, and the code looks very similar to the torque-3 code (except for being wrapped up into functions).

Can anyone shed some light on the problem?

Chuck

From Alessandra.Forti at cern.ch Wed Oct 10 13:30:42 2012
From: Alessandra.Forti at cern.ch (Alessandra Forti)
Date: Wed, 10 Oct 2012 20:30:42 +0100
Subject: [torqueusers] Maui is not submitting jobs to torque
In-Reply-To: <50745914.3050601@cern.ch>
References: <50745914.3050601@cern.ch>
Message-ID: <5075CCE2.1090300@cern.ch>

Hi,

I have installed a mini test cluster with torque and maui. We have used maui/torque for years on our grid cluster and now we are upgrading to torque 2.5.7 and maui 3.3-4. Unfortunately, with this new combination maui doesn't seem to work correctly. When I submit jobs it behaves as if there weren't any free resources. Even when I tried to install only torque and maui with a bare minimum configuration I got the same behaviour, i.e.

1) When I submit, the jobs just remain queued

[root@ maui]# qstat -an1

:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
10.                  aforti   long     pbs-vm3.sh          --    --   --    --    --  Q   --    --
11.s                 aforti   long     pbs-vm3.sh          --    --   --    --    --  Q   --    --

2) If I run qrun the job runs, so I assume the problem is not between torque server and torque mom.

3) On the old versions showq displayed the WCLimit of the default queue; now it displays 0 at first and then changes it by itself to 100 days

showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

     0 Active Jobs       0 of   16 Processors Active (0.00%)
                         0 of    1 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

2                    aforti       Idle     1 99:23:59:59  Wed Oct 10 13:36:34
3                    aforti       Idle     1 99:23:59:59  Wed Oct 10 14:01:43
4                    aforti       Idle     1 99:23:59:59  Wed Oct 10 18:50:14
5                    aforti       Idle     1    00:00:00  Wed Oct 10 20:29:27

4 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

Total Jobs: 4   Active Jobs: 0   Idle Jobs: 4   Blocked Jobs: 0

Total Jobs: 2   Active Jobs: 0   Idle Jobs: 2   Blocked Jobs: 0

4) Checkjob just tells me the job cannot be run in the default partition, without any particular reason

[.....]
PE:  1.00  StartPriority:  120
cannot select job 10 for partition DEFAULT (Class)

5) Checknode can see the node free, if it wasn't clear from other commands

[root@ maui]# !checkno
checknode

checking node

State:      Idle  (in current state for 00:55:10)
Configured Resources: PROCS: 16  MEM: 23G  SWAP: 31G  DISK: 1M
Utilized   Resources: SWAP: 202M
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       0.000
Network:    [DEFAULT]
Features:   [lcgpro]
Attributes: [Batch]
Classes:    [DEFAULT 1:1]

Total Time: 3:06:35  Up: 3:06:24 (99.90%)  Active: 00:00:10 (0.09%)

Reservations:
NOTE:  no reservations on node

6) When I use showbf -v, though, it says my nodes are blocked by reservations, despite checknode clearly telling me there are no reservations on that node. In our local maui.cfg there is a reservation for 1 proc; I'm not sure why it blocks the whole node

[root@ server_logs]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9 17:08:59

  3 procs available with no timelimit

node is blocked by reservation sft.0.0 in INFINITY

But to be sure I removed it, and even when I remove the reservation and reduce the maui.cfg to the default version without anything in it, it tells me the node is blocked by "reservation NONE in INFINITY"

[root@ maui]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9 17:37:58

 16 procs available with no timelimit

node is blocked by reservation NONE in INFINITY

If I increase the maui loglevel to 9 I get hundreds of these messages

10/10 13:37:39 MRMCheckEvents()
10/10 13:37:39 INFO:     no PBS sched socket connections ready
10/10 13:37:39 MSUAcceptClient(6,ClientSD,HostName,TCP)
10/10 13:37:39 INFO:     accept call failed, errno: 11 (Resource temporarily unavailable)
10/10 13:37:39 INFO:     all clients connected.  servicing requests

which leaves me perplexed, since in other places with a different log level it sees the jobs waiting on the server, so somehow some communication happens and other communication doesn't

10/10 20:27:24 INFO:     job '2' Priority:      410
10/10 20:27:24 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:    410(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
10/10 20:27:24 INFO:     job '2' priority: 410.30
10/10 20:27:24 MJobGetStartPriority(3,0,Priority,NULL)
10/10 20:27:24 INFO:     job '3' Priority:      385
10/10 20:27:24 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:    385(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
10/10 20:27:24 INFO:     job '3' priority: 385.30
10/10 20:27:24 MJobGetStartPriority(4,0,Priority,NULL)
10/10 20:27:24 INFO:     job '4' Priority:       97
10/10 20:27:24 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:     97(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
10/10 20:27:24 INFO:     job '4' priority: 97.17

Thanks for any help. Here are the rpms I used

maui-3.3-4.el5
maui-client-3.3-4.el5
maui-server-3.3-4.el5
torque-2.5.7-7.el5
torque-client-2.5.7-7.el5
torque-server-2.5.7-7.el5
libtorque-2.5.7-7.el5

the maui.cfg

#
# MAUI configuration example
# @(#)maui.cfg David Groep 20031015.1
# for MAUI version 3.2.5
#
SERVERHOST
ADMIN1          root
ADMINHOST
RMTYPE[0]       PBS
RMHOST[0]
RMSERVER[0]

SERVERPORT      40559
SERVERMODE      NORMAL

# Set PBS server polling interval. Since we have many short jobs
# and want fast turn-around, set this to 10 seconds (default: 2 minutes)
RMPOLLINTERVAL  00:00:10

# a max. 10 MByte log file in a logical location
LOGFILE         /var/log/maui.log
LOGFILEMAXSIZE  10000000
LOGLEVEL        3

and Torque config

create queue long
set queue long queue_type = Execution
set queue long acl_hosts = localhost
set queue long acl_hosts +=
set queue long resources_max.cput = 48:00:00
set queue long resources_max.walltime = 72:00:00
set queue long acl_group_enable = True
set queue long acl_groups = aforti
set queue long enabled = True
set queue long started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts =
set server acl_hosts += localhost
set server default_queue = long
set server log_events = 511
set server mail_from = adm
set server next_job_number = 12

--
Facts aren't facts if they come from the wrong people. (Paul Krugman)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121010/8e5bc7cc/attachment.html

From basv at sara.nl Thu Oct 11 00:31:11 2012
From: basv at sara.nl (Bas van der Vlies)
Date: Thu, 11 Oct 2012 06:31:11 +0000
Subject: [torqueusers] Maui is not submitting jobs to torque
In-Reply-To: <5075CCE2.1090300@cern.ch>
References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch>
Message-ID: <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl>

On 10 Oct 2012, at 21:30, Alessandra Forti wrote:

> node is blocked by reservation sft.0.0 in INFINITY

This message indicates that there is a reservation on the node. What is the output of showres?
--
Bas van der Vlies
basv at sara.nl

From jonathan.barber at gmail.com Thu Oct 11 05:36:02 2012
From: jonathan.barber at gmail.com (Jonathan Barber)
Date: Thu, 11 Oct 2012 12:36:02 +0100
Subject: [torqueusers] Maui is not submitting jobs to torque
In-Reply-To: <5075CCE2.1090300@cern.ch>
References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch>
Message-ID:

On 10 October 2012 20:30, Alessandra Forti wrote:
> Hi,

[snip]

> 10/10 13:37:39 MRMCheckEvents()
> 10/10 13:37:39 INFO: no PBS sched socket connections ready
> 10/10 13:37:39 MSUAcceptClient(6,ClientSD,HostName,TCP)
> 10/10 13:37:39 INFO: accept call failed, errno: 11 (Resource temporarily
> unavailable)
> 10/10 13:37:39 INFO: all clients connected. servicing requests
>
> which leaves me perplexed since in other places with a different log level
> it sees the jobs waiting on the server so somehow some comunication happens
> and other doesn't

I see these same messages from maui 3.3.1. It is probably not a problem for you, but I believe it is a small bug in Maui. The problem is that the socket has the flag O_NONBLOCK set. However, when MSUAcceptClient() calls accept() to see if a client is connecting, it doesn't take this into account.

I've attached a patch for the attention of the maintainers. It quietens the output, works for the 3.3.1 branch, and applies cleanly against the trunk (so it also works there).

Regards
--
Jonathan Barber
-------------- next part --------------
A non-text attachment was scrubbed...
Name: MSU.c.patch
Type: application/octet-stream
Size: 469 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20121011/acb3c77a/attachment.obj

From basv at sara.nl Thu Oct 11 05:46:21 2012
From: basv at sara.nl (Bas van der Vlies)
Date: Thu, 11 Oct 2012 13:46:21 +0200
Subject: [torqueusers] Directory with space in the name causes that no batch .e/.o files are copied
Message-ID: <5076B18D.2030709@sara.nl>

Hello,

We see more and more users who have directories with spaces in their names. I have installed torque 2.X and 4.X, and neither version accepts them. Will this be fixed in a future release?

The problem is in the use of the function wordexp: the characters that cause the problem are not escaped. See:
 * src/resmom/requests.c

Here is an example that works. The first invocation does not escape the spaces and sees 3 words. The second sees one file. To my knowledge PBS_O_WORKDIR is already expanded.

{{{
#include <stdio.h>
#include <stdlib.h>
#include <wordexp.h>

int main(int argc, char **argv)
{
  wordexp_t p;
  char **w;
  size_t i;
  int r;

  /* Unquoted spaces: wordexp splits the path into 3 words. */
  r = wordexp("/home/bas/dir with spaces/file.sh", &p, (WRDE_NOCMD|WRDE_UNDEF));
  printf("exit code = %d\n", r);
  printf("Number of words = %zu\n", p.we_wordc);
  w = p.we_wordv;
  for (i = 0; i < p.we_wordc; i++)
    printf("%s\n", w[i]);
  wordfree(&p);

  /* Quoted directory name: wordexp returns a single word. */
  r = wordexp("/home/bas/'dir with spaces'/file.sh", &p, (WRDE_NOCMD|WRDE_UNDEF));
  printf("exit code = %d\n", r);
  printf("Number of words = %zu\n", p.we_wordc);
  w = p.we_wordv;
  for (i = 0; i < p.we_wordc; i++)
    printf("%s\n", w[i]);
  wordfree(&p);

  exit(EXIT_SUCCESS);
}
}}}

--
********************************************************************
* Bas van der Vlies                     e-mail: basv at sara.nl      *
* SARA - Academic Computing Services    Amsterdam, The Netherlands *
********************************************************************
-------------- next part --------------
A non-text attachment was scrubbed...
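[Editorial sketch, not part of the original message and not TORQUE code. One possible way to do the escaping that the wordexp example in the message above calls for is to put a backslash in front of every character outside a conservative safe set before handing the string to wordexp(). The helper names escape_for_wordexp and count_words_escaped are hypothetical. The approach assumes the path is already fully expanded, as the message suggests for PBS_O_WORKDIR, because escaping also disables tilde and variable expansion.]

```c
#include <stdlib.h>
#include <string.h>
#include <wordexp.h>

/* Hypothetical helper (not part of TORQUE): return a malloc'd copy of
 * `path` with a backslash before every character that is not in a small
 * safe set, so that wordexp() sees one literal word. Returns NULL on
 * allocation failure or if the path contains a newline, because a
 * backslash-newline pair would be dropped as a line continuation.
 * The caller frees the result. */
char *escape_for_wordexp(const char *path)
{
    static const char safe[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789/._-";
    const char *src;
    char *out, *dst;

    if (strchr(path, '\n') != NULL)
        return NULL;

    /* Worst case: every character needs a backslash in front of it. */
    out = malloc(2 * strlen(path) + 1);
    if (out == NULL)
        return NULL;

    dst = out;
    for (src = path; *src != '\0'; src++) {
        if (strchr(safe, *src) == NULL)
            *dst++ = '\\';
        *dst++ = *src;
    }
    *dst = '\0';
    return out;
}

/* Expand an escaped copy of `path` and return the number of words that
 * wordexp() reports, or -1 on any error. */
int count_words_escaped(const char *path)
{
    wordexp_t p;
    char *escaped = escape_for_wordexp(path);
    int count;

    if (escaped == NULL)
        return -1;
    if (wordexp(escaped, &p, WRDE_NOCMD | WRDE_UNDEF) != 0) {
        free(escaped);
        return -1;
    }
    count = (int)p.we_wordc;
    wordfree(&p);
    free(escaped);
    return count;
}
```

[With this sketch, the path "/home/bas/dir with spaces/file.sh" from the message is passed to wordexp() as "/home/bas/dir\ with\ spaces/file.sh" and comes back as a single word.]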
Name: smime.p7s
Type: application/pkcs7-signature
Size: 3264 bytes
Desc: S/MIME Cryptographic Signature
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20121011/c29bb9a1/attachment.bin

From RB.Ezhilalan at hse.ie Thu Oct 11 09:58:50 2012
From: RB.Ezhilalan at hse.ie (RB. Ezhilalan (Principal Physicist, CUH))
Date: Thu, 11 Oct 2012 16:58:50 +0100
Subject: [torqueusers] job schedule and queus
In-Reply-To: 
References: 
Message-ID: 

Hi All,

I am new to Torque and have been using it for scheduling parallel Monte Carlo calculations over multiple CPUs on different PCs. I have set up two new queues, namely 'long' and 'short', besides the default queue 'batch'.

I noticed that when I submit multiple jobs without specifying a queue name, the pbs server/scheduler schedules those 7 jobs immediately in the default 'batch' queue. The qstat -q command immediately after submitting the jobs shows that 7 jobs are running (the number '7' under 'R').

However, when I used either the 'short' or the 'long' queue name to launch 7 jobs, the jobs went into a queue. The qstat -q command in this instance shows 1 job running and 6 jobs queued. Those jobs eventually got executed, perhaps taking longer to complete than the same jobs run in the default queue 'batch'.

All CPUs were free in both cases, so why were the jobs scheduled immediately in the default queue but not in the other queues? Could I get some advice on this?

Many thanks,
Ezhilalan Ramalingam M.Sc.,DABR.,
Principal Physicist (Radiotherapy),
Medical Physics Department,
Cork University Hospital,
Wilton, Cork
Ireland
Tel.
00353 21 4922533 Fax.00353 21 4921300 Email: rb.ezhilalan at hse.ie -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of torqueusers-request at supercluster.org Sent: 11 October 2012 07:31 To: torqueusers at supercluster.org Subject: torqueusers Digest, Vol 99, Issue 9 Send torqueusers mailing list submissions to torqueusers at supercluster.org To subscribe or unsubscribe via the World Wide Web, visit http://www.supercluster.org/mailman/listinfo/torqueusers or, via email, send a message with subject or body 'help' to torqueusers-request at supercluster.org You can reach the person managing the list at torqueusers-owner at supercluster.org When replying, please edit your Subject line so it is more specific than "Re: Contents of torqueusers digest..." Today's Topics: 1. login shells not run in torque 4 and bash -l (Charles Henry) 2. Maui is not submitting jobs to torque (Alessandra Forti) 3. Re: Maui is not submitting jobs to torque (Bas van der Vlies) ---------------------------------------------------------------------- Message: 1 Date: Wed, 10 Oct 2012 14:30:27 -0500 (CDT) From: Charles Henry Subject: [torqueusers] login shells not run in torque 4 and bash -l To: torqueusers at supercluster.org Message-ID: <135357755.673805.1349897427580.JavaMail.root at ittc.ku.edu> Content-Type: text/plain; charset=utf-8 Hi list, I have been following the torque 4 development, and I'm currently using torque 4.1.2 on RHEL6.2. I have found that I cannot get cluster jobs to run correctly without using "#!/bin/bash -l" in each script. A few sites (academic and government) are listing this workaround in their cluster FAQs. Our site uses mpi-selector and needs to source /etc/profile for every cluster job (interactive or not). I have looked for settings and even gone so far as reading the source code. The relevant settings are defined globally inside src/resmom/mom_main.c ... 
(line 205)
int src_login_batch = TRUE;
int src_login_interactive = TRUE;
...

and in src/resmom/start_exec.c
...
(line 3736)
void source_login_shells_or_not(
...
  if (((TJE->is_interactive == TRUE) &&
       (src_login_interactive == FALSE)) ||
      ((TJE->is_interactive != TRUE) &&
       (src_login_batch == FALSE)))
...

Where those values are declared as "extern int", so the values from mom_main.c are accessible once the binaries are linked.

There's no error message from the source_login_shells_or_not function, and the code looks very similar to the torque-3 code (except for being wrapped up into functions).

Can anyone shed some light on the problem?

Chuck

------------------------------

Message: 2
Date: Wed, 10 Oct 2012 20:30:42 +0100
From: Alessandra Forti
Subject: [torqueusers] Maui is not submitting jobs to torque
To: ,
Message-ID: <5075CCE2.1090300 at cern.ch>
Keywords: CERN SpamKiller Note: -50
Content-Type: text/plain; charset="iso-8859-1"

Hi,

I have installed a mini test cluster with torque and maui. We have used maui/torque for years on our grid cluster and now we are upgrading to torque 2.5.7 and maui 3.3-4. Unfortunately, with this new combination maui doesn't seem to work correctly. When I submit jobs it behaves as if there weren't any free resources. Even when I tried to install only torque and maui with a bare minimum configuration I got the same behaviour, i.e.

1) When I submit, the jobs just remain queued

[root@ maui]# qstat -an1

:
                                                            Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
10.                  aforti   long     pbs-vm3.sh           --    --  --     --    -- Q    --   --
11.s                 aforti   long     pbs-vm3.sh           --    --  --     --    -- Q    --   --

2) If I run qrun the job runs, so I assume the problem is not between torque server and torque mom.
3) With the old versions, showq displayed the WCLimit of the default queue. Now it displays 0 at first and then changes it by itself to 100 days

showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME


     0 Active Jobs       0 of   16 Processors Active (0.00%)
                         0 of    1 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

2                    aforti       Idle     1 99:23:59:59  Wed Oct 10 13:36:34
3                    aforti       Idle     1 99:23:59:59  Wed Oct 10 14:01:43
4                    aforti       Idle     1 99:23:59:59  Wed Oct 10 18:50:14
5                    aforti       Idle     1    00:00:00  Wed Oct 10 20:29:27

4 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 4   Active Jobs: 0   Idle Jobs: 4   Blocked Jobs: 0


Total Jobs: 2   Active Jobs: 0   Idle Jobs: 2   Blocked Jobs: 0

4) Checkjob just tells me the job cannot be run in the default partition, without giving any particular reason

[.....]
PE:  1.00  StartPriority:  120
cannot select job 10 for partition DEFAULT (Class)

5) Checknode can see the node is free, in case that wasn't clear from the other commands

[root@ maui]# !checkno
checknode


checking node

State:      Idle  (in current state for 00:55:10)
Configured Resources: PROCS: 16  MEM: 23G  SWAP: 31G  DISK: 1M
Utilized   Resources: SWAP: 202M
Dedicated  Resources: [NONE]
Opsys:         linux  Arch:      [NONE]
Speed:      1.00      Load:       0.000
Network:    [DEFAULT]
Features:   [lcgpro]
Attributes: [Batch]
Classes:    [DEFAULT 1:1]

Total Time: 3:06:35  Up: 3:06:24 (99.90%)  Active: 00:00:10 (0.09%)

Reservations:
NOTE:  no reservations on node

6) When I use showbf -v, though, it says my nodes are blocked by reservations, despite checknode clearly telling me there are no reservations on that node.
In our local maui.cfg there is a reservation for 1 proc. I'm not sure why it blocks the whole node

[root@ server_logs]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9 17:08:59

  3 procs available        with no timelimit

node  is blocked by reservation sft.0.0 in INFINITY

But to be sure I removed it, and even when I remove the reservation and reduce the maui.cfg to the default version without anything in it, it tells me the node is blocked by "reservation NONE in INFINITY"

[root@ maui]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9 17:37:58

 16 procs available        with no timelimit

node  is blocked by reservation NONE in INFINITY

If I increase the maui loglevel to 9 I get hundreds of these messages

10/10 13:37:39 MRMCheckEvents()
10/10 13:37:39 INFO:     no PBS sched socket connections ready
10/10 13:37:39 MSUAcceptClient(6,ClientSD,HostName,TCP)
10/10 13:37:39 INFO:     accept call failed, errno: 11 (Resource temporarily unavailable)
10/10 13:37:39 INFO:     all clients connected.  servicing requests

which leaves me perplexed, since in other places, with a different log level, it sees the jobs waiting on the server, so somehow some communication happens and other communication doesn't

10/10 20:27:24 INFO:     job '2' Priority:      410
10/10 20:27:24 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:    410(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
10/10 20:27:24 INFO:     job '2' priority: 410.30
10/10 20:27:24 MJobGetStartPriority(3,0,Priority,NULL)
10/10 20:27:24 INFO:     job '3' Priority:      385
10/10 20:27:24 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:    385(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
10/10 20:27:24 INFO:     job '3' priority: 385.30
10/10 20:27:24 MJobGetStartPriority(4,0,Priority,NULL)
10/10 20:27:24 INFO:     job '4' Priority:       97
10/10 20:27:24 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:     97(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
10/10 20:27:24 INFO:     job '4' priority:  97.17

Thanks for any help

here are the rpms I used

maui-3.3-4.el5
maui-client-3.3-4.el5
maui-server-3.3-4.el5
torque-2.5.7-7.el5
torque-client-2.5.7-7.el5
torque-server-2.5.7-7.el5
libtorque-2.5.7-7.el5

the maui.cfg

#
# MAUI configuration example
# @(#)maui.cfg David Groep 20031015.1
# for MAUI version 3.2.5
#
SERVERHOST
ADMIN1          root
ADMINHOST
RMTYPE[0]       PBS
RMHOST[0]
RMSERVER[0]

SERVERPORT      40559
SERVERMODE      NORMAL

# Set PBS server polling interval. Since we have many short jobs
# and want fast turn-around, set this to 10 seconds (default: 2 minutes)
RMPOLLINTERVAL  00:00:10

# a max. 10 MByte log file in a logical location
LOGFILE         /var/log/maui.log
LOGFILEMAXSIZE  10000000
LOGLEVEL        3

and Torque config

create queue long
set queue long queue_type = Execution
set queue long acl_hosts = localhost
set queue long acl_hosts +=
set queue long resources_max.cput = 48:00:00
set queue long resources_max.walltime = 72:00:00
set queue long acl_group_enable = True
set queue long acl_groups = aforti
set queue long enabled = True
set queue long started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts =
set server acl_hosts += localhost
set server default_queue = long
set server log_events = 511
set server mail_from = adm
set server next_job_number = 12

-- 
Facts aren't facts if they come from the wrong people. (Paul Krugman)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121010/8e5bc7cc/attachment-0001.html

------------------------------

Message: 3
Date: Thu, 11 Oct 2012 06:31:11 +0000
From: Bas van der Vlies
Subject: Re: [torqueusers] Maui is not submitting jobs to torque
To: Torque Users Mailing List
Cc: ""
Message-ID: <74EB35DC444C754DA400390F673C23D6AF6F17 at sara-exch-3.ka.sara.nl>
Content-Type: text/plain; charset="iso-8859-1"

On 10 okt. 2012, at 21:30, Alessandra Forti wrote:

node is blocked by reservation sft.0.0 in INFINITY

This message says that there is a reservation on the node. What is the output of showres?
-- Bas van der Vlies basv at sara.nl ------------------------------ _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers End of torqueusers Digest, Vol 99, Issue 9 ****************************************** From Alessandra.Forti at cern.ch Thu Oct 11 14:23:10 2012 From: Alessandra.Forti at cern.ch (Alessandra Forti) Date: Thu, 11 Oct 2012 21:23:10 +0100 Subject: [torqueusers] Maui is not submitting jobs to torque In-Reply-To: References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> Message-ID: <50772AAE.1000505@cern.ch> Thanks if it is harmless I'll ignore it. It just comes out with the highest level of logging anyway. cheers alessandra On 11/10/2012 12:36, Jonathan Barber wrote: > On 10 October 2012 20:30, Alessandra Forti wrote: >> Hi, > [snip] > >> 10/10 13:37:39 MRMCheckEvents() >> 10/10 13:37:39 INFO: no PBS sched socket connections ready >> 10/10 13:37:39 MSUAcceptClient(6,ClientSD,HostName,TCP) >> 10/10 13:37:39 INFO: accept call failed, errno: 11 (Resource temporarily >> unavailable) >> 10/10 13:37:39 INFO: all clients connected. servicing requests >> >> which leaves me perplexed since in other places with a different log level >> it sees the jobs waiting on the server so somehow some comunication happens >> and other doesn't > I see these same messages from maui 3.3.1. It is probably not a > problem for you, but it I believe it is a small bug in Maui. > > The problem is that the socket has the flag O_NONBLOCK set. However, > when the MSUAcceptClient() call's accept() to see if a client is > connecting, it doesn't take this into account. > > I've attached a patch for the attention of the maintainers. It > quietens the output and works for the 3.3.1 branch and applies cleanly > against the trunk (so it also work there). 
>
> Regards
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

-- 
Facts aren't facts if they come from the wrong people. (Paul Krugman)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121011/0ac09d86/attachment-0001.html

From Alessandra.Forti at cern.ch Thu Oct 11 14:26:04 2012
From: Alessandra.Forti at cern.ch (Alessandra Forti)
Date: Thu, 11 Oct 2012 21:26:04 +0100
Subject: [torqueusers] [Mauiusers] Maui is not submitting jobs to torque
In-Reply-To: <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl>
References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl>
Message-ID: <50772B5C.3040302@cern.ch>

Hi,

yes, in one case there is a reservation, and here is the output of that test installation

showres
Reservations

ReservationID       Type S       Start         End    Duration    N/P    StartTime

sft.0.0             User -  -1:12:07:14    INFINITY    INFINITY    1/1    Wed Oct 10 09:01:02

1 reservation located

but in the second installation, which I did removing everything non-essential, there is no reservation and it still doesn't submit.

showres
Reservations

ReservationID       Type S       Start         End    Duration    N/P    StartTime


0 reservations located

I've now done the installation 3 times, progressively simplifying, and it always gives the same result.

Here is another interesting snapshot of the log files where it tells me that the classes are not supported. It looks like a miscommunication with the pbs server, but all the ports are open. And some information passes through.
10/11 21:02:04 MLocalCheckFairnessPolicy(26,1349985724,Message)
10/11 21:02:04 INFO:     job '26' added to queue at slot 2
10/11 21:02:04 INFO:     total jobs selected in partition ALL: 3/3
10/11 21:02:04 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
10/11 21:02:04 MLocalCheckFairnessPolicy(NULL,1349985724,Message)
10/11 21:02:04 INFO:     checking job[0] '24'
10/11 21:02:04 MJobCheckLimits(24,SOFT,P,8,Message)
10/11 21:02:04 INFO:     job 24 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/11 21:02:04 INFO:     checking job[1] '25'
10/11 21:02:04 MJobCheckLimits(25,SOFT,P,8,Message)
10/11 21:02:04 INFO:     job 25 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/11 21:02:04 INFO:     checking job[2] '26'
10/11 21:02:04 MJobCheckLimits(26,SOFT,P,8,Message)
10/11 21:02:04 INFO:     job 26 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/11 21:02:04 INFO:     total jobs selected in partition DEFAULT: 0/3 [Class: 3]
10/11 21:02:04 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
10/11 21:02:04 MLocalCheckFairnessPolicy(NULL,1349985724,Message)
10/11 21:02:04 INFO:     checking job[0] '24'
10/11 21:02:04 MJobCheckLimits(24,SOFT,P,8,Message)
10/11 21:02:04 INFO:     job 24 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/11 21:02:04 INFO:     checking job[1] '25'
10/11 21:02:04 MJobCheckLimits(25,SOFT,P,8,Message)
10/11 21:02:04 INFO:     job 25 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/11 21:02:04 INFO:     checking job[2] '26'
10/11 21:02:04 MJobCheckLimits(26,SOFT,P,8,Message)
10/11 21:02:04 INFO:     job 26 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]')
10/11 21:02:04 INFO:     total jobs selected in partition DEFAULT: 0/3 [Class: 3]

thanks for any help.
cheers alessandra PS I know there is an elephant in the room, I just can't see it. :( On 11/10/2012 07:31, Bas van der Vlies wrote: > On 10 okt. 2012, at 21:30, Alessandra Forti > wrote: > > node is blocked by reservation sft.0.0 in INFINITY > > This message informs that there is a reservation on the node. what is the output of showres? > > -- > Bas van der Vlies > basv at sara.nl > > > > _______________________________________________ > mauiusers mailing list > mauiusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/mauiusers -- Facts aren't facts if they come from the wrong people. (Paul Krugman) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121011/216fe4cc/attachment.html From tegner at renget.se Thu Oct 11 22:31:51 2012 From: tegner at renget.se (Jon Tegner) Date: Fri, 12 Oct 2012 06:31:51 +0200 Subject: [torqueusers] mpirun submit stopped working Message-ID: <50779D37.9040203@renget.se> Hi, using a combination of torque/maui, and typically we submit our jobs with a line like: mpirun -machinefile xxx -np xxx etc etc in the torque submitscript. Yesterday maui crashed (not sure if this is related), and after this the actual submit line had to be changed to ssh $(hostname) 'mpirun -machinefile xxx -np xxx etc etc' in order to start the job. We haven't (explicitly) changed anything, and I really don't understand this behaviour. Suggestions most welcome! 
Thanks, /jon From Alessandra.Forti at cern.ch Fri Oct 12 01:41:25 2012 From: Alessandra.Forti at cern.ch (Alessandra Forti) Date: Fri, 12 Oct 2012 08:41:25 +0100 Subject: [torqueusers] [Mauiusers] Maui is not submitting jobs to torque In-Reply-To: <50772B5C.3040302@cern.ch> References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl> <50772B5C.3040302@cern.ch> Message-ID: <5077C9A5.9090403@cern.ch> Googling for parts of this /INFO: job 24 rejected, partition DEFAULT (classes not supported '[long 1:0][DEFAULT 0:1]') /I get the MPolicy code in a mauiuser thread of 2007 http://www.supercluster.org/pipermail/mauiusers/2007-March/002665.html ad another MPolicy.c similar (identical?) code. But this doesn't really help me. cheers alessandra On 11/10/2012 21:26, Alessandra Forti wrote: > Hi, > > yes, in one case there is a reservation and here is the output of that > test installation > > /showres// > //Reservations// > // > //ReservationID Type S Start End Duration > N/P StartTime// > // > //sft.0.0 User - -1:12:07:14 INFINITY INFINITY > 1/1 Wed Oct 10 09:01:02// > // > //1 reservation located/ > > but in the second installation I did removing all non essential there > is no reservation and it still doesn't submit. > > /showres// > //Reservations// > // > //ReservationID Type S Start End Duration > N/P StartTime// > // > // > //0 reservations located > > /I've now done the installation 3 times progressively simplifying and > it always gives the same result. > > Here is another interesting snapshot of the log files where it tells > me that the classes are not supported. It looks like a miscomunication > with the pbs server but all the ports are opened. And some > information passes through. 
> > /10/11 21:02:04 MLocalCheckFairnessPolicy(26,1349985724,Message)// > //10/11 21:02:04 INFO: job '26' added to queue at slot 2// > //10/11 21:02:04 INFO: total jobs selected in partition ALL: 3/3 // > //10/11 21:02:04 > MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)// > //10/11 21:02:04 MLocalCheckFairnessPolicy(NULL,1349985724,Message)// > //10/11 21:02:04 INFO: checking job[0] '24'// > //10/11 21:02:04 MJobCheckLimits(24,SOFT,P,8,Message)// > //10/11 21:02:04 INFO: job 24 rejected, partition DEFAULT (classes > not supported '[long 1:0][DEFAULT 0:1]')// > //10/11 21:02:04 INFO: checking job[1] '25'// > //10/11 21:02:04 MJobCheckLimits(25,SOFT,P,8,Message)// > //10/11 21:02:04 INFO: job 25 rejected, partition DEFAULT (classes > not supported '[long 1:0][DEFAULT 0:1]')// > //10/11 21:02:04 INFO: checking job[2] '26'// > //10/11 21:02:04 MJobCheckLimits(26,SOFT,P,8,Message)// > //10/11 21:02:04 INFO: job 26 rejected, partition DEFAULT (classes > not supported '[long 1:0][DEFAULT 0:1]')// > //10/11 21:02:04 INFO: total jobs selected in partition DEFAULT: > 0/3 [Class: 3]// > //10/11 21:02:04 > MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)// > //10/11 21:02:04 MLocalCheckFairnessPolicy(NULL,1349985724,Message)// > //10/11 21:02:04 INFO: checking job[0] '24'// > //10/11 21:02:04 MJobCheckLimits(24,SOFT,P,8,Message)// > //10/11 21:02:04 INFO: job 24 rejected, partition DEFAULT (classes > not supported '[long 1:0][DEFAULT 0:1]')// > //10/11 21:02:04 INFO: checking job[1] '25'// > //10/11 21:02:04 MJobCheckLimits(25,SOFT,P,8,Message)// > //10/11 21:02:04 INFO: job 25 rejected, partition DEFAULT (classes > not supported '[long 1:0][DEFAULT 0:1]')// > //10/11 21:02:04 INFO: checking job[2] '26'// > //10/11 21:02:04 MJobCheckLimits(26,SOFT,P,8,Message)// > //10/11 21:02:04 INFO: job 26 rejected, partition DEFAULT (classes > not supported '[long 1:0][DEFAULT 0:1]')// > //10/11 21:02:04 INFO: total jobs selected in 
partition DEFAULT: > 0/3 [Class: 3]/ > > thanks for any help. > > cheers > alessandra > > PS I know there is an elephant in the room, I just can't see it. :( > > On 11/10/2012 07:31, Bas van der Vlies wrote: >> On 10 okt. 2012, at 21:30, Alessandra Forti > wrote: >> >> node is blocked by reservation sft.0.0 in INFINITY >> >> This message informs that there is a reservation on the node. what is the output of showres? >> >> -- >> Bas van der Vlies >> basv at sara.nl >> >> >> >> _______________________________________________ >> mauiusers mailing list >> mauiusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/mauiusers > > > -- > Facts aren't facts if they come from the wrong people. (Paul Krugman) -- Facts aren't facts if they come from the wrong people. (Paul Krugman) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121012/870c8d9a/attachment.html From jonathan.barber at gmail.com Fri Oct 12 03:17:17 2012 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Fri, 12 Oct 2012 10:17:17 +0100 Subject: [torqueusers] [Mauiusers] Maui is not submitting jobs to torque In-Reply-To: <5077C9A5.9090403@cern.ch> References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl> <50772B5C.3040302@cern.ch> <5077C9A5.9090403@cern.ch> Message-ID: On 12 October 2012 08:41, Alessandra Forti wrote: > Googling for parts of this > > INFO: job 24 rejected, partition DEFAULT (classes not supported '[long > 1:0][DEFAULT 0:1]') > > I get the MPolicy code in a mauiuser thread of 2007 > > http://www.supercluster.org/pipermail/mauiusers/2007-March/002665.html > > ad another MPolicy.c similar (identical?) code. > > But this doesn't really help me. I am just starting to use torque, but I think your acl_hosts is wrong. 
I created your torque queue and played around with changing this, and
if it doesn't include the execution host, then I get the same behavior
you see. I would guess that torque doesn't like the "localhost"?

I suggest you try removing all of the acl_hosts + acl_host_enable
settings and see if it works. Then enable acl_host_enable and add
hosts one at a time to acl_hosts to see what makes it work.

p.s. are the source RPMs available for the Maui RPMs that you are
using? I've just packaged maui and would like to see how you've done
it.

Cheers

> cheers
> alessandra

-- 
Jonathan Barber

From Alessandra.Forti at cern.ch Fri Oct 12 04:46:20 2012
From: Alessandra.Forti at cern.ch (Alessandra Forti)
Date: Fri, 12 Oct 2012 11:46:20 +0100
Subject: [torqueusers] [Mauiusers] Maui is not submitting jobs to torque
In-Reply-To: 
References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl> <50772B5C.3040302@cern.ch> <5077C9A5.9090403@cern.ch>
Message-ID: <5077F4FC.7000705@cern.ch>

Thank you!

I ported that from the old configuration. :/ I'll have to clean that up.

We get the rpms from the project (EMI) repository. You might want to take a look at their twiki

https://twiki.cern.ch/twiki/bin/view/EMI/EMI-2

cheers
alessandra

PS I knew there was an elephant.

On 12/10/2012 10:17, Jonathan Barber wrote:
> On 12 October 2012 08:41, Alessandra Forti wrote:
>> Googling for parts of this
>>
>> INFO: job 24 rejected, partition DEFAULT (classes not supported '[long
>> 1:0][DEFAULT 0:1]')
>>
>> I get the MPolicy code in a mauiuser thread of 2007
>>
>> http://www.supercluster.org/pipermail/mauiusers/2007-March/002665.html
>>
>> ad another MPolicy.c similar (identical?) code.
>>
>> But this doesn't really help me.
> I am just starting to use torque, but I think your acl_hosts is wrong.
> > I created your torque queue and played around with changing this, and > if it doesn't include the execution host, then I get the same behavior > you see. I would guess that torque doesn't like the "localhost"? > > I suggest you try removing all of the acl_hosts + acl_host_enable > settings and see if it works. Then enable acl_host_enable and add > hosts one at a time to acl_hosts to see what makes it work. > > p.s. are the source RPMs available for the Maui RPMs that you are > using? I've just packaged maui and would like to see how you've done > it. > > Cheers > >> cheers >> alessandra -- Facts aren't facts if they come from the wrong people. (Paul Krugman) From mej at lbl.gov Fri Oct 12 10:13:18 2012 From: mej at lbl.gov (Michael Jennings) Date: Fri, 12 Oct 2012 09:13:18 -0700 Subject: [torqueusers] [Mauiusers] Maui is not submitting jobs to torque In-Reply-To: <5077F4FC.7000705@cern.ch> References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl> <50772B5C.3040302@cern.ch> <5077C9A5.9090403@cern.ch> <5077F4FC.7000705@cern.ch> Message-ID: <20121012161315.GV8827@lbl.gov> On Friday, 12 October 2012, at 11:46:20 (+0100), Alessandra Forti wrote: > Thank you! > > I ported that from the old configuration. :/ I'll have to clean that up. > > We get the rpms from the project (EMI) repository. You might want to > give a look at their twiki > > https://twiki.cern.ch/twiki/bin/view/EMI/EMI-2 You may want to check with the Adaptive folks on this, but according to my reading of the Maui license, it is free *cost-wise* but is NOT free or open source software. There are restrictions of which you may run afoul if handing out SRPMs. 
Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From Alessandra.Forti at cern.ch Fri Oct 12 10:31:01 2012 From: Alessandra.Forti at cern.ch (Alessandra Forti) Date: Fri, 12 Oct 2012 17:31:01 +0100 Subject: [torqueusers] [Mauiusers] Maui is not submitting jobs to torque In-Reply-To: <20121012161315.GV8827@lbl.gov> References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl> <50772B5C.3040302@cern.ch> <5077C9A5.9090403@cern.ch> <5077F4FC.7000705@cern.ch> <20121012161315.GV8827@lbl.gov> Message-ID: <507845C5.8030204@cern.ch> From the adaptive computing site: http://www.adaptivecomputing.com/products/open-source/maui/ " /The software license allows end-user organizations to freely use, support, modify, and distribute the code for non-commercial purposes." / cheers alessandra On 12/10/2012 17:13, Michael Jennings wrote: > On Friday, 12 October 2012, at 11:46:20 (+0100), > Alessandra Forti wrote: > >> Thank you! >> >> I ported that from the old configuration. :/ I'll have to clean that up. >> >> We get the rpms from the project (EMI) repository. You might want to >> give a look at their twiki >> >> https://twiki.cern.ch/twiki/bin/view/EMI/EMI-2 > You may want to check with the Adaptive folks on this, but according > to my reading of the Maui license, it is free *cost-wise* but is NOT > free or open source software. There are restrictions of which you may > run afoul if handing out SRPMs. > > Michael > -- Facts aren't facts if they come from the wrong people. (Paul Krugman) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121012/72d10ce3/attachment.html

From mej at lbl.gov Fri Oct 12 10:40:26 2012
From: mej at lbl.gov (Michael Jennings)
Date: Fri, 12 Oct 2012 09:40:26 -0700
Subject: [torqueusers] [Mauiusers] Maui is not submitting jobs to torque
In-Reply-To: <507845C5.8030204@cern.ch>
References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl> <50772B5C.3040302@cern.ch> <5077C9A5.9090403@cern.ch> <5077F4FC.7000705@cern.ch> <20121012161315.GV8827@lbl.gov> <507845C5.8030204@cern.ch>
Message-ID: <20121012164022.GW8827@lbl.gov>

On Friday, 12 October 2012, at 17:31:01 (+0100),
Alessandra Forti wrote:

> From the adaptive computing site:
>
> http://www.adaptivecomputing.com/products/open-source/maui/
>
> "The software license allows end-user organizations to freely use,
> support, modify, and distribute the code for non-commercial
> purposes."

From maui-3.3.1/LICENSE:

4. Distribution

'End User' organizations that are academic and government agencies may
redistribute this SOFTWARE subject to the condition that the distribution
contains conspicuous publication of the acknowledgement statement found
within the LICENSE agreement distributed with this SOFTWARE.

Organizations that are not academic and government agencies including
commercial and other for-profit organizations may not redistribute this code
or derivations of this code in any form whatsoever, including parts of
SOFTWARE incorporated into other software programs without express written
permission from Cluster Resources, Inc.

Redistribution of the SOFTWARE in any form whatsoever, including parts of
the code that are incorporated into other software programs, must include a
conspicuous and appropriate publication of the following acknowledgement:

'This product was developed by Cluster Resources, Inc. Moab Scheduling
System is a trademark of Cluster Resources, Inc.'
Any redistribution or modification of the SOFTWARE must, when installed, display the above language, the copyright notice, and the warranty disclaimer. Thus, you should check with them on the actual license terms before redistributing. :-) HTH, Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From Alessandra.Forti at cern.ch Fri Oct 12 11:26:37 2012 From: Alessandra.Forti at cern.ch (Alessandra Forti) Date: Fri, 12 Oct 2012 18:26:37 +0100 Subject: [torqueusers] [Mauiusers] Maui is not submitting jobs to torque In-Reply-To: <20121012164022.GW8827@lbl.gov> References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl> <50772B5C.3040302@cern.ch> <5077C9A5.9090403@cern.ch> <5077F4FC.7000705@cern.ch> <20121012161315.GV8827@lbl.gov> <507845C5.8030204@cern.ch> <20121012164022.GW8827@lbl.gov> Message-ID: <507852CD.8060400@cern.ch> Personally I really don't need to do anything. EMI is a public project funded by the EU that distributes software to hundreds of sites. I expect they have taken care of this before putting the source code in equally public repositories. cheers alessandra On 12/10/2012 17:40, Michael Jennings wrote: > On Friday, 12 October 2012, at 17:31:01 (+0100), > Alessandra Forti wrote: > >> From the adaptive computing site: >> >> http://www.adaptivecomputing.com/products/open-source/maui/ >> >> " /The software license allows end-user organizations to freely use, >> support, modify, and distribute the code for non-commercial >> purposes." > >From maui-3.3.1/LICENSE: > > 4. 
Distribution > > 'End User' organizations that are academic and government agencies may > redistribute this SOFTWARE subject to the condition that the distribution > contains conspicuous publication of the acknowledgement statement found > within the LICENSE agreement distributed with this SOFTWARE. > > Organizations that are not academic and government agencies including > commercial and other for-profit organizations may not redistribute this code > or derivations of this code in any form whatsoever, including parts of > SOFTWARE incorporated into other software programs without express written > permission from Cluster Resources, Inc. > > Redistribution of the SOFTWARE in any form whatsoever, including parts of > the code that are incorporated into other software programs, must include a > conspicuous and appropriate publication of the following acknowledgement: > > 'This product was developed by Cluster Resources, Inc. Moab Scheduling > System is a trademark of Cluster Resources, Inc.' > > Any redistribution or modification of the SOFTWARE must, when installed, > display the above language, the copyright notice, and the warranty > disclaimer. > > > Thus, you should check with them on the actual license terms before > redistributing. :-) > > HTH, > Michael > -- Facts aren't facts if they come from the wrong people. (Paul Krugman) From nt_mahmood at yahoo.com Fri Oct 12 12:54:37 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Fri, 12 Oct 2012 11:54:37 -0700 (PDT) Subject: [torqueusers] problem with torque 4.1 (Cannot connect to default server) Message-ID: <1350068077.15105.YahooMailNeo@web111708.mail.gq1.yahoo.com> Dear all, Below is our procedure to configure torque 4.1. However at the end we got an error. 1- compile and install torque 2- put the server hostname (archie) in the /var/spool/torque/server_name 3- put the server hostname in the /var/spool/torque/server_priv/nodes file. 
This is a shared-memory machine, so the server and client are the same machine
4- run the command "pbs_server -t create" to set up the pbs_server
5- run the command "qmgr -c 'p s'" and the following error is reported

Error communicating with archie(192.160.1.100)
Cannot connect to default server host 'archie' - check pbs_server daemon and/or trqauthd.
qmgr: cannot connect to server (errno=111) Connection refused

6- starting trqauthd does not change the state and the error is reported again on submitting qmgr -c 'p s'

Any comment is appreciated.

Regards,
Mahmood

From l.flis at cyf-kr.edu.pl Sat Oct 13 18:32:07 2012
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Sun, 14 Oct 2012 02:32:07 +0200
Subject: [torqueusers] torque pbs_mom segfaults
Message-ID: <507A0807.8060304@cyf-kr.edu.pl>

Dear All,

We have observed a few pbs_mom crashes which are related to mom-to-mom communication. We haven't managed to replicate this issue, but it seems to be related to applications which use the TM interface on multiple nodes (OpenMPI) when one of the processes segfaults.

Our torque version affected by this bug is 2.5.12.

We have filed a support ticket for moab/torque; however, I'd like to hear from you if you have ever encountered such an error. Please find the text file with more details in the attachment.

It's worth noting that even if your pbs_mom or server has crashed with a segfault and didn't dump a core file, it is still possible to locate the place in the code where the crash happened. Just use /proc//smaps of the running mom to find which library or program owns the page where RIP/EIP is pointing. Calculate the relative RIP/EIP address and use addr2line to find the line of code where the program crashed. Binaries with debug symbols will be needed for that (the torque-debug package is sufficient).

Regards,
--
Lukasz Flis
-------------- next part --------------
A non-text attachment was scrubbed...
Name: torque-mom-crash-c.log Type: text/x-log Size: 10262 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20121014/e1b6bf1e/attachment-0001.bin From nt_mahmood at yahoo.com Sun Oct 14 02:22:25 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Sun, 14 Oct 2012 01:22:25 -0700 (PDT) Subject: [torqueusers] problem with torque 4.1 (Cannot connect to default server) In-Reply-To: <1350068077.15105.YahooMailNeo@web111708.mail.gq1.yahoo.com> References: <1350068077.15105.YahooMailNeo@web111708.mail.gq1.yahoo.com> Message-ID: <1350202945.63678.YahooMailNeo@web111719.mail.gq1.yahoo.com> We stuck at this point. Any tip is welcomed. ? Regards, Mahmood ----- Original Message ----- From: Mahmood Naderan To: torque cluster Cc: Sent: Friday, October 12, 2012 8:54 PM Subject: [torqueusers] problem with torque 4.1 (Cannot connect to default server) Dear all, Below is our procedure to configure torque 4.1. However at the end we got an error. 1- compile and install torque 2- put the server hostname (archie) in the /var/spool/torque/server_name 3- put the server hostname in the /var/spool/torque/server_priv/nodes file. This is a shared memeory machine so the server and client are the same machine 4- run the command "pbs_server -t create" to setup the pbs_server 5- run the command "qmgr -c 'p s'" and the following error is reported Error communicating with archie(192.160.1.100) Cannot connect to default server host 'archie' - check pbs_server daemon and/or trqauthd. qmgr: cannot connect to server? (errno=111) Connection refused 6- starting trqauthd does not change the state and the error is reported again on submitting qmgr -c 'p s' Any comment is appreciated. 
Regards, Mahmood _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From wytsang at clustertech.com Mon Oct 15 01:31:33 2012 From: wytsang at clustertech.com (Clotho Tsang) Date: Mon, 15 Oct 2012 15:31:33 +0800 Subject: [torqueusers] problem with torque 4.1 (Cannot connect to default server) In-Reply-To: <1350202945.63678.YahooMailNeo@web111719.mail.gq1.yahoo.com> References: <1350068077.15105.YahooMailNeo@web111708.mail.gq1.yahoo.com> <1350202945.63678.YahooMailNeo@web111719.mail.gq1.yahoo.com> Message-ID: You need to kill pbs_server with "kill -9". It does not shut down correctly. On 14 October 2012 16:22, Mahmood Naderan wrote: > We stuck at this point. Any tip is welcomed. > > > Regards, > Mahmood > > > > ----- Original Message ----- > From: Mahmood Naderan > To: torque cluster > Cc: > Sent: Friday, October 12, 2012 8:54 PM > Subject: [torqueusers] problem with torque 4.1 (Cannot connect to default > server) > > Dear all, > > > Below is our procedure to configure torque 4.1. However at the end we got > an error. > > 1- compile and install torque > 2- put the server hostname (archie) in the /var/spool/torque/server_name > 3- put the server hostname in the /var/spool/torque/server_priv/nodes > file. This is a shared memeory machine so the server and client are the > same machine > 4- run the command "pbs_server -t create" to setup the pbs_server > 5- run the command "qmgr -c 'p s'" and the following error is reported > > Error communicating with archie(192.160.1.100) > Cannot connect to default server host 'archie' - check pbs_server daemon > and/or trqauthd. > qmgr: cannot connect to server (errno=111) Connection refused > > 6- starting trqauthd does not change the state and the error is reported > again on submitting qmgr -c 'p s' > > > Any comment is appreciated. 
> > Regards, > Mahmood > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Clotho Tsang Senior Software Engineer Cluster Technology Limited Email: clotho at clustertech.com Tel: (852) 2655-6129 Fax: (852) 2994-2101 Website: www.clustertech.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121015/005b8acf/attachment.html From jonathan.barber at gmail.com Mon Oct 15 05:07:47 2012 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Mon, 15 Oct 2012 12:07:47 +0100 Subject: [torqueusers] [Mauiusers] Maui is not submitting jobs to torque In-Reply-To: <5077F4FC.7000705@cern.ch> References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl> <50772B5C.3040302@cern.ch> <5077C9A5.9090403@cern.ch> <5077F4FC.7000705@cern.ch> Message-ID: On 12 October 2012 11:46, Alessandra Forti wrote: > Thank you! > > I ported that from the old configuration. :/ I'll have to clean that up. > > We get the rpms from the project (EMI) repository. You might want to > give a look at their twiki > > https://twiki.cern.ch/twiki/bin/view/EMI/EMI-2 Hmm, I see the RPM in their third-party repo: http://emisoft.web.cern.ch/emisoft/dist/EMI/testing/2/sl6/x86_64/third-party/maui-3.3-4.el6.x86_64.rpm But no corresponding SRPMs. Maybe they don't ship them to avoid the kind of licensing issues discussed later in this thread. Oh well. 
Cheers -- Jonathan Barber From Alessandra.Forti at cern.ch Mon Oct 15 05:55:40 2012 From: Alessandra.Forti at cern.ch (Alessandra Forti) Date: Mon, 15 Oct 2012 12:55:40 +0100 Subject: [torqueusers] [Mauiusers] Maui is not submitting jobs to torque In-Reply-To: References: <50745914.3050601@cern.ch> <5075CCE2.1090300@cern.ch> <74EB35DC444C754DA400390F673C23D6AF6F17@sara-exch-3.ka.sara.nl> <50772B5C.3040302@cern.ch> <5077C9A5.9090403@cern.ch> <5077F4FC.7000705@cern.ch> Message-ID: <507BF9BC.4040804@cern.ch> Hi, it is possible. I never really used the source. I thought maui-devel might contain it. cheers alessandra On 15/10/2012 12:07, Jonathan Barber wrote: > On 12 October 2012 11:46, Alessandra Forti wrote: >> Thank you! >> >> I ported that from the old configuration. :/ I'll have to clean that up. >> >> We get the rpms from the project (EMI) repository. You might want to >> give a look at their twiki >> >> https://twiki.cern.ch/twiki/bin/view/EMI/EMI-2 > Hmm, I see the RPM in their third-party repo: > http://emisoft.web.cern.ch/emisoft/dist/EMI/testing/2/sl6/x86_64/third-party/maui-3.3-4.el6.x86_64.rpm > > But no corresponding SRPMs. Maybe they don't ship them to avoid the > kind of licensing issues discussed later in this thread. Oh well. > > Cheers -- Facts aren't facts if they come from the wrong people. (Paul Krugman) From rosmond at reachone.com Mon Oct 15 12:58:04 2012 From: rosmond at reachone.com (Tom Rosmond) Date: Mon, 15 Oct 2012 11:58:04 -0700 Subject: [torqueusers] upgrading from 3.0.2 to 4.1.2 with NUMA support Message-ID: <1350327484.4318.47.camel@cedar.reachone.com> I have been successfully running Torque version 3.0.2 for several months on a 2 NUMA node workstation. Recently I decided to try upgrading to 4.1.2. I essentially duplicated my 3.0.2 setup, i.e. the same 'configure' options, the same 'server_priv/nodes' and 'mom_priv/mom_layout' files. 
Here are those details: ./configure --prefix=/opt/torque --enable-numa-support --enable-libcpuset 'nodes' file fir.reachone.com np=32 num_numa_nodes=2 'mom_layout' file cpus=0-15 mem=0 cpus=16-31 mem=1 As I said, these are identical to what I used to successfully configure with 3.0.2. Yet when I try to start 'pbs_mom', I get this: -------------------------------------------------------------- root at fir:~# /opt/torque/sbin/pbs_mom pbs_mom: LOG_ERROR::No such file or directory (2) in read_layout_file, Unable to read the layout file in /var/spool/torque/mom_priv/mom.layout pbs_mom: LOG_ERROR::setup_nodeboards, Could not read layout file! ----------------------------------------------------------------- The other daemons (pbs_server, trqauthd) start successfully, so there must be something different vis-a-vis pbs_mom for NUMA configuration between 3.0.2 and 4.1.2. I have looked carefully at 'config.log' and everything seems normal. And the 'mom.layout' file is clearly present. Any suggestions? T. Rosmond From rosmond at reachone.com Mon Oct 15 15:36:26 2012 From: rosmond at reachone.com (Tom Rosmond) Date: Mon, 15 Oct 2012 14:36:26 -0700 Subject: [torqueusers] upgrading from 3.0.2 to 4.1.2 with NUMA support In-Reply-To: <1350327484.4318.47.camel@cedar.reachone.com> References: <1350327484.4318.47.camel@cedar.reachone.com> Message-ID: <1350336986.4318.92.camel@cedar.reachone.com> I see one error myself: I have 'mom_layout' instead of 'mom.layout'. However now I just get --------------------------------------------------------------- 10/15/2012 14:30:19;0002; pbs_mom.27442;Svr;pbs_mom;Torque Mom Version = 4.1.2, loglevel = 0 10/15/2012 14:30:30;0002; pbs_mom.27442;Svr;setup_program_environment;machine topology contains 2 memory nodes, 32 cpus 10/15/2012 14:30:30;0001; pbs_mom.27442;Svr;pbs_mom;LOG_ERROR::read_layout_file, nodeboard 0 has no nodeset 10/15/2012 14:30:30;0001; pbs_mom.27442;Svr;pbs_mom;LOG_ERROR::setup_nodeboards, Could not read layout file! 
---------------------------------------------------------------

So I am still missing something somewhere. What is 'read_layout_file'?

T. Rosmond

On Mon, 2012-10-15 at 11:58 -0700, Tom Rosmond wrote:
> I have been successfully running Torque version 3.0.2 for several months
> on a 2 NUMA node workstation. Recently I decided to try upgrading to
> 4.1.2. I essentially duplicated my 3.0.2 setup, i.e. the same
> 'configure' options, the same 'server_priv/nodes' and
> 'mom_priv/mom_layout' files. Here are those details:
>
> ./configure --prefix=/opt/torque --enable-numa-support
> --enable-libcpuset
>
> 'nodes' file
> fir.reachone.com np=32 num_numa_nodes=2
>
> 'mom_layout' file
> cpus=0-15 mem=0
> cpus=16-31 mem=1
>
>
> As I said, these are identical to what I used to successfully configure
> with 3.0.2.
>
> Yet when I try to start 'pbs_mom', I get this:
>
> --------------------------------------------------------------
>
> root at fir:~# /opt/torque/sbin/pbs_mom
> pbs_mom: LOG_ERROR::No such file or directory (2) in read_layout_file,
> Unable to read the layout file in /var/spool/torque/mom_priv/mom.layout
>
> pbs_mom: LOG_ERROR::setup_nodeboards, Could not read layout file!
>
> -----------------------------------------------------------------
>
> The other daemons (pbs_server, trqauthd) start successfully, so there
> must be something different vis-a-vis pbs_mom for NUMA configuration
> between 3.0.2 and 4.1.2. I have looked carefully at 'config.log' and
> everything seems normal. And the 'mom.layout' file is clearly present.
> Any suggestions?
>
> T.
Rosmond
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From ianm at uchicago.edu Wed Oct 17 09:02:27 2012
From: ianm at uchicago.edu (Ian Miller)
Date: Wed, 17 Oct 2012 15:02:27 +0000
Subject: [torqueusers] performance issues with maui & torque
Message-ID: <843FE493E7B6CA42A6C4682D63AE2D950202917C@XM-MBX-02-PROD.ad.uchicago.edu>

Hi

I have maui version 3.3.1 and torque version 2.5.7, and I seem to have a few nodes sitting idle that should be running jobs. They have been able to run jobs in the past, but the cluster has never run at 80-90%.

The output of showq is as follows (I omitted the job lists):

119 Active Jobs 130 of 344 Processors Active (37.79%)
15 of 35 Nodes Active (42.86%)

Total Jobs: 467 Active Jobs: 119 Idle Jobs: 0 Blocked Jobs: 348

When I try to force-run a job, I get:

root at beast$ qrun 209054
qrun: Execution server rejected request MSG=cannot send job to mom, state=PRERUN 209054.beast-net

30 out of the 34 worker nodes are in one queue (batch), with 2 of the 30 shared with another queue. Currently 33 of the total jobs (467) are in a different queue (short) and are running fine; the rest are in the default queue (batch).

My question is: how can I get the idle nodes to run these jobs? What might be the problem?

Qmgr: print queue batch
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_running = 200
set queue batch resources_default.neednodes = batch
set queue batch resources_default.nodes = 1
set queue batch max_user_run = 150
set queue batch keep_completed = 300
set queue batch enabled = True
set queue batch started = True

# maui.cfg 3.3.1

SERVERHOST beast
# primary admin must be first in list
ADMIN1 root

# Resource Manager Definition

RMCFG[BEAST] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank] TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL 00:00:30

SERVERPORT 42559
SERVERMODE NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html

LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT 1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY PSDEDICATED
#FSDEPTH 7
#FSINTERVAL 86400
#FSDECAY 0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF='0.01*AMEM - 2*LOAD'
NODEAVAILABILITYPOLICY COMBINED:MEM
SRCFG[Reinitz] HOSTLIST=minion1[2-9]
SRCFG[Reinitz] GROUPLIST=Reinitz

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test] 17:00:00
# SRDAYS[test] MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test] 0:30:00

# Creds:
http://supercluster.org/mauidocs/6.1fairnessoverview.html USERCFG[DEFAULT] MAXIJOB=2000 # USERCFG[DEFAULT] FSTARGET=25.0 # USERCFG[john] PRIORITY=100 FSTARGET=10.0- # GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi # CLASSCFG[batch] FLAGS=PREEMPTEE # CLASSCFG[interactive] FLAGS=PREEMPTOR Ian Miller Research Computing Administrator ianm at uchicago.edu (312) 402-6170 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121017/5f19300f/attachment-0001.html From nt_mahmood at yahoo.com Wed Oct 17 13:31:43 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Wed, 17 Oct 2012 12:31:43 -0700 (PDT) Subject: [torqueusers] problem with torque 4.1 (Cannot connect to default server) In-Reply-To: References: <1350068077.15105.YahooMailNeo@web111708.mail.gq1.yahoo.com> <1350202945.63678.YahooMailNeo@web111719.mail.gq1.yahoo.com> Message-ID: <1350502303.80632.YahooMailNeo@web111709.mail.gq1.yahoo.com> Ok thanks for your help ? Regards, Mahmood ________________________________ From: Clotho Tsang To: Mahmood Naderan ; Torque Users Mailing List Cc: maui Sent: Monday, October 15, 2012 9:31 AM Subject: Re: [torqueusers] problem with torque 4.1 (Cannot connect to default server) You need to kill pbs_server with "kill -9". It does not shut down correctly. On 14 October 2012 16:22, Mahmood Naderan wrote: We stuck at this point. Any tip is welcomed. > >? >Regards, >Mahmood > > > > >----- Original Message ----- >From: Mahmood Naderan >To: torque cluster >Cc: >Sent: Friday, October 12, 2012 8:54 PM >Subject: [torqueusers] problem with torque 4.1 (Cannot connect to default server) > >Dear all, > > >Below is our procedure to configure torque 4.1. However at the end we got an error. > >1- compile and install torque >2- put the server hostname (archie) in the /var/spool/torque/server_name >3- put the server hostname in the /var/spool/torque/server_priv/nodes file. 
This is a shared memeory machine so the server and client are the same machine
>4- run the command "pbs_server -t create" to setup the pbs_server
>5- run the command "qmgr -c 'p s'" and the following error is reported
>
>Error communicating with archie(192.160.1.100)
>Cannot connect to default server host 'archie' - check pbs_server daemon and/or trqauthd.
>qmgr: cannot connect to server (errno=111) Connection refused
>
>6- starting trqauthd does not change the state and the error is reported again on submitting qmgr -c 'p s'
>
>
>Any comment is appreciated.
>
>Regards,
>Mahmood
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers
>
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
Clotho Tsang
Senior Software Engineer
Cluster Technology Limited
Email: clotho at clustertech.com
Tel: (852) 2655-6129
Fax: (852) 2994-2101
Website: www.clustertech.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121017/58f3543a/attachment.html

From nt_mahmood at yahoo.com Wed Oct 17 13:32:03 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Wed, 17 Oct 2012 12:32:03 -0700 (PDT)
Subject: [torqueusers] low network utilization
Message-ID: <1350502323.41278.YahooMailNeo@web111703.mail.gq1.yahoo.com>

Dear all,

I have noticed that when I submit a job on a working node, the network speed is about 20Mb. That is quite slow because the switch speed is 1000Mb. That causes the processes to be in "D" state, and the CPU usage is well below 100%.

I thought there was a problem with NFS; however, the stats show about 1.3k requests per second, which is not really high.
Maybe Torque transfers data (from worker to server which has disks) quickly.
How can I investigate more? ? Regards, Mahmood -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121017/acbdf99b/attachment.html From wytsang at clustertech.com Wed Oct 17 19:03:03 2012 From: wytsang at clustertech.com (Clotho Tsang) Date: Thu, 18 Oct 2012 09:03:03 +0800 Subject: [torqueusers] Torque 4.1.2 does not accept hostname with '-' Message-ID: The following problem is found at Torque 4.1.2, but not 4.1.0. At RHEL6, if the headnode hostname consists of char "-", jobs will keep running but not stop, checkjob shows message "cannot start job - RM failure, rc: 15033, msg: 'End of File' " The problem is not found if the hostname has no "-". ======================================= [root at hp-mgmt-1 maui]# checkjob 3309 checking job 3309 State: Running Creds: user:huadi group:group1 class:batch qos:DEFAULT WallTime: 00:03:07 of 1:00:00 SubmitTime: Thu Oct 11 11:14:00 (Time Queued Total: 00:00:02 Eligible: 00:00:02) StartTime: Thu Oct 11 11:14:02 StartDate: -00:03:08 Thu Oct 11 11:14:03 Total Tasks: 5 Req[0] TaskCount: 5 Partition: DEFAULT Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 Opsys: [NONE] Arch: [NONE] Features: [NONE] Allocated Nodes: [hp-compute-19:5] IWD: [NONE] Executable: [NONE] Bypass: 0 StartCount: 1 PartitionMask: [ALL] Flags: RESTARTABLE Reservation '3309' (-00:03:09 -> 00:56:51 Duration: 1:00:00) Messages: cannot start job - RM failure, rc: 15033, msg: 'End of File' PE: 5.00 StartPriority: 1 -- Clotho Tsang Senior Software Engineer Cluster Technology Limited Email: clotho at clustertech.com Tel: (852) 2655-6129 Fax: (852) 2994-2101 Website: www.clustertech.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121018/b7e2ad4c/attachment.html From jonathan.barber at gmail.com Thu Oct 18 03:09:23 2012 From: jonathan.barber at gmail.com (Jonathan Barber) Date: Thu, 18 Oct 2012 10:09:23 +0100 Subject: [torqueusers] low network utilization In-Reply-To: <1350502323.41278.YahooMailNeo@web111703.mail.gq1.yahoo.com> References: <1350502323.41278.YahooMailNeo@web111703.mail.gq1.yahoo.com> Message-ID: On 17 October 2012 20:32, Mahmood Naderan wrote: > Dear all, > I have noticed that when I submit a job on a working node, the network speed > is about 20Mb. That is quite slow because the switch speed is 1000Mb. That > causes the processes to be in "D" state and the cpu usages are much below > 100%. This sounds like you are generating more IOPS than your storage system can deliver, probably because you are doing many small random requests. You should first check that the server NIC and the switch port are both running at 1GbE (using "ethtool" on the host and connecting to the switch and verifying the port status). On the NFS server (assuming linux) check the block device that supports the NFS exported file system with "iostat -kx 1". If you have ~100% in the "%util" column then you are limited by the storage system. You can monitor the host network throughput with "iftop" (assuming linux). You can get a crude idea of your baseline NFS performance by using dd with large (larger than the largest amount of memory available to the server and client) files and reading / writing them from the client. For better measurements, I suggest fio: http://freecode.com/projects/fio although it is a lot more complicated to interpret the results. Cheers > I thought there is a problem with NFS however the stats shows about 1.3k > requests per second which is not really high. > Maybe Torque transfers data (from worker to server which has disks) quickly. > > How can I investigate more? 
>
> Regards,
> Mahmood
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
Jonathan Barber

From sm4082 at nyu.edu Thu Oct 18 07:58:41 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Thu, 18 Oct 2012 09:58:41 -0400
Subject: [torqueusers] resources_used. mem problems
Message-ID:

Hello,

We have Torque 2.5.12 on one of our new clusters. The OS is Red Hat Enterprise Linux Server release 6.2 (Santiago). We installed OpenMPI version 1.4.5 (compiled with Intel compilers).

Strangely, our parallel jobs that use OpenMPI 1.4.5 report resources_used.mem as the sum of the memory used on all the nodes in the job, instead of reporting only the memory used on the mother superior node (rank=0). But if we run the same job with MVAPICH2, resources_used.mem shows only the values from the node with rank=0. On our old clusters, with OpenMPI 1.4.3 and Torque 2.5.11, we likewise see the values from just the mother superior node (rank=0).

Overall, this is very problematic because we ask Moab/Torque to kill jobs that use more memory than they requested or were allocated. We use a qsub wrapper to define memory for every job, to avoid node crashes and similar problems. Since the reported value covers the memory used on all the nodes (let's say 100 nodes), the sum is huge, far bigger than the memory on any individual node, and so the job gets killed with a message saying that it has exceeded its allocated memory.

Has anyone seen this behavior on your clusters? Given that it works fine with MVAPICH2, I think it has to do with OpenMPI 1.4.5 (as it works fine with 1.4.3). We are testing 1.4.3 on our new clusters and plan to test 1.4.5 on our old clusters. But I thought it would be useful to know whether anyone has any thoughts on it. Please let me know.

Thanks,
Sreedhar.
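
The mismatch described in the message above can be put in numbers. What follows is only a minimal sketch in Python, not anything Torque or Moab runs: the function names are made up for illustration, the 100-node count comes from the message, and the 40 GB per-node usage and 46 GB request are hypothetical values chosen to show the effect.

```python
def reported_mem_gb(per_node_used_gb, sum_all_nodes):
    """Return the job-level memory figure under each reporting style.

    sum_all_nodes=True  -> usage on every node of the job is added up
    sum_all_nodes=False -> only the mother superior node (rank 0) is counted
    """
    if sum_all_nodes:
        return sum(per_node_used_gb)
    return per_node_used_gb[0]


def exceeds_limit(reported_gb, requested_gb):
    """True if the reported figure is above the requested mem limit."""
    return reported_gb > requested_gb


# Hypothetical numbers: 100 nodes, 40 GB used on each, 46 GB requested.
per_node = [40] * 100
requested = 46

for sum_all in (True, False):
    reported = reported_mem_gb(per_node, sum_all)
    print(sum_all, reported, exceeds_limit(reported, requested))
```

With these assumed values the summed figure is far above a limit that was sized for a single node, while the rank-0 figure stays below it, which is the situation in which a job that is well behaved on every node would still be killed.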
From brockp at umich.edu Thu Oct 18 08:08:49 2012 From: brockp at umich.edu (Brock Palen) Date: Thu, 18 Oct 2012 10:08:49 -0400 Subject: [torqueusers] resources_used. mem problems In-Reply-To: References: Message-ID: <2CAA67FA-5036-4300-A3CC-B8CFB287D29B@umich.edu> Sreedhar, Check that you enabled TM support in your OpenMPI build: We are running OMPI 1.6 but here is what ompi_info shows us: ompi_info | grep tm MCA ras: tm (MCA v2.0, API v2.0, Component v1.6) MCA plm: tm (MCA v2.0, API v2.0, Component v1.6) MCA ess: tm (MCA v2.0, API v2.0, Component v1.6) Thus with TM enabled mpirun for openMPI will use the sister moms to start the ranks on the other nodes. You can see this with pstree, If you look at your mpich2 jobs if the sister moms don't show processes but you see rather sshd -- hydra_proxy -- mympiprocess Your mpiexec for mpich2 is not using TM to start the jobs. The simplest route for this is to use mpiexec from osc and not use the mpiexec that comes with mpich2: https://www.osc.edu/~djohnson/mpiexec/ Though I think the hydra luancher in mpich2 added tm bootstrap support see: http://wiki.mcs.anl.gov/mpich2/index.php/Using_the_Hydra_Process_Manager I think you might want to jump on the mpich2 list and ask about PBS TM support. Also note if you go to torque 4 currently the tm+mpiexec (osc) stuff is all broken, stick with 2.5 for a few more months. Brock Palen www.umich.edu/~brockp CAEN Advanced Computing brockp at umich.edu (734)936-1985 On Oct 18, 2012, at 9:58 AM, Sreedhar Manchu wrote: > Hello, > > We have Torque 2.5.12 on one of our new cluster. OS is Red Hat Enterprise Linux Server release 6.2 (Santiago). We installed OpenMPI version 1.4.5 (compiled with intel compilers). > > Strangely, with our parallel jobs that are using OpenMPI 1.4.5 are reporting resources_used. men as a sum of the memory being used on all the nodes in the job in stead of reporting the memory that's being used just on mother superior node (rank=0). 
But if we run the same job with MVAPICH2 then we are seeing the values only from the node with rank=0 for resources_used.mem. Where as on our old clusters, with version1.4.3 and Torque 2.5.11 we are seeing the values just from mother superior node (rank=0). > > Overall, this is very problematic because we ask Moab/Torque to kill the jobs that use the memory more than they requested or are allocated. We use qsub wrapper to define memory for each and every job just to avoid node crashing, etc, etc. Since it is reporting all the memory that's being used on all the nodes (let's say 100 nodes), the sum is huge and it's way bigger than the memory on each individual node and so job is getting killed saying that it has exceeded the memory allocated. > > Has anyone seen this behavior on your clusters? Given that it is working fine with MVAPICH2 I'm thinking it has to do with OpenMPI 1.4.5 (as it works fine with 1.4.3). We are testing 1.4.3 on our new clusters and plan to test 1.4.5 on our old clusters. But I thought it'd be useful to know whether anyone has any thoughts on it. Please let me know. > > Thanks, > Sreedhar. > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From tbaer at utk.edu Thu Oct 18 08:09:01 2012 From: tbaer at utk.edu (Troy Baer) Date: Thu, 18 Oct 2012 10:09:01 -0400 Subject: [torqueusers] resources_used. mem problems In-Reply-To: References: Message-ID: <1350569341.15740.762.camel@browncoat.jics.utk.edu> On Thu, 2012-10-18 at 09:58 -0400, Sreedhar Manchu wrote: > Has anyone seen this behavior on your clusters? Given that it is > working fine with MVAPICH2 I'm thinking it has to do with OpenMPI > 1.4.5 (as it works fine with 1.4.3). We are testing 1.4.3 on our new > clusters and plan to test 1.4.5 on our old clusters. But I thought > it'd be useful to know whether anyone has any thoughts on it. Please > let me know. 
It sounds to me that OpenMPI is doing the right thing here, in that it's launching processes through the TORQUE TM API so that its resource usage is being accounted for accurately. OTOH, I'm guessing that your MVAPICH2 install is using either rsh or ssh to start remote processes, which does *NOT* handle resource usage accounting (or signal delivery) correctly.

I would recommend getting your MVAPICH2 install to use the TM API to launch processes, either using the mpiexec.hydra script that likely comes with MVAPICH2 or using OSC mpiexec [1].

[1] https://www.osc.edu/~djohnson/mpiexec/index.php

--Troy
--
Troy Baer, Senior HPC System Administrator
National Institute for Computational Sciences, University of Tennessee
http://www.nics.tennessee.edu/
Phone: 865-241-4233

From sm4082 at nyu.edu Thu Oct 18 09:59:16 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Thu, 18 Oct 2012 11:59:16 -0400
Subject: [torqueusers] resources_used. mem problems
In-Reply-To: <1350569341.15740.762.camel@browncoat.jics.utk.edu>
References: <1350569341.15740.762.camel@browncoat.jics.utk.edu>
Message-ID:

Hi Brock and Troy,

First, I thank both of you for your emails. On our old cluster OpenMPI wasn't coupled with the Torque TM API, whereas it is on our new cluster. Just as Brock suggested, I grepped for TM and found the difference. As Troy suggested, OpenMPI is accounting accurately on our new cluster. But for the reasons I mentioned before, we don't want this behavior. So we moved the 4 files matching lib/openmpi/*tm* out of the way, and now it reports the memory used only on the node with rank=0.

Even though it is simple to change the memory statement from "#PBS -l mem=46GB" to "#PBS -l mem=(number of nodes requested*46)GB", we would like to keep it the way we do it on our old clusters.

But I see that it is quite useful to have the real used memory for the entire job (the sum over all the nodes) rather than just on the node with rank=0. The main thing is that we don't have to pass the host file.
Maybe there are other benefits we'll get from having it coupled with the TM API. What I'm not sure about is whether Moab/Torque kills the job if it tries to use more memory than it is allocated on one of the nodes other than the one with rank=0. I know that it does if the job tries to use more memory than it is allocated on the node with rank=0.

Thanks,
Sreedhar.

On Oct 18, 2012, at 10:09 AM, Troy Baer wrote:

> On Thu, 2012-10-18 at 09:58 -0400, Sreedhar Manchu wrote:
>> Has anyone seen this behavior on your clusters? Given that it is
>> working fine with MVAPICH2 I'm thinking it has to do with OpenMPI
>> 1.4.5 (as it works fine with 1.4.3). We are testing 1.4.3 on our new
>> clusters and plan to test 1.4.5 on our old clusters. But I thought
>> it'd be useful to know whether anyone has any thoughts on it. Please
>> let me know.
>
> It sounds to me that OpenMPI is doing the right thing here, in that it's
> launching processes through the TORQUE TM API so that its resource usage
> is being accounted accurately. OTOH, I'm guessing that your MVAPICH2
> install is using either rsh or ssh to start remote processes, which does
> *NOT* handle resource usage accounting (or signal delivery) correctly.
>
> I would recommend getting your MVAPICH2 install to use the TM API to
> launch processes, either using the mpiexec.hydra script that likely
> comes with MVAPICH2 or using OSC mpiexec [1].
>
> [1] https://www.osc.edu/~djohnson/mpiexec/index.php
>
> --Troy
> --
> Troy Baer, Senior HPC System Administrator
> National Institute for Computational Sciences, University of Tennessee
> http://www.nics.tennessee.edu/
> Phone: 865-241-4233
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From mej at lbl.gov Fri Oct 19 16:34:22 2012
From: mej at lbl.gov (Michael Jennings)
Date: Fri, 19 Oct 2012 15:34:22 -0700
Subject: [torqueusers] Torque 4.1.2 does not accept hostname with '-'
In-Reply-To:
References:
Message-ID: <20121019223420.GF8827@lbl.gov>

On Thursday, 18 October 2012, at 09:03:03 (+0800),
Clotho Tsang wrote:

> The following problem is found at Torque 4.1.2, but not 4.1.0.
>
> At RHEL6, if the headnode hostname consists of char "-",
> jobs will keep running but not stop, checkjob shows message
> "cannot start job - RM failure, rc: 15033, msg: 'End of File' "
>
> The problem is not found if the hostname has no "-".

We are seeing the same issue at our site. (Our master node's name
ends in "-00") We have a ticket open with Adaptive for this, but so
far it's proved very elusive.

Looking at the code, the only place that really sticks out to me where
'-' is handled specially (at least in terms of hostnames) has to do
with NUMA. NUMA nodes appear to be named using a hyphen followed by
one or more digits. I noticed that your hostname also had a hyphen
followed by a digit. Have you by any chance tried a hostname with
hyphens but no numbers in it?

Have you had any luck tracking down the issue in the code? I've been
looking at it, but I don't see anything jumping out at me.
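A quick way to test that theory against a list of hostnames is sketched below. It only illustrates the naming pattern described above (a hyphen followed by one or more digits at the end of the short hostname); the function name, the regular expression, and the example.org name are assumptions made for this example, not code or names taken from TORQUE.

```python
import re

# Hypothetical helper: the pattern is an assumption based on the description
# above (NUMA-style names ending in "-<digits>"), not TORQUE's actual parser.
NUMA_STYLE_SUFFIX = re.compile(r"-\d+$")


def ends_in_hyphen_digits(hostname: str) -> bool:
    """Return True if the short hostname ends in a hyphen followed by digits."""
    short_name = hostname.split(".", 1)[0]  # drop any domain part
    return NUMA_STYLE_SUFFIX.search(short_name) is not None


if __name__ == "__main__":
    for name in ("master-00", "head-node", "node-12.example.org", "gb-r7n1"):
        print(name, ends_in_hyphen_digits(name))
```

Hostnames for which this returns True would be the ones worth retesting under a name that has hyphens but no trailing digits.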
Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From nt_mahmood at yahoo.com Sat Oct 20 06:56:43 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Sat, 20 Oct 2012 05:56:43 -0700 (PDT)
Subject: [torqueusers] low network utilization
In-Reply-To:
References: <1350502323.41278.YahooMailNeo@web111703.mail.gq1.yahoo.com>
Message-ID: <1350737803.1822.YahooMailNeo@web111703.mail.gq1.yahoo.com>

>This sounds like you are generating more IOPS than your storage system
>can deliver, probably because you are doing many small random
>requests.

The cluster is diskless so all IO operations are done on the server. I started "iostat 1" on the server before running the application on the compute node. As you can see, the average user CPU usage is about 0%, then it rises to about 22%, and then drops back to about 0% once I terminate the application on the node. The thing is, the number of read/write operations per second is almost zero during the application run. So I wonder why the user CPU usage on the server is about 20%.

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               4.00         0.00        68.00          0         68
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    0.19    0.00    0.00   99.31

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.81    0.00    2.89    0.00    0.00   75.30

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          20.93    0.00    4.32    0.00    0.00   74.75

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.97    0.00    3.20    0.00    0.00   74.83

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.82    0.00    3.39    0.00    0.00   74.80

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22.49    0.00    2.82    0.00    0.00   74.69

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.89    0.00    3.26    0.25    0.00   74.59

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               4.00         0.00        88.00          0         88
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.29    0.00    4.01    0.00    0.00   74.70

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.96    0.00    3.20    0.00    0.00   74.84

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22.07    0.00    3.13    0.00    0.00   74.80

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22.35    0.00    2.82    0.00    0.00   74.83

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22.15    0.00    3.01    0.00    0.00   74.84

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.88    0.00    3.39    0.00    0.00   74.73

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.97    0.00    3.14    0.00    0.00   74.89

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.87    0.00    3.38    0.00    0.00   74.75

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22.13    0.00    3.07    0.00    0.00   74.80

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.63    0.00    0.69    0.00    0.00   98.68

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.00    0.00    0.00   99.94

Regards,
Mahmood

________________________________
From: Jonathan Barber
To: Mahmood Naderan ; Torque Users Mailing List
Sent: Thursday, October 18, 2012 11:09 AM
Subject: Re: [torqueusers] low network utilization

On 17 October 2012 20:32, Mahmood Naderan wrote:
> Dear all,
> I have noticed that when I submit a job on a working node, the network speed
> is about 20Mb. That is quite slow because the switch speed is 1000Mb. That
> causes the processes to be in "D" state and the cpu usages are much below
> 100%.

This sounds like you are generating more IOPS than your storage system
can deliver, probably because you are doing many small random
requests.

You should first check that the server NIC and the switch port are
both running at 1GbE (using "ethtool" on the host and connecting to
the switch and verifying the port status). On the NFS server (assuming
linux) check the block device that supports the NFS exported file
system with "iostat -kx 1". If you have ~100% in the "%util" column
then you are limited by the storage system.

You can monitor the host network throughput with "iftop" (assuming
linux).

You can get a crude idea of your baseline NFS performance by using dd
with large (larger than the largest amount of memory available to the
server and client) files and reading / writing them from the client.
For better measurements, I suggest fio:
http://freecode.com/projects/fio

although it is a lot more complicated to interpret the results.

Cheers

> I thought there is a problem with NFS however the stats shows about 1.3k
> requests per second which is not really high.
> Maybe Torque transfers data (from worker to server which has disks) quickly.
>
> How can I investigate more?
>
> Regards,
> Mahmood
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
Jonathan Barber
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121020/95578993/attachment-0001.html

From nt_mahmood at yahoo.com Sat Oct 20 07:23:07 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Sat, 20 Oct 2012 06:23:07 -0700 (PDT)
Subject: [torqueusers] low network utilization
In-Reply-To: <1350737803.1822.YahooMailNeo@web111703.mail.gq1.yahoo.com>
References: <1350502323.41278.YahooMailNeo@web111703.mail.gq1.yahoo.com> <1350737803.1822.YahooMailNeo@web111703.mail.gq1.yahoo.com>
Message-ID: <1350739387.21582.YahooMailNeo@web111701.mail.gq1.yahoo.com>

Really sorry for the inconvenience... I made a mistake in my previous reply, so the iostat output was incorrect. Please ignore it. I ran the simulation again. The actual setup is:

1- The application runs on the compute node.
2- I ran "iostat 1" on the server. While it was printing every second, I ran the application on the compute node and then terminated it.
3- I ran "top" on the compute node.

The iostat output looks like:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               7.00         0.00       140.00          0        140
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.13    0.00    0.00   98.87

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              29.00         0.00       128.00          0        128
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.25    0.00    0.00   99.69

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.63    0.00    0.00   99.31

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.00    0.00    0.00   99.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.50    0.00    0.00   98.50

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.75    0.38    0.00   98.81

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               9.00         0.00       164.00          0        164
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    1.32    0.00    0.00   98.62

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    1.07    0.00    0.00   98.87

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.63    0.00    0.00   99.37

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    2.51    0.00    0.00   97.42

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    2.14    0.13    0.00   97.74

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               8.00         0.00        76.00          0         76
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.19    0.00    0.00   99.75

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0

Also, the top output on the compute node while the application was running:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10570 mahm      20   0  296m  92m  14m D   42  0.1   0:06.13 atIco
10568 mahm      20   0  298m  94m  14m R   32  0.1   0:04.02 atIco
10567 mahm      20   0  296m  92m  14m D   23  0.1   0:04.41 atIco
10569 mahm      20   0  298m  93m  14m D   21  0.1   0:03.63 atIco

Any feedback is appreciated.

Regards,
Mahmood

________________________________
From: Mahmood Naderan
To: Jonathan Barber
Cc: torque cluster
Sent: Saturday, October 20, 2012 2:56 PM
Subject: Re: [torqueusers] low network utilization

>This sounds like you are generating more IOPS than your storage system
>can deliver, probably because you are doing many small random
>requests.

The cluster is diskless so all IO operations are done on the server.
[...]
0.00????????? 0????????? 0 avg-cpu:? %user?? %nice %system %iowait? %steal?? %idle ????????? 22.13??? 0.00??? 3.07??? 0.00??? 0.00?? 74.80 Device:??????????? tps??? kB_read/s??? kB_wrtn/s??? kB_read??? kB_wrtn sda?????????????? 0.00???????? 0.00???????? 0.00????????? 0????????? 0 sdb?????????????? 0.00???????? 0.00???????? 0.00????????? 0????????? 0 sdc?????????????? 0.00???????? 0.00???????? 0.00????????? 0????????? 0 sdd?????????????? 0.00???????? 0.00???????? 0.00????????? 0????????? 0 sde?????????????? 0.00???????? 0.00???????? 0.00????????? 0????????? 0 dm-0????????????? 0.00???????? 0.00???????? 0.00????????? 0????????? 0 avg-cpu:? %user?? %nice %system %iowait? %steal?? %idle ?????????? 0.63??? 0.00??? 0.69??? 0.00??? 0.00?? 98.68 Device:??????????? tps??? kB_read/s??? kB_wrtn/s??? kB_read??? kB_wrtn sda?????????????? 0.00???????? 0.00???????? 0.00????????? 0????????? 0 sdb?????????????? 0.00???????? 0.00???????? 0.00????????? 0????????? 0 sdc?????????????? 0.00???????? 0.00???????? 0.00????????? 0????????? 0 sdd?????????????? 0.00???????? 0.00???????? 0.00????????? 0????????? 0 sde?????????????? 0.00???????? 0.00???????? 0.00????????? 0????????? 0 dm-0????????????? 0.00???????? 0.00???????? 0.00????????? 0????????? 0 avg-cpu:? %user?? %nice %system %iowait? %steal?? %idle ?????????? 0.06??? 0.00??? 0.00??? 0.00??? 0.00?? 99.94 ? Regards, Mahmood ________________________________ From: Jonathan Barber To: Mahmood Naderan ; Torque Users Mailing List Sent: Thursday, October 18, 2012 11:09 AM Subject: Re: [torqueusers] low network utilization On 17 October 2012 20:32, Mahmood Naderan wrote: > Dear all, > I have noticed that when I submit a job on a working node, the network speed > is about 20Mb. That is quite slow because the switch speed is 1000Mb.? That > causes the processes to be in "D" state and the cpu usages are much below > 100%. 

This sounds like you are generating more IOPS than your storage system
can deliver, probably because you are doing many small random requests.

You should first check that the server NIC and the switch port are both
running at 1GbE (using "ethtool" on the host and connecting to the
switch and verifying the port status).

On the NFS server (assuming Linux) check the block device that supports
the NFS exported file system with "iostat -kx 1". If you have ~100% in
the "%util" column then you are limited by the storage system.

You can monitor the host network throughput with "iftop" (assuming
Linux).

You can get a crude idea of your baseline NFS performance by using dd
with large (larger than the largest amount of memory available to the
server and client) files and reading / writing them from the client. For
better measurements, I suggest fio:
http://freecode.com/projects/fio
although it is a lot more complicated to interpret the results.

Cheers

> I thought there is a problem with NFS however the stats shows about 1.3k
> requests per second which is not really high.
> Maybe Torque transfers data (from worker to server which has disks) quickly.
>
> How can I investigate more?
>
> Regards,
> Mahmood

--
Jonathan Barber

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121020/c0964d2c/attachment-0001.html

From ezellma at ornl.gov Sat Oct 20 22:41:14 2012
From: ezellma at ornl.gov (Ezell, Matthew A.)
Date: Sun, 21 Oct 2012 00:41:14 -0400
Subject: [torqueusers] Torque 4.1.2 does not accept hostname with '-'
Message-ID: <2594EA566E9F0B4B9DE1C98338260302C64487CE03@EXCHMBA.ornl.gov>

>> At RHEL6, if the headnode hostname consists of char "-",
>> jobs will keep running but not stop, checkjob shows message
>> "cannot start job - RM failure, rc: 15033, msg: 'End of File' "
>>
>> The problem is not found if the hostname has no "-".

> Have you had any luck tracking down the issue in the code? I've been
> looking at it, but I don't see anything jumping out at me.

We found this on our test system. The problem was in the 4.1.2 "subjob"
feature. We developed patches and sent them to Adaptive. You can either
pull r6794 and r6799 from the subversion branch '4.1-fixes' or just wait
until 4.1.3 is released.

Good luck,
~Matt

---
Matt Ezell
HPC Systems Administrator
Oak Ridge National Laboratory

From mej at lbl.gov Mon Oct 22 10:52:42 2012
From: mej at lbl.gov (Michael Jennings)
Date: Mon, 22 Oct 2012 09:52:42 -0700
Subject: [torqueusers] Torque 4.1.2 does not accept hostname with '-'
In-Reply-To: <2594EA566E9F0B4B9DE1C98338260302C64487CE03@EXCHMBA.ornl.gov>
References: <2594EA566E9F0B4B9DE1C98338260302C64487CE03@EXCHMBA.ornl.gov>
Message-ID: <20121022165241.GN8827@lbl.gov>

On Sunday, 21 October 2012, at 00:41:14 (-0400),
Ezell, Matthew A. wrote:

> We found this on our test system. The problem was in the 4.1.2
> "subjob" feature. We developed patches and sent them to Adaptive.
> You can either pull r6794 and r6799 from the subversion branch
> '4.1-fixes' or just wait until 4.1.3 is released.

Thanks for the pointers. Unfortunately, we've been running with those
changes in place for quite some time now, and it doesn't seem to have
fixed the problem. So I guess we'll keep looking.

For what it's worth, I found this error some time ago (which, based on
the revision numbers you gave me, came from your patch).
It doesn't seem to fix the issue either, but it's still likely needed
(because dash will always be exactly equal to dot as a result):

Index: src/server/job_func.c
===================================================================
--- src/server/job_func.c (revision 6967)
+++ src/server/job_func.c (working copy)
@@ -2197,7 +2197,7 @@
    * the get the external sub-job */
   if (get_subjob == TRUE)
     {
-    dot = strchr(jobid, '-');
+    dot = strchr(jobid, '.');

     if (((dash = strchr(jobid, '-')) != NULL) &&
         (dot != NULL) &&

HTH,
Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From ezellma at ornl.gov Mon Oct 22 11:21:02 2012
From: ezellma at ornl.gov (Ezell, Matthew A.)
Date: Mon, 22 Oct 2012 13:21:02 -0400
Subject: [torqueusers] Torque 4.1.2 does not accept hostname with '-'
In-Reply-To: <20121022165241.GN8827@lbl.gov>
Message-ID:

On 10/22/12 12:52 PM, "Michael Jennings" wrote:

>Thanks for the pointers. Unfortunately, we've been running with those
>changes in place for quite some time now, and it doesn't seem to have
>fixed the problem. So I guess we'll keep looking.

Interesting. Without the first patch, I couldn't start jobs. Without the
second, jobs never showed as completed or disappeared from the server.
Do your jobs fail to start, or fail to exit? Is there any difference if
you do a single-node job versus a multi-node job?

If you turn logging up to 7 on the pbs_server and pbs_moms, is there
anything interesting written to the logs?

>For what it's worth, I found this error some time ago (which, based on
>the revision numbers you gave me, came from your patch).
>It doesn't seem to fix the issue either, but it's still likely needed
>(because dash will always be exactly equal to dot as a result):
>
>Index: src/server/job_func.c
>===================================================================
>--- src/server/job_func.c (revision 6967)
>+++ src/server/job_func.c (working copy)
>@@ -2197,7 +2197,7 @@
>    * the get the external sub-job */
>   if (get_subjob == TRUE)
>     {
>-    dot = strchr(jobid, '-');
>+    dot = strchr(jobid, '.');
>
>     if (((dash = strchr(jobid, '-')) != NULL) &&
>         (dot != NULL) &&

The patch I sent in had a '.' in there, but apparently that isn't what
got committed. As written, that's going to break the heterogeneous
subjob feature. But if you aren't using that feature, this should still
protect you from the original defect.

~Matt

---
Matt Ezell
HPC Systems Administrator
Oak Ridge National Laboratory

From jwilkinson at stoneeagle.com Mon Oct 22 15:48:11 2012
From: jwilkinson at stoneeagle.com (Jack Wilkinson)
Date: Mon, 22 Oct 2012 21:48:11 +0000
Subject: [torqueusers] files being written to wrong batch node.
Message-ID: <00051F5C670B8444B35CB2B31B9B1D090C53702D@se-ex2.stoneeagle.com>

We've just configured a development batch farm so that our development
folk don't trash the production environment. It's one head box and two
batch boxes, all running Centos 6.3, configured with torque 2.57 and
maui 3.3-4.

Everything is running as expected except for the following issue.

In listings four and five, notice that the "11111111" and "22222222" are
correctly attached to the appropriate submit file; however, the results
in the output files show that each job ran on the "opposite" node from
the one requested.

Then, looking at listings six and seven from the batch boxes, the file
names that were written to those boxes are the reverse of the requested
host, EXCEPT that the content of the host file shows that it was run on
the host that it is being listed on.

This is utterly screwy!! Anyone have any idea?

Kind regards,
jack

________________________________
one

$ cat go.sh
qsub one-1.sbm
qsub one-2.sbm

________________________________
two

$ cat one-1.sbm
#!/bin/bash
#PBS -N testone-1.1234
#PBS -l nodes=1:ppn=1
#PBS -l nodes=srvdevbatch01
###PBS -m e
###PBS -M jwilkinson
#PBS -o /home/jwilkinson/onetest/one-1.out
#PBS -e /home/jwilkinson/onetest/one-1.err
#PBS -l nice=19
#PBS -l walltime=00:01:00
hostname
hostname > one-1.host
echo "11111111111111111111111111111111"
date
ls -lRa /SRVFS/dev-bogner | wc
ls -lR /SRVFS/dev-bogner/PRINT > one-1.ls
sleep 15
date
exit 0

________________________________
three

$ cat one-2.sbm
#!/bin/bash
#PBS -N testone-2.1234
#PBS -l nodes=1:ppn=1
#PBS -l nodes=srvdevbatch02
###PBS -m e
###PBS -M jwilkinson
#PBS -o /home/jwilkinson/onetest/one-2.out
#PBS -e /home/jwilkinson/onetest/one-2.err
#PBS -l nice=19
#PBS -l walltime=00:01:00
hostname
hostname > one-2.host
echo "22222222222222222222222222222222"
date
ls -lRa /SRVFS/dev-bogner | wc
ls -lR /SRVFS/dev-bogner/PRINT > one-2.ls
sleep 15
date
exit 0

________________________________
four

$ cat one-1.out
srvDevBatch02
11111111111111111111111111111111
Mon Oct 22 14:43:02 CDT 2012
   3280   20241  174637
Mon Oct 22 14:43:18 CDT 2012

________________________________
five

$ cat one-2.out
srvDevBatch01
22222222222222222222222222222222
Mon Oct 22 14:43:02 CDT 2012
   3280   20241  174637
Mon Oct 22 14:43:18 CDT 2012

________________________________
six

On srvbatch01:
$ ls -l
-rw-rw-r--. 1 jwilkinson jwilkinson    14 Oct 22 14:43 one-2.host
-rw-rw-r--. 1 jwilkinson jwilkinson 39072 Oct 22 14:43 one-2.ls
$ cat one-2.host
srvDevBatch01

________________________________
seven

On srvbatch02:
$ ls -l
-rw-rw-r--. 1 jwilkinson jwilkinson    14 Oct 22 14:43 one-1.host
-rw-rw-r--. 1 jwilkinson jwilkinson 39072 Oct 22 14:43 one-1.ls
$ cat one-1.host
srvDevBatch02

Jack Wilkinson, Programmer Services | VPay(r)
P: 972.367-6622
jwilkinson at stoneeagle.com
www.stoneeagle.com
www.vpayusa.com
111 W.
Spring Valley Rd., #100
Richardson, TX 75081

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121022/626e602a/attachment-0001.html

From mej at lbl.gov Mon Oct 22 18:33:00 2012
From: mej at lbl.gov (Michael Jennings)
Date: Mon, 22 Oct 2012 17:33:00 -0700
Subject: [torqueusers] Torque 4.1.2 does not accept hostname with '-'
In-Reply-To:
References: <20121022165241.GN8827@lbl.gov>
Message-ID: <20121023003259.GV8827@lbl.gov>

On Monday, 22 October 2012, at 13:21:02 (-0400),
Ezell, Matthew A. wrote:

> Interesting. Without the first patch, I couldn't start jobs. Without the
> second, jobs never showed as completed or disappeared from the server. Do
> your jobs fail to start, or fail to exit? Is there any difference if you
> do a single-node job versus a multi-node job?
>
> If you turn logging up to 7 on the pbs_server and pbs_moms, is there
> anything interesting written to the logs?

Nothing that has stood out. We've got a ticket open with Adaptive that
they're working on. The failures are intermittent, and we see failures
in both starting jobs and exiting jobs. Based on what you describe, it's
likely a different problem altogether.

Darn; I was hoping the mystery was solved!

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From shantanugadgil at yahoo.com Thu Oct 25 07:06:47 2012
From: shantanugadgil at yahoo.com (Shantanu Gadgil)
Date: Thu, 25 Oct 2012 06:06:47 -0700 (PDT)
Subject: [torqueusers] Submit jobs as root using TORQUE 4.1.x
Message-ID: <1351170407.98299.YahooMailClassic@web141003.mail.bf1.yahoo.com>

Hi,

I know I have asked this before, and I was wondering whether this is
possible or has actually been removed from TORQUE 4.x?

Using CentOS 6.x as my server node, I am not able to submit jobs as
root. The steps available in the docs work with TORQUE v3.

Has the 'submit job as root' functionality actually been removed from
the 4.x series completely?
The last I knew, it was some issue with "the way localhost is resolved
by the call getaddrinfo".

Is there any workaround which I can apply on my CentOS 6 server to get
this working?

Regards,
Shantanu

From delphine.ramalingom at univ-reunion.fr Thu Oct 25 22:16:10 2012
From: delphine.ramalingom at univ-reunion.fr (Delphine Ramalingom)
Date: Fri, 26 Oct 2012 08:16:10 +0400
Subject: [torqueusers] epilogue
Message-ID: <508A0E8A.6040400@univ-reunion.fr>

Hello,

I put in place a simple epilogue that runs after each job executes, but
the problem is that it runs twice and I don't know why.

There is only one script:
-rwxr-xr-x 1 root root 314 17 oct. 15:06 epilogue

and it just lists the args of the job:

#!/bin/sh
echo "--------------------------"
echo "Epilogue Args:"
echo "Job ID: $1"
echo "User ID: $2"
echo "Group ID: $3"
echo "Job Name: $4"
echo "Session ID: $5"
echo "Resource List: $6"
echo "Resources Used: $7"
echo "Queue Name: $8"
echo "Account String: $9"
echo ""
echo "--------------------------"
exit 0

Can you help me please? My torque version is 4.0.2 and the scheduler is
maui.

Thanks a lot,
Delphine

From craig.tierney at noaa.gov Fri Oct 26 08:42:57 2012
From: craig.tierney at noaa.gov (Craig Tierney)
Date: Fri, 26 Oct 2012 08:42:57 -0600
Subject: [torqueusers] epilogue
In-Reply-To: <508A0E8A.6040400@univ-reunion.fr>
References: <508A0E8A.6040400@univ-reunion.fr>
Message-ID:

This is a known bug as of at least 4.1.1. It is supposedly being worked
on. I am hoping it is fixed in 4.1.3.

What I know another site does (and what I plan to do as well) is to put
a workaround in. In pseudo-code:

epifn=/tmp/epilog.ran.$PBS_JOBID
if [ -f "$epifn" ]; then
    echo "Epilogue already ran, exiting"
    exit
fi
touch "$epifn"
# rest of your epilog

I say it is pseudo-code because I haven't done this myself yet and I
don't know if the exit will mess up the return code of the job or not,
or whether $PBS_JOBID is actually defined or you need to use one of the
other arguments to the script to get that value.

Craig
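
A runnable variant of that guard, as a minimal sketch only. It assumes the job ID arrives as the first argument, as in the epilogue script shown earlier in this thread, and that a marker created with mkdir under a temporary directory is acceptable. EPILOGUE_MARKER_DIR and first_run_for_job are hypothetical names, not part of TORQUE, and whether an early exit affects the job's reported status is left open here just as it is above.

```shell
#!/bin/sh
# Sketch only: make an epilogue tolerate being invoked twice for one job.
# EPILOGUE_MARKER_DIR and first_run_for_job are hypothetical names.
EPILOGUE_MARKER_DIR=${EPILOGUE_MARKER_DIR:-/tmp}

# Returns 0 only on the first call for a given job ID, non-zero afterwards.
# mkdir either creates the marker or fails, so two invocations that race
# each other cannot both be treated as the first run.
first_run_for_job() {
    mkdir "$EPILOGUE_MARKER_DIR/epilogue.ran.$1" 2>/dev/null
}

# Typical use at the top of the epilogue, where $1 is the job ID:
#   first_run_for_job "$1" || exit 0
#   ... rest of the epilogue ...
```

The markers are never removed in this sketch, so a periodic cleanup of the marker directory would be needed on a busy node.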

From tegner at renget.se Sat Oct 27 22:09:44 2012
From: tegner at renget.se (Jon Tegner)
Date: Sun, 28 Oct 2012 05:09:44 +0100
Subject: [torqueusers] Submitting very short jobs
Message-ID: <508CB008.6070604@renget.se>

Hi,

we have for a long time used maui/torque, and it has worked really well
for our applications (mpi-jobs with many cores and execution times of
days/weeks).

But, we have other applications which instead consist of
hundreds/thousands of single core very short applications (a few
seconds).

Haven't tried yet, but I imagine that due to the overhead it would be
inefficient to use maui/torque in the same way as we have done
previously.

I'm sure I'm not the first one with this kind of problem, and am just
wondering if there are any best practices regarding this kind of
problem.

Thanks!

/jon

From yotama9 at gmail.com Mon Oct 29 07:54:08 2012
From: yotama9 at gmail.com (Yotam Avital)
Date: Mon, 29 Oct 2012 15:54:08 +0200
Subject: [torqueusers] no output on pbs jobs
Message-ID:

Hi All.

I'm trying to run my simulations on a pbs cluster. I can submit the job
but there is no output to the hard drive. My jobs are long (several
weeks) and I need to know if the progression is good or there might be a
bug somewhere. Also, the code should produce some files that I use to
relaunch the run in an advanced state. Searching the user manual didn't
produce any valuable information apart from that it's how pbs is
configured.

How can I get pbs to print output?

Thanks.

--
My other email account has a "professional" signature.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121029/5c1f6bb7/attachment.html

From jpeltier at sfu.ca Mon Oct 29 09:08:19 2012
From: jpeltier at sfu.ca (James A. Peltier)
Date: Mon, 29 Oct 2012 08:08:19 -0700 (PDT)
Subject: [torqueusers] Submitting very short jobs
In-Reply-To: <508CB008.6070604@renget.se>
Message-ID: <1586167756.83861380.1351523299311.JavaMail.root@jaguar10.sfu.ca>

We use Torque routing and Maui priorities to set a much higher priority
to jobs that run for short periods. This allows for those short running
jobs to get on more frequently. Add to that fairshare and we're able to
roughly balance out the jobs that are running.

--
James A. Peltier
Manager, IT Services - Research Computing Group
Simon Fraser University - Burnaby Campus
Phone   : 778-782-6573
Fax     : 778-782-3045
E-Mail  : jpeltier at sfu.ca
Website : http://www.sfu.ca/itservices
          http://blogs.sfu.ca/people/jpeltier

"The smartest people are constantly revising their understanding,
reconsidering a problem they thought they'd already solved.
They're open to new points of view, new information, new ideas,
contradictions, and challenges to their own way of thinking."
- Jeff Bezos

From akohlmey at cmm.chem.upenn.edu Mon Oct 29 09:36:41 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Mon, 29 Oct 2012 16:36:41 +0100
Subject: [torqueusers] Submitting very short jobs
In-Reply-To: <508CB008.6070604@renget.se>
References: <508CB008.6070604@renget.se>
Message-ID:

On Sun, Oct 28, 2012 at 5:09 AM, Jon Tegner wrote:
> Hi,
>
> we have for a long time used maui/torque, and it has worked really well
> for our applications (mpi-jobs with many cores and execution times of
> days/weeks).
>
> But, we have other applications which instead consist of
> hundreds/thousands of single core very short applications (a few seconds).
> Haven't tried yet, but I imagine that due to the overhead it would be
> inefficient to use maui/torque in the same way as we have done previously.

yes, it is *extremely* inefficient to submit and run jobs in torque/maui
that don't run for at least half an hour or so.

however, that is not a problem. since you are submitting a shell script,
there is nothing keeping you from submitting a shell script that
combines hundreds of these short jobs into a single submission.

> I'm sure I'm not the first one with this kind of problem, and am just
> wondering if there are any best practices regarding this kind of problem.

in many cases, it may be advantageous to write scripts that automate the
process of creating and submitting the job bundle script. if needed,
this can even be expanded into having jobs processed by a single
parallel job using a manager/worker setup and then reading a file with
the individual commands. we use such a (simple, self-written) tool for
cases where the individual calculations can take a varying amount of
time and thus we also have load balancing across the nodes (and to run
efficiently on machines where no single processor jobs are allowed).

axel.
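
The bundling idea described above can be sketched as one job script that reads a file with one command per line and runs the commands back to back. This is an illustration under stated assumptions, not the tool mentioned in the message: run_bundle, tasks.txt and the resource request are hypothetical, and the tasks run serially on a single core with no manager/worker load balancing.

```shell
#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
# Sketch only: run many short commands inside one batch job.
# run_bundle and tasks.txt are hypothetical names.

# Run every non-empty, non-comment line of the given file as a command,
# one after another. Prints the number of failed commands and returns
# non-zero if any command failed.
run_bundle() {
    failed=0
    while IFS= read -r cmd || [ -n "$cmd" ]; do
        case $cmd in
            ''|'#'*) continue ;;
        esac
        # stdin is redirected so a task cannot swallow the task list
        sh -c "$cmd" </dev/null || failed=$((failed + 1))
    done < "$1"
    echo "failed tasks: $failed"
    [ "$failed" -eq 0 ]
}

# Inside the submitted script one would call, for example:
#   cd "$PBS_O_WORKDIR" && run_bundle tasks.txt
```

A single qsub of such a script then pays the scheduling overhead once for the whole task list instead of once per few-second task.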

--
Dr. Axel Kohlmeyer  akohlmey at gmail.com  http://sites.google.com/site/akohlmey/
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From dbeer at adaptivecomputing.com Mon Oct 29 09:59:25 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 29 Oct 2012 09:59:25 -0600
Subject: [torqueusers] no output on pbs jobs
In-Reply-To:
References:
Message-ID:

Yotam,

If you have some kind of shared filesystem you may wish to look into the
mom config parameter $spool_as_final_name. This has the mom write its
output file directly to the location of the output file instead of
spooling. This is documented in Appendix C.

As far as finding the output for jobs that have already run - you may
want to check the spool directories on the mother superior for the job
(first node in the exec host list) and on the node that runs pbs_server.

Cheers,

David
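
As a sketch of enabling the parameter named above, the helper below appends the setting to a mom config file unless it is already present. The value "true" and the config path in the usage comment are assumptions to check against Appendix C and the local install, and enable_spool_as_final_name is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: add $spool_as_final_name to a pbs_mom config file once.
# enable_spool_as_final_name is a hypothetical helper; the value "true"
# is an assumption, check Appendix C for the exact syntax.
enable_spool_as_final_name() {
    cfg=$1
    # Append only if the parameter is not already present in the file.
    if ! grep -q '^\$spool_as_final_name' "$cfg" 2>/dev/null; then
        printf '%s\n' '$spool_as_final_name true' >> "$cfg"
    fi
}

# Example (the path is an assumption, adjust to the local install):
#   enable_spool_as_final_name /var/spool/torque/mom_priv/config
```

pbs_mom presumably has to be restarted or told to re-read its config before the change takes effect.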

--
David Beer | Senior Software Engineer
Adaptive Computing

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121029/aaebff58/attachment-0001.html

From dbeer at adaptivecomputing.com Mon Oct 29 17:44:25 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 29 Oct 2012 17:44:25 -0600
Subject: [torqueusers] 4.1.3 Is Now Released
Message-ID:

All,

TORQUE 4.1.3 is now released. This has more than 50 fixes against 4.1.2,
and is doing much better stability-wise in production environments (we
have several users that were running in production from 4.1-fixes until
4.1.3 came out).

Enjoy,

--
David Beer | Senior Software Engineer
Adaptive Computing

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121029/2340afe0/attachment.html

From yotama9 at gmail.com Tue Oct 30 01:59:31 2012
From: yotama9 at gmail.com (Yotam Avital)
Date: Tue, 30 Oct 2012 09:59:31 +0200
Subject: [torqueusers] torqueusers Digest, Vol 99, Issue 18
In-Reply-To:
References:
Message-ID:

Hi David,

I have found that the temp files (.o and .e) are sent to
/var/spool/PBS/spool, but the output has only the first few lines my
code prints. I think this may have something to do with flushing the
output. Could I be right?

Also, I didn't understand what I can do about the files that should be
generated by my code.

Thank you very much for your time.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121030/cfbd4e1e/attachment.html

From lukas.grossar at uni-graz.at Sun Oct 28 00:43:15 2012
From: lukas.grossar at uni-graz.at (Lukas Grossar)
Date: Sun, 28 Oct 2012 07:43:15 +0100
Subject: [torqueusers] Submitting very short jobs
In-Reply-To: <508CB008.6070604@renget.se>
References: <508CB008.6070604@renget.se>
Message-ID: <508CD403.7090403@uni-graz.at>

On 10/28/2012 05:09 AM, Jon Tegner wrote:
> Hi,
>
> we have for a long time used maui/torque, and it has worked really well
> for our applications (mpi-jobs with many cores and execution times of
> days/weeks).
>
> But, we have other applications which instead consist of
> hundreds/thousands of single core very short applications (a few seconds).
>
> Haven't tried yet, but I imagine that due to the overhead it would be
> inefficient to use maui/torque in the same way as we have done previously.

I have no idea about the overhead problem, but running short single-core
jobs in an environment of long multi-core jobs sounds like a good
application for the backfill algorithm in Maui.

http://www.adaptivecomputing.com/resources/docs/maui/8.2backfill.php

> I'm sure I'm not the first one with this kind of problem, and am just
> wondering if there are any best practices regarding this kind of problem.
>
> Thanks!
>
> /jon

Hope this helps!
Lukas

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121028/3637a0bf/attachment.html

From potapovweb at yahoo.com Mon Oct 29 16:08:13 2012
From: potapovweb at yahoo.com (Vladimir Potapov)
Date: Mon, 29 Oct 2012 18:08:13 -0400
Subject: [torqueusers] limit cpu cores available to a user
Message-ID: <508EFE4D.2030609@yahoo.com>

Hi all,

I am afraid this is an F.A.Q. question; unfortunately I could not find
an answer to it.

I use torque and maui. My cluster has 10 nodes, each node has 8 cores,
80 cores total. There is a single queue. I want to configure the queue
such that a single user can never use more than, let's say, 50 cores
simultaneously, irrespective of how the user requests them (multiple
jobs, array jobs, or just a single job). Is it possible?

Thanks,
Vladimir Potapov

From andre.gemuend at scai.fraunhofer.de Tue Oct 30 10:00:42 2012
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Tue, 30 Oct 2012 17:00:42 +0100 (CET)
Subject: [torqueusers] limit cpu cores available to a user
In-Reply-To: <508EFE4D.2030609@yahoo.com>
Message-ID: <56839188.15627430.1351612842767.JavaMail.root@scai.fraunhofer.de>

Hi Vladimir,

this is a job for maui. You can specify it with MAXJOB, MAXPROC, MAXPE,
etc., e.g.
USERCFG[default] MAXJOB=10 MAXPE=50 See here for more info: http://www.adaptivecomputing.com/resources/docs/maui/6.2throttlingpolicies.php hth Andre ----- Original Message ----- > Hi all, > > I am afraid this is a F.A.Q. question unfortunately I could not find > an > answer to it. > > I use torque and maui. My cluster has 10 nodes, each node has 8 > cores, > 80 cores total. There is a single queue. I want to configure the > queue > such that a single user can never use more than let's say 50 cores > simultaneously -- irrespective of how a user requests it (multiple > jobs, > array jobs, or just a single job). It is possible? > > Thanks, > Vladimir Potapov > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- André Gemünd Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemuend at scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend From john.hanks at usu.edu Wed Oct 31 08:10:19 2012 From: john.hanks at usu.edu (John Hanks) Date: Wed, 31 Oct 2012 08:10:19 -0600 Subject: [torqueusers] 4.1.3 Is Now Released In-Reply-To: References: Message-ID: I'm building this on CentOS 6.3 and ran into a problem with --enable-drmaa. When DRMAA is enabled, make install attempts to install the drmaa docs in /doc, ignoring the --prefix settings. I tracked this down to datarootdir missing from the Makefile.in files and got the install to work by adding it to Makefile.in with find . -name Makefile.in -exec sed -i 's/datadir = @datadir@/datadir = @datadir@\ndatarootdir = @datarootdir@/g' {} \; then running ./autogen.sh before the configure step. Thanks, jbh On Mon, Oct 29, 2012 at 5:44 PM, David Beer wrote: > All, > > TORQUE 4.1.3 is now released.
This has more than 50 fixes against 4.1.2, and > is doing much better stability-wise in production environments (we have > several users that were running in production from 4.1-fixes until 4.1.3 > came out). > > Enjoy, > > -- > David Beer | Senior Software Engineer > Adaptive Computing > From dbeer at adaptivecomputing.com Wed Oct 31 11:05:50 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 31 Oct 2012 11:05:50 -0600 Subject: [torqueusers] 4.1.3 Is Now Released In-Reply-To: References: Message-ID: John, Can you log this as a bug in our bugzilla so we can track it? (sometimes the email reports don't get followed up with) Our bugzilla is: http://www.clusterresources.com/bugzilla/ David On Wed, Oct 31, 2012 at 8:10 AM, John Hanks wrote: > I'm building this on Centos 6.3 and ran into a problem with > --enable-drmaa. When DRMAA is enabled, make install attempts to > install the drmaa docs in /doc, ignoring the --prefix settings. I > tracked this down to datarootdir missing from the Makefile.in files > and got the install to work by adding it to Makefile.in with > > find . -name Makefile.in -exec sed -i 's/datadir = @datadir@/datadir = > @datadir@\ndatarootdir = @datarootdir@/g' {} \; > > then running ./autogen.sh before the configure step. > > Thanks, > > jbh > > On Mon, Oct 29, 2012 at 5:44 PM, David Beer > wrote: > > All, > > > > TORQUE 4.1.3 is now released. This has more than 50 fixes against 4.1.2, > and > > is doing much better stability-wise in production environments (we have > > several users that were running in production from 4.1-fixes until 4.1.3 > > came out). 
> > > > Enjoy, > > > > -- > > David Beer | Senior Software Engineer > > Adaptive Computing > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121031/367cf261/attachment-0001.html From john.hanks at usu.edu Wed Oct 31 11:46:43 2012 From: john.hanks at usu.edu (John Hanks) Date: Wed, 31 Oct 2012 11:46:43 -0600 Subject: [torqueusers] 4.1.3 Is Now Released In-Reply-To: References: Message-ID: Bug 219 has been added. Along the way I tried building from an svn co of 4.1.3 and running autogen.sh there created working *.in files. I think this is an autoconf version issue and am assuming that tarball was created on a system with an older autoconf install. Link to bug is http://www.clusterresources.com/bugzilla/show_bug.cgi?id=219 jbh On Wed, Oct 31, 2012 at 11:05 AM, David Beer wrote: > John, > > Can you log this as a bug in our bugzilla so we can track it? (sometimes the > email reports don't get followed up with) Our bugzilla is: > http://www.clusterresources.com/bugzilla/ > > David > > On Wed, Oct 31, 2012 at 8:10 AM, John Hanks wrote: >> >> I'm building this on Centos 6.3 and ran into a problem with >> --enable-drmaa. When DRMAA is enabled, make install attempts to >> install the drmaa docs in /doc, ignoring the --prefix settings. I >> tracked this down to datarootdir missing from the Makefile.in files >> and got the install to work by adding it to Makefile.in with >> >> find . -name Makefile.in -exec sed -i 's/datadir = @datadir@/datadir = >> @datadir@\ndatarootdir = @datarootdir@/g' {} \; >> >> then running ./autogen.sh before the configure step. 
>> >> Thanks, >> >> jbh >> >> On Mon, Oct 29, 2012 at 5:44 PM, David Beer >> wrote: >> > All, >> > >> > TORQUE 4.1.3 is now released. This has more than 50 fixes against 4.1.2, >> > and >> > is doing much better stability-wise in production environments (we have >> > several users that were running in production from 4.1-fixes until 4.1.3 >> > came out). >> > >> > Enjoy, >> > >> > -- >> > David Beer | Senior Software Engineer >> > Adaptive Computing >> > >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > David Beer | Senior Software Engineer > Adaptive Computing > From dbeer at adaptivecomputing.com Wed Oct 31 11:48:43 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 31 Oct 2012 11:48:43 -0600 Subject: [torqueusers] 4.1.3 Is Now Released In-Reply-To: References: Message-ID: Thanks John. On Wed, Oct 31, 2012 at 11:46 AM, John Hanks wrote: > Bug 219 has been added. Along the way I tried building from an svn co > of 4.1.3 and running autogen.sh there created working *.in files. I > think this is an autoconf version issue and am assuming that tarball > was created on a system with an older autoconf install. Link to bug is > http://www.clusterresources.com/bugzilla/show_bug.cgi?id=219 > > jbh > > On Wed, Oct 31, 2012 at 11:05 AM, David Beer > wrote: > > John, > > > > Can you log this as a bug in our bugzilla so we can track it? (sometimes > the > > email reports don't get followed up with) Our bugzilla is: > > http://www.clusterresources.com/bugzilla/ > > > > David > > > > On Wed, Oct 31, 2012 at 8:10 AM, John Hanks wrote: > >> > >> I'm building this on Centos 6.3 and ran into a problem with > >> --enable-drmaa. When DRMAA is enabled, make install attempts to > >> install the drmaa docs in /doc, ignoring the --prefix settings. 
I > >> tracked this down to datarootdir missing from the Makefile.in files > >> and got the install to work by adding it to Makefile.in with > >> > >> find . -name Makefile.in -exec sed -i 's/datadir = @datadir@/datadir = > >> @datadir@\ndatarootdir = @datarootdir@/g' {} \; > >> > >> then running ./autogen.sh before the configure step. > >> > >> Thanks, > >> > >> jbh > >> > >> On Mon, Oct 29, 2012 at 5:44 PM, David Beer < > dbeer at adaptivecomputing.com> > >> wrote: > >> > All, > >> > > >> > TORQUE 4.1.3 is now released. This has more than 50 fixes against > 4.1.2, > >> > and > >> > is doing much better stability-wise in production environments (we > have > >> > several users that were running in production from 4.1-fixes until > 4.1.3 > >> > came out). > >> > > >> > Enjoy, > >> > > >> > -- > >> > David Beer | Senior Software Engineer > >> > Adaptive Computing > >> > > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > > > > > -- > > David Beer | Senior Software Engineer > > Adaptive Computing > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121031/ae47335e/attachment.html From mauede at alice.it Wed Oct 31 21:44:46 2012 From: mauede at alice.it (mauede at alice.it) Date: Thu, 1 Nov 2012 04:44:46 +0100 (CET) Subject: [torqueusers] how to browse the stdout and stderr files of a running batch job Message-ID: <13aba123705.mauede@alice.it> I submit long Monte Carlo simulations to the PBS scheduler. I redirect each simulation stdout and stderr to a uniquely named file. 
Unfortunately, as far as I know, PBS does not allow me to peek at such a file before the simulation has finished, whether it completed OK or aborted. Some supercomputer centers have developed commands, like qcat and qpeep, that allow monitoring number-crunching programs and browsing their stdout and stderr while they run. This feature helps save a lot of CPU time when the submitted jobs are not doing what is expected. Is a freeware implementation of qcat or qpeep available anywhere? Thank you in advance. Regards, maura -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20121101/cc98685b/attachment.html
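For readers with the same question: an earlier reply in this archive suggests looking in the spool directory on the mother superior (the first node in the exec host list). Under that advice, a qpeep-style helper might be sketched roughly as below. This is only an illustrative sketch, not an official TORQUE tool. It assumes that PBS_HOME is /var/spool/torque unless set otherwise, that spooled output is named <jobid>.OU and <jobid>.ER under $PBS_HOME/spool, and that the user may ssh to the compute node; the function names are made up for the example.

```shell
#!/bin/sh
# qpeep-style sketch (hypothetical helper, not shipped with TORQUE).
# Assumption: spool files live in $PBS_HOME/spool on the mother superior.
PBS_HOME="${PBS_HOME:-/var/spool/torque}"

# Read `qstat -f` output on stdin and print the first host of exec_host,
# i.e. the mother superior ("n1" from "exec_host = n1/0+n1/1+n2/0").
first_exec_host() {
    sed -n 's/^[[:space:]]*exec_host = \([^/+]*\).*/\1/p' | head -n 1
}

# Read `qstat -f` output on stdin and print the full job id
# ("1234.server" from "Job Id: 1234.server").
full_job_id() {
    sed -n 's/^Job Id: \(.*\)$/\1/p' | head -n 1
}

# Build the assumed spool file path; $1 = full job id, $2 = OU or ER.
spool_path() {
    printf '%s/spool/%s.%s\n' "$PBS_HOME" "$1" "$2"
}

# qpeep JOBID [LINES]: tail the spooled stdout and stderr of a running job.
qpeep() {
    info=$(qstat -f "$1") || return 1
    host=$(printf '%s\n' "$info" | first_exec_host)
    jobid=$(printf '%s\n' "$info" | full_job_id)
    if [ -z "$host" ]; then
        echo "qpeep: job $1 has no exec_host (not running?)" >&2
        return 1
    fi
    ssh "$host" tail -n "${2:-20}" \
        "$(spool_path "$jobid" OU)" "$(spool_path "$jobid" ER)"
}
```

After sourcing the file, `qpeep 1234 50` would show the last 50 lines of each stream for job 1234, provided the assumptions above hold on the cluster. If the job script itself redirects the program's output to a file on a shared filesystem, that file can simply be read directly while the job runs and no helper is needed.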