From JMRUSHTON at qinetiq.com Wed Jan 2 04:08:31 2013
From: JMRUSHTON at qinetiq.com (Rushton Martin)
Date: Wed, 2 Jan 2013 11:08:31 -0000
Subject: [torqueusers] UC multiple jobs: yes, no, maybe
Message-ID: <20130102112425.D2B0B11C84F9@http.supercluster.org>

Be careful with changing NODEACCESSPOLICY. We have had to set it to UNIQUEUSER (under MOAB), which only allows one job per user on a given node. The job epilogue cleans up shared memory structures based upon the username, not the process. We found this out the hard way when user jobs just stopped dead, due to clean-up from another job from the same user. The manual recommends UNIQUEUSER and we haven't had that particular problem since. I realise this is exactly what Jack's users do NOT want, and the problem won't occur for serial jobs, only for multiple parallel jobs on a given node. Caveat administrator!

Regards,
Martin

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of André Gemünd
Sent: 21 December 2012 07:25
To: Torque Users Mailing List
Subject: Re: [torqueusers] multiple jobs: yes, no, maybe

Hi Jack,

----- Original Message -----
> The user wanted to know if it was possible to have more than one job
> running on a node at a time? I honestly didn't have an answer for him.

It is possible. If you use Maui, you can set NODEACCESSPOLICY to shared (allow multiple jobs) or singleuser (allow multiple jobs from same user). That way you can allow one job per core.

The problem is usually misbehaving software. A job can request only one core, but still use the whole node. E.g. codes that use OpenMP will usually use the whole machine.
The first step against that would be to set the node availability computation to combined (respecting utilized resources instead of only "reserved" resources):
http://www.adaptivecomputing.com/resources/docs/maui/5.4nodeavailability.php

Then you can restrict the available resources of a job using CPUsets
(http://www.adaptivecomputing.com/resources/docs/torque/2-5-12/help.htm#topics/3-nodes/linuxCpusetSupport.htm).

> My feeling is that the whole purpose of the cluster is to give the
> power of a whole node to process a job and not have to share it.

Really depends on the workload. If you have many single core independent jobs (more HTC nature), you'll want to allow one job per core. You could set your queue resource default to allocate a whole node, but let users specifically request single cores, or even use a dedicated queue for that.

Greetings
André

--
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

This email and any attachments to it may be confidential and are intended solely for the use of the individual to whom it is addressed. If you are not the intended recipient of this email, you must neither take any action based upon its contents, nor copy or show it to anyone. Please contact the sender if you believe you have received this email in error. QinetiQ may monitor email traffic data and also the content of email for the purposes of security. QinetiQ Limited (Registered in England & Wales: Company Number: 3796233) Registered office: Cody Technology Park, Ively Road, Farnborough, Hampshire, GU14 0LX http://www.qinetiq.com.

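The two Maui settings mentioned in this thread (shared node access and combined node availability) might look like the fragment below. This is only a sketch under assumptions: the scratch file name is hypothetical, the fragment is written out for review rather than into a live maui.cfg, and UNIQUEUSER is left out because the thread describes it as a Moab setting.

```shell
# Sketch only: write the two Maui settings discussed above to a scratch
# file for review. The file name is a hypothetical default.
MAUI_SNIPPET="${MAUI_SNIPPET:-./maui.cfg.snippet}"

cat > "$MAUI_SNIPPET" <<'EOF'
# Allow several jobs to share a node.
NODEACCESSPOLICY        SHARED
# Take both dedicated (requested) and utilized resources into account
# when deciding whether a node still has room.
NODEAVAILABILITYPOLICY  COMBINED
EOF

echo "wrote $MAUI_SNIPPET; review it before merging into maui.cfg"
```

Whether SHARED is safe depends on the epilogue issue described at the top of the thread, so it is worth testing with two parallel jobs from the same user on one node first.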
From Ole.H.Nielsen at fysik.dtu.dk Wed Jan 2 07:48:49 2013
From: Ole.H.Nielsen at fysik.dtu.dk (Ole Holm Nielsen)
Date: Wed, 02 Jan 2013 15:48:49 +0100
Subject: [torqueusers] NHC check for mcelog errors
In-Reply-To: <20121130020310.GG8827@lbl.gov>
References: <50B372C0.3070300@fysik.dtu.dk> <20121127011930.GG8827@lbl.gov> <50B5F093.5060608@fysik.dtu.dk> <20121130020310.GG8827@lbl.gov>
Message-ID: <50E448D1.4000008@fysik.dtu.dk>

On 11/30/2012 03:03 AM, Michael Jennings wrote:
> I've added check_hw_mcelog to SVN. Feel free to download and give it
> a try. The SVN repository is at: https://warewulf.lbl.gov/svn/trunk/nhc
> Packages can be easily built from SVN via:
> $ ./autogen.sh && make distcheck && rpmbuild -ta warewulf-nhc-*.tar.gz
>
> The check is pretty simple at the moment. If mcelog --client returns
> anything (other than "Connection refused," meaning no daemon is
> running), the check fails. Otherwise, it passes.

We've been running NHC (SVN version) for a month now, and a few days ago one node had a Machine Check Exception (MCE) event which was reported in the NHC logfile as expected:

Running check: "check_hw_mcelog"
check_hw_mcelog(): MCEs detected: Memory errors
SOCKET 1 CHANNEL any DIMM any
corrected memory errors:
        1 total
        0 in 24h
uncorrected memory errors:
        0 total
        0 in 24h

SOCKET 1 CHANNEL 2 DIMM any
corrected memory errors:
        1 total
        1 in 24h
uncorrected memory errors:
        0 total
        0 in 24h
Health check failed: MCEs detected in log.
20121231 12:03:01 /usr/libexec/nhc/node-mark-offline g032 MCEs detected in log.
/usr/libexec/nhc/node-mark-offline: Marking job-exclusive g032 offline: NHC: MCEs detected in log.

This was probably a transient memory hardware error, but it seems that the NHC check_hw_mcelog() is doing its job correctly.

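The pass/fail rule quoted above can be sketched in a few lines of shell. This is an illustration only and not the actual NHC check_hw_mcelog() code: the function name mce_status is hypothetical, and the only error text assumed is the "Connection refused" string mentioned in the quote.

```shell
# mce_status: read the output of `mcelog --client` on stdin and decide.
# Hypothetical helper, not part of NHC.
mce_status() {
    out=$(cat)
    case "$out" in
        "")
            echo "pass"            # daemon running, nothing to report
            ;;
        *"Connection refused"*)
            echo "pass"            # no mcelog daemon running
            ;;
        *)
            echo "fail"            # anything else counts as an MCE report
            return 1
            ;;
    esac
}
```

On a node it would be fed from the real command, e.g. `mcelog --client 2>&1 | mce_status`.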
Best regards,
Ole

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark

From nlindberg at mkei.org Wed Jan 2 10:57:46 2013
From: nlindberg at mkei.org (Nick Lindberg)
Date: Wed, 2 Jan 2013 17:57:46 +0000
Subject: [torqueusers] MCM stat generation errors: TPC and IC
Message-ID:

Hello,

I've started to see some errors occur while trying to generate reports using the MCM in regards to system utilization. The specific error from the log is:

java.lang.IllegalStateException: Size of TPC sequence (15601) returned by Moab Workload Manager not same size as IC (17574)

I am running MCM 7.1 on a Mac. Moab: 7.1 as well.

Has anybody seen this before or know how to fix it? I need to be able to generate these reports for our institutional users as accounting measures. I can run the command just fine on the head node, indicating to me that it's a problem more with some preconceived notion MCM has about what Moab should return.

Here is the full output of the log file:

02 Jan 2013 11:52:06 INFO [Credential Based Chart Generator] com.moab.api.connect.MoabConnection: Now running '/opt/moab/bin/mcredctl -q profile user:pliu --format=xml --timeout=00:10:00 -o time:1338654600,1354470300,types:TPSD,IC,TPC'
02 Jan 2013 11:52:07 DEBUG [Credential Based Chart Generator] com.moab.api.global.XMLDebuggingTools: First 2500 chars returned for '/opt/moab/bin/mcredctl -q profile user:nlindberg --format=xml --timeout=00:10:00 -o time:1338654600,1354470300,types:TPSD,IC,TPC'

> check_hw_mcelog(): MCEs detected: Memory errors
> SOCKET 1 CHANNEL any DIMM any
> corrected memory errors:
>         1 total
>         0 in 24h
> uncorrected memory errors:
>         0 total
>         0 in 24h
>
> SOCKET 1 CHANNEL 2 DIMM any
> corrected memory errors:
>         1 total
>         1 in 24h
> uncorrected memory errors:
>         0 total
>         0 in 24h
> Health check failed: MCEs detected in log.
> 20121231 12:03:01 /usr/libexec/nhc/node-mark-offline g032 MCEs
> detected in log.
> /usr/libexec/nhc/node-mark-offline: Marking job-exclusive g032
> offline: NHC: MCEs detected in log.
>
> This was probably a transient memory hardware error, but it seems
> that the NHC check_hw_mcelog() is doing its job correctly.

Yay! Thanks for the feedback. :)

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From samuel at unimelb.edu.au Thu Jan 3 16:35:00 2013
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Fri, 04 Jan 2013 10:35:00 +1100
Subject: [torqueusers] NHC check for mcelog errors
In-Reply-To: <50B372C0.3070300@fysik.dtu.dk>
References: <50B372C0.3070300@fysik.dtu.dk>
Message-ID: <50E615A4.6000402@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 27/11/12 00:46, Ole Holm Nielsen wrote:

> At SC12 I suggested to you to add a check to Node Health Check to
> inquire the mcelog daemon for hardware errors and offline sick
> nodes in Torque.

It's worth noting that our SGI hardware (rebadged SuperMicro boxes, dual socket quad core Nehalem) runs a different MCE code (called memlog) to the standard RHEL/CentOS one as their code can pull more info out. That reports to /var/log/memlog/memlog.log.

Also our IBM iDataplex nodes log MCEs via their IMM/IPMI/BMC/RSA/whatever-its-called-today controller and so we monitor for those by parsing the output of "ipmitool sel elist".

cheers,
Chris
- --
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iEYEARECAAYFAlDmFaQACgkQO2KABBYQAh+6iwCZAZl63YkK5xDTwV65qOQi3+xG
u/EAn2qbc7TF0rLn2l0PtwxEXiYMJiAM
=T116
-----END PGP SIGNATURE-----

From sternberg at anl.gov Fri Jan 4 15:08:16 2013
From: sternberg at anl.gov (Michael Sternberg)
Date: Fri, 4 Jan 2013 16:08:16 -0600
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To:
References: <20121210192541.GM8827@lbl.gov> <20121210213613.GN8827@lbl.gov>
Message-ID: <9BB8F931-7D31-4C37-B68C-04191B237265@anl.gov>

I just encountered the same problem (CentOS-5.8) for torque-4.1.4:

Tracing "bash -x ./configure", I found that the script needs to be told to actually *use* pkg-config, otherwise one gets numerous occasions of "pkg_failed=untried". Thus:

        PKG_CONFIG=pkg-config ./configure ...

With best regards,
--
Michael Sternberg
Argonne National Laboratory

On 2012-12-13, at 16:30 , "Andrus, Brian Contractor" wrote:

> Yep, the default locations for the hwloc libs are /usr/include and /usr/lib and /usr/lib64
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
>
>
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
> Sent: Thursday, December 13, 2012 1:51 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] hwloc error torque 4.1.3
>
> Have you made sure you are specifying --with-hwloc-path correctly? /usr/ seems unlikely to be correct. Note the comment that configure adds:
>
> Specifies the path to the hwloc libraries and include files.
> Example: ./configure --with-hwloc-path=/usr/local/hwloc-1.1
> Will specify that the include files are in /usr/local/hwloc-1.1/include and
> the libraries are in /usr/local/hwloc-1.1/lib],
>
> So specifying /usr would mean that
>
> /usr/include contains the hwloc headers and /usr/lib/ contains the library.
>
> David

From ianm at uchicago.edu Sun Jan 6 16:24:03 2013
From: ianm at uchicago.edu (Ian Miller)
Date: Sun, 6 Jan 2013 23:24:03 +0000
Subject: [torqueusers] Strange error with node connection to outside NIC
In-Reply-To: <9BB8F931-7D31-4C37-B68C-04191B237265@anl.gov>
Message-ID: <843FE493E7B6CA42A6C4682D63AE2D951179BD25@XM-MBX-02-PROD.ad.uchicago.edu>

Hi all,

I just ran into this problem and I'm having trouble resolving it. The client seems to want to connect to the outside NIC of our head node. In "server_name" I have listed the internal NIC name of the head node, and there is an entry in the local hosts file for the internal NIC of the head node. There is nothing listed in DNS for the internal NIC IP, which is a 192.168. address. This has been working without problems for a while and I'm not sure what changed, but at this point I really don't care. I'm looking for a fix. The error from the client logs is below. Any thoughts? Thanks for any help.

pbs_mom: LOG_ERROR::No route to host (113) in TMomFinalizeChild, cannot open interactive qsub socket to host external.domain.edu:44558 - 'cannot connect to port 777 in client_to_svr - errno:113 No route to host' - check routing tables/multi-homed host issues

Ian Miller
Research Computing Administrator
ianm at uchicago.edu

On 1/4/13 4:08 PM, "Michael Sternberg" wrote:

>I just encountered the same problem (CentOS-5.8) for torque-4.1.4:
>
>Tracing "bash -x ./configure", I found that the script needs to be told
>to actually *use* pkg-config, otherwise one gets numerous occasions of
>"pkg_failed=untried". Thus:
>
>        PKG_CONFIG=pkg-config ./configure ...
>
>
>With best regards,
>--
>Michael Sternberg
>Argonne National Laboratory
>
>
>On 2012-12-13, at 16:30 , "Andrus, Brian Contractor"
>wrote:
>
>> Yep, the default locations for the hwloc libs are /usr/include and
>>/usr/lib and /usr/lib64
>>
>>
>> Brian Andrus
>> ITACS/Research Computing
>> Naval Postgraduate School
>> Monterey, California
>> voice: 831-656-6238
>>
>>
>>
>> From: torqueusers-bounces at supercluster.org
>>[mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
>> Sent: Thursday, December 13, 2012 1:51 PM
>> To: Torque Users Mailing List
>> Subject: Re: [torqueusers] hwloc error torque 4.1.3
>>
>> Have you made sure you are specifying --with-hwloc-path correctly?
>>/usr/ seems unlikely to be correct. Note the comment that configure adds:
>>
>> Specifies the path to the hwloc libraries and include files.
>> Example: ./configure
>>--with-hwloc-path=/usr/local/hwloc-1.1
>> Will specify that the include files are in
>>/usr/local/hwloc-1.1/include and
>> the libraries are in /usr/local/hwloc-1.1/lib],
>>
>> So specifying /usr would mean that
>>
>> /usr/include contains the hwloc headers and /usr/lib/ contains the
>>library.
>>
>> David

From cwest at vpac.org Sun Jan 6 22:55:20 2013
From: cwest at vpac.org (Craig West)
Date: Mon, 07 Jan 2013 16:55:20 +1100
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To: <9BB8F931-7D31-4C37-B68C-04191B237265@anl.gov>
References: <20121210192541.GM8827@lbl.gov> <20121210213613.GN8827@lbl.gov> <9BB8F931-7D31-4C37-B68C-04191B237265@anl.gov>
Message-ID: <50EA6348.6060101@vpac.org>

Michael,

> I just encountered the same problem (CentOS-5.8) for torque-4.1.4:

We have seen the same issue for CentOS 6.3, and had previously followed the instructions given in the output - which did work.

"If you have done these and still get this error, try running ./autogen.sh and then configuring again."

> Tracing "bash -x ./configure", I found that the script needs to be
> told to actually *use* pkg-config, otherwise one gets numerous
> occasions of "pkg_failed=untried". Thus:
>
> PKG_CONFIG=pkg-config ./configure ...

I tested your method above and it was able to run the configure (followed by a make) without the need for the autogen.

This seems strange, and something that was not happening with torque 4.1.3. Has anyone raised a bug about this? Does Adaptive have an internal bug?

Thanks,
Craig.

From dbeer at adaptivecomputing.com Mon Jan 7 10:37:04 2013
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 7 Jan 2013 10:37:04 -0700
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To: <50EA6348.6060101@vpac.org>
References: <20121210192541.GM8827@lbl.gov> <20121210213613.GN8827@lbl.gov> <9BB8F931-7D31-4C37-B68C-04191B237265@anl.gov> <50EA6348.6060101@vpac.org>
Message-ID:

On Sun, Jan 6, 2013 at 10:55 PM, Craig West wrote:

>
> Michael,
>
> > I just encountered the same problem (CentOS-5.8) for torque-4.1.4:
>
> We have seen the same issue for Centos 6.3, and had previously followed
> the instructions given in the output - which did work.
>
> "If you have done these and still get this error, try running
> ./autogen.sh and then configuring again."
>
> > Tracing "bash -x ./configure", I found that the script needs to be
> > told to actually *use* pkg-config, otherwise one gets numerous
> > occasions of "pkg_failed=untried". Thus:
> >
> > PKG_CONFIG=pkg-config ./configure ...
>
> I tested your method above and it was able to run the configure
> (followed by a make) without the need for the autogen.
>
> This seems strange, and something that was not happening with torque
> 4.1.3. Has anyone raised a bug about this? Does Adaptive have an
> internal bug?
>

Adaptive doesn't have an internal bug yet but we'll find a way to fix this in configure.ac.
I'm guessing you can add a line like:

PKG_CONFIG=pkg-config

to the top of configure.ac to fix this. I'm not sure yet and I haven't had any time to look into it.

David

> Thanks,
> Craig.

--
David Beer | Senior Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130107/27a4c9ee/attachment.html

From sternberg at anl.gov Mon Jan 7 16:30:14 2013
From: sternberg at anl.gov (Michael Sternberg)
Date: Mon, 7 Jan 2013 17:30:14 -0600
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To:
References: <20121210192541.GM8827@lbl.gov> <20121210213613.GN8827@lbl.gov> <9BB8F931-7D31-4C37-B68C-04191B237265@anl.gov> <50EA6348.6060101@vpac.org>
Message-ID:

Craig, David,

On 2013-01-07, at 11:37 , David Beer wrote:
> On Sun, Jan 6, 2013 at 10:55 PM, Craig West wrote:
>> > Tracing "bash -x ./configure", I found that the script needs to be
>> > told to actually *use* pkg-config, otherwise one gets numerous
>> > occasions of "pkg_failed=untried". Thus:
>> >
>> > PKG_CONFIG=pkg-config ./configure ...
>>
>> I tested your method above and it was able to run the configure (followed by a make) without the need for the autogen.
>>
>> This seems strange, and something that was not happening with torque 4.1.3. Has anyone raised a bug about this? Does Adaptive have an internal bug?
>
> Adaptive doesn't have an internal bug yet but we'll find a way to fix this in configure.ac. I'm guessing you can add a line like:
> PKG_CONFIG=pkg-config
> to the top of configure.ac to fix this. I'm not sure yet and I haven't had any time to look into it.

Great - Thanks! I actually ended up inserting the PKG_CONFIG line right below CXXFLAGS in the spec file (which I use to get local rpms).

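The workaround from this thread can be wrapped in a small shell function. This is only a sketch: run_configure and its source-directory argument are hypothetical, and the only part taken from the thread is setting PKG_CONFIG=pkg-config in the environment of ./configure.

```shell
# run_configure SRCDIR [configure options...]
# Hypothetical wrapper: refuse to run if pkg-config is not on PATH,
# otherwise hand its name to configure via the PKG_CONFIG variable.
run_configure() {
    srcdir=$1
    shift
    if ! command -v pkg-config >/dev/null 2>&1; then
        echo "pkg-config not found on PATH" >&2
        return 1
    fi
    # Subshell so the caller's working directory is left alone.
    ( cd "$srcdir" && PKG_CONFIG=pkg-config ./configure "$@" )
}
```

For example, `run_configure ./torque-4.1.4 --with-hwloc-path=/usr`, where the directory name is a placeholder for wherever the tarball was unpacked.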
Michael

From Ole.H.Nielsen at fysik.dtu.dk Tue Jan 8 01:05:11 2013
From: Ole.H.Nielsen at fysik.dtu.dk (Ole Holm Nielsen)
Date: Tue, 08 Jan 2013 09:05:11 +0100
Subject: [torqueusers] NHC check for mcelog errors (Christopher Samuel)
Message-ID: <50EBD337.2020906@fysik.dtu.dk>

Christopher Samuel wrote:
> It's worth noting that our SGI hardware (rebadged SuperMicro boxes,
> dual socket quad core Nehalem) runs a different MCE code (called
> memlog) to the standard RHEL/CentOS one as their code can pull more
> info out. That reports to /var/log/memlog/memlog.log.

Sounds interesting. I googled for memlogd and apparently it reports solely memory errors, which is certainly a very important check. Perhaps Michael could be persuaded to write yet another Node Health Check for memlog, but he would have to check for the addition of new errors to the cumulative logfile.

> Also our IBM iDataplex nodes log MCEs via their
> IMM/IPMI/BMC/RSA/whatever-its-called-today controller and so we
> monitor for those by parsing the output of "ipmitool sel elist".

The BMC error log is worth checking whenever a node is really broken, but I doubt whether it makes sense for running frequently as an NHC check for these reasons:

* You need to start the IPMI service ("service ipmi start" on RHEL6), and in my experience with Intel systems this always incurs an unwarranted additional CPU load of 1.0 :-(

* The time it takes to start IPMI, run "ipmitool sel elist", then stop IPMI is substantial (several seconds), hence it may not be suitable as an NHC check which should be as quick as possible.

* The IPMI error log is cumulative, so one would have to look for changes. Also, some BMCs do not seem to have reliable time/date, making timestamps unreliable.

My 2 cents worth...

Best regards,
Ole

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark

From Ole.H.Nielsen at fysik.dtu.dk Tue Jan 8 02:22:37 2013
From: Ole.H.Nielsen at fysik.dtu.dk (Ole Holm Nielsen)
Date: Tue, 08 Jan 2013 10:22:37 +0100
Subject: [torqueusers] SELinux warning when pbs_server sends E-mail message
Message-ID: <50EBE55D.3060300@fysik.dtu.dk>

We're running Torque 2.3.7 on a central Torque server running RHEL6.3 OS (this old version of Torque is *required* for stable use with the Maui scheduler, see an older thread in this list).

We're seeing the following syslog message every time a job completes and sends an E-mail message to the user:

setroubleshoot: SELinux is preventing /usr/sbin/sendmail.sendmail from write access on the file /var/spool/torque/server_priv/server.lock.

SELinux is enabled in permissive mode, so this is not a severe problem, but it's still a nuisance to have extraneous syslog messages. I prefer having SELinux enabled in order to log security related events.

I looked at the Torque code server/svr_mail.c which opens a pipe to execute Sendmail, writes some data and then closes the pipe. The pbs_server's lockfile filename is never written to the Sendmail pipe, so why on earth would SELinux complain about Sendmail trying to write to that lockfile?? Could it be because svr_mail.c closes the pipe by fclose(outmail) instead of pclose(outmail) as is done in the Torque 2.5 code?

Question: Anyone running a Torque pbs_server with SELinux enabled, do you also see SELinux warnings like the above one? What's your Torque version?

Thanks,
Ole

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark

From samuel at unimelb.edu.au Tue Jan 8 23:55:36 2013
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Wed, 09 Jan 2013 17:55:36 +1100
Subject: [torqueusers] NHC check for mcelog errors (Christopher Samuel)
In-Reply-To: <50EBD337.2020906@fysik.dtu.dk>
References: <50EBD337.2020906@fysik.dtu.dk>
Message-ID: <50ED1468.5090105@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Ole,

On 08/01/13 19:05, Ole Holm Nielsen wrote:

> The BMC error log is worth checking whenever a node is really
> broken, but I doubt whether it makes sense for running frequently
> as an NHC check for these reasons:
>
> * You need to start the IPMI service ("service ipmi start" on
> RHEL6), and in my experience with Intel systems this always incurs
> an unwarranted additional CPU load of 1.0 :-(

We're running RHEL 5 so it doesn't do that. It's worth checking though if it's actually consuming CPU or just sitting in a device wait. If it's just in a device wait then it won't have any impact.

> * The time it takes to start IPMI, run "ipmitool sel elist", then
> stop IPMI is substantial (several seconds), hence it may not be
> suitable as an NHC check which should be as quick as possible.

I think that depends on how often you run them, we run ours every 15 minutes and they write to a file in /dev/shm. The check that pbs_mom runs just cats that file (or produces an error if it's not present).

> * The IPMI error log is cumulative, so one would have to look for
> changes.

We clean our IPMI logs after the node has been fixed.

> Also, some BMCs do not seem to have reliable time/date, making
> timestamps unreliable.

Because our scripts log to syslog when they find a problem then we should know to a 15 minute window when the problem occurred.

> My 2 cents worth...

I think it's worth a lot more than that, they're all good points that need to be considered!

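A minimal sketch of the cached arrangement described above, assuming a cron job refreshes a file under /dev/shm and the pbs_mom health check only reads it. The cache file name, the SEL_CMD override and both function names are hypothetical and are not taken from any site's actual scripts.

```shell
# Hypothetical defaults; both can be overridden from the environment.
SEL_CACHE="${SEL_CACHE:-/dev/shm/ipmi_sel.cache}"
SEL_CMD="${SEL_CMD:-ipmitool sel elist}"

# Slow part, meant for cron (e.g. every 15 minutes): query the BMC and
# replace the cache only if the query succeeded.
sel_cache_update() {
    tmp="$SEL_CACHE.$$"
    if $SEL_CMD > "$tmp" 2>/dev/null; then
        mv "$tmp" "$SEL_CACHE"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Fast part, meant for the health check: never talks to the BMC, just
# prints the cached log or reports that the cache is missing.
sel_cache_check() {
    if [ ! -f "$SEL_CACHE" ]; then
        echo "ERROR: $SEL_CACHE missing"
        return 1
    fi
    cat "$SEL_CACHE"
}
```

Splitting the work this way keeps the several-second ipmitool call out of the health check path, which is the concern raised in the quoted message.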
cheers,
Chris
- --
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iEYEARECAAYFAlDtFGgACgkQO2KABBYQAh+WOgCfS7nfvcmt1zM+dR/jteLPh2Sj
8REAniabf6Lzom+RlTvVUT1HUSzACrzq
=U/Bq
-----END PGP SIGNATURE-----

From Ole.H.Nielsen at fysik.dtu.dk Wed Jan 9 01:00:20 2013
From: Ole.H.Nielsen at fysik.dtu.dk (Ole Holm Nielsen)
Date: Wed, 09 Jan 2013 09:00:20 +0100
Subject: [torqueusers] NHC check for mcelog errors (Christopher Samuel)
In-Reply-To: <50ED1468.5090105@unimelb.edu.au>
References: <50ED1468.5090105@unimelb.edu.au>
Message-ID: <50ED2394.5010106@fysik.dtu.dk>

Christopher Samuel wrote:
>> The BMC error log is worth checking whenever a node is really
>> broken, but I doubt whether it makes sense for running frequently
>> as an NHC check for these reasons:
>>
>> * You need to start the IPMI service ("service ipmi start" on
>> RHEL6), and in my experience with Intel systems this always incurs
>> an unwarranted additional CPU load of 1.0 :-(

> We're running RHEL 5 so it doesn't do that. It's worth checking
> though if it's actually consuming CPU or just sitting in a device
> wait. If it's just in a device wait then it won't have any impact.

We see the extra CPU load on both CentOS5 and CentOS6 nodes. How does one determine if a CPU load is "harmless"? We're also seeing extra CPU loads of about 0.5 on our new nodes with QDR Infiniband adapters, perhaps that's also due to device waits?

We check the compute nodes' CPU load in order to identify badly behaving jobs, so even "harmless" CPU loads disturb our monitoring.

>> * The time it takes to start IPMI, run "ipmitool sel elist", then
>> stop IPMI is substantial (several seconds), hence it may not be
>> suitable as an NHC check which should be as quick as possible.
>
> I think that depends on how often you run them, we run ours every 15
> minutes and they write to a file in /dev/shm. The check that pbs_mom
> runs just cats that file (or produces an error if it's not present).

Interesting! Perhaps you could share your scripts and setup with the Torque community? Do you see any job performance impacts due to running IPMI commands on busy nodes?

>> * The IPMI error log is cumulative, so one would have to look for
>> changes.
>
> We clean our IPMI logs after the node has been fixed.

Yes, but our (older) BMCs log a lot of irrelevant stuff which does not warrant the offlining of the nodes. How would one identify genuinely broken hardware from the BMC error logs?

>> Also, some BMCs do not seem to have reliable time/date, making
>> timestamps unreliable.
>
> Because our scripts log to syslog when they find a problem then we
> should know to a 15 minute window when the problem occurred.

Yes, this sounds like a good method.

Thanks,
Ole

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark

From jonathan.barber at gmail.com Wed Jan 9 03:24:44 2013
From: jonathan.barber at gmail.com (Jonathan Barber)
Date: Wed, 9 Jan 2013 10:24:44 +0000
Subject: [torqueusers] NHC check for mcelog errors (Christopher Samuel)
In-Reply-To: <50ED2394.5010106@fysik.dtu.dk>
References: <50ED1468.5090105@unimelb.edu.au> <50ED2394.5010106@fysik.dtu.dk>
Message-ID:

On 9 January 2013 08:00, Ole Holm Nielsen wrote:
> Christopher Samuel wrote:
>>> The BMC error log is worth checking whenever a node is really
>>> broken, but I doubt whether it makes sense for running frequently
>>> as an NHC check for these reasons:
>>>
>>> * You need to start the IPMI service ("service ipmi start" on
>>> RHEL6), and in my experience with Intel systems this always incurs
>>> an unwarranted additional CPU load of 1.0 :-(
>
>> We're running RHEL 5 so it doesn't do that. It's worth checking
>> though if it's actually consuming CPU or just sitting in a device
>> wait. If it's just in a device wait then it won't have any impact.
>
> We see the extra CPU load on both CentOS5 and CentOS6 nodes. How does
> one determine if a CPU load is "harmless"? We're also seeing extra CPU
> loads of about 0.5 on our new nodes with QDR Infiniband adapters,
> perhaps that's also due to device waits?

Pedantic point: it's "load" and not "CPU load", because it reflects the number of processes waiting to get a run slot and not CPU usage.

From RHEL6 proc(5), load is:

... number of jobs in the run queue (state R) or waiting for disk I/O (state D)

To see what exactly the processes are stuck in, check the WCHAN status of the process:

$ ps -e -w -o pid,pcpu,rss,vsize,state,cmd,wchan

Whether it's harmless or not really depends on whether it affects your jobs... If it doesn't, it's harmless!

> We check the compute nodes' CPU load in order to identify badly behaving
> jobs, so even "harmless" CPU loads disturb our monitoring.

If your nodes have a higher basal load, why not just increase your alerting threshold?

If you don't want any additional load from the IPMI modules, then you can do IPMI-over-LAN from a central monitoring host (assuming you have configured it and the node and the interfaces are connected). This communicates directly with the BMC and thus has no effect on the compute node.

You can normally configure the IPMI LAN interface with the ipmitool command:

# ipmitool lan print
...
# ipmitool lan set 2 ipaddr 192.168.1.1

See ipmitool(1) for the details.

I think it's also worth checking the IPMI SDR in case something doesn't turn up in the SEL:

# ipmitool sdr list

Cheers
--
Jonathan Barber

From pc at pcable.net Tue Jan 8 16:20:21 2013
From: pc at pcable.net (Patrick Cable)
Date: Tue, 8 Jan 2013 18:20:21 -0500
Subject: [torqueusers] X11 Forwarding issue in 4.1.3
Message-ID:

Hi,

I've got a master node with 4 compute nodes. All have xterm and xauth installed. If I 'ssh -X master' then from master run 'ssh -X node001' I can successfully get a window back to my desktop.

However, if I 'ssh -X master' then 'qsub -X -I' - no dice. My $DISPLAY is set on master, but on node001 it's empty:

interceptor:~ cable$ ssh -X cadgrid
Last login: Tue Jan 8 18:10:22 2013 from
[1818][cable at cadgrid:~]$ echo $DISPLAY
localhost:14.0
[1818][cable at cadgrid:~]$ qsub -X -I
qsub: waiting for job 149.cadgrid.asdf to start
qsub: job 149.cadgrid.asdf ready

[1818][cable at cadgrid004:~]$ echo $DISPLAY

[1818][cable at cadgrid004:~]$ logout

qsub: job 149.cadgrid.asdf completed

I've tried increasing the pbs_mom debug logs to see if I could find anything, but had no luck. Any pointers on where else I should look?

Thanks,
- Patrick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130108/f9363da2/attachment.html

From roy.dragseth at cc.uit.no Fri Jan 11 02:48:39 2013
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Fri, 11 Jan 2013 10:48:39 +0100
Subject: [torqueusers] Newly added compute nodes get offline state in torque 4.1.4.
Message-ID: <1958797.tcJsvXzOJW@newton.cc.uit.no>

When I'm adding new compute nodes into the pool they initially get the offline state, which needs to be cleared manually. Is this intended behaviour? It didn't use to be that way earlier (i.e. Torque versions 2 and 3).

[root at hpc ~]# pbsnodes compute-0-2
compute-0-2
     state = free
     np = 2
     ntype = cluster
     status = rectime=1357897229,varattr=,jobs=,state=free,netload=331323,gres=,loadave=0.00,ncpus=2,physmem=2054724kb,availmem=2966412kb,totmem=3078716kb,idletime=528,nusers=0,nsessions=0,uname=Linux compute-0-2.local 2.6.32-279.14.1.el6.x86_64 #1 SMP Tue Nov 6 23:43:09 UTC 2012 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

[root at hpc ~]# qmgr -c "delete node compute-0-2"
[root at hpc ~]# qmgr -c "create node compute-0-2 np=2,ntype=cluster"
[root at hpc ~]# pbsnodes -l compute-0-2
compute-0-2          offline

If this is intended behaviour I need to add some logic to my torque-roll for Rocks to clear the offline state automatically after insertion of new nodes.

r.

--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no

From j.blank at fz-juelich.de Fri Jan 11 05:31:03 2013
From: j.blank at fz-juelich.de (Joerg Blank)
Date: Fri, 11 Jan 2013 13:31:03 +0100
Subject: [torqueusers] Torque 4.1.4: Running jobs discrepancy
Message-ID:

Hello everyone,

I recently upgraded to Torque 4.1.4 and got the following problem.
There seems to be a mix-up regarding running jobs:

Please note the discrepancy between "jobs" and "status/jobs"

c-22
     state = job-exclusive
     np = 8
     properties = barcelona,bigmem
     ntype = cluster
     jobs = 0/29938[34].cluster, 1/29938[36].cluster, 2/29938[30].cluster, 3/29938[38].cluster, 4/29938[2].cluster, 5/29938[3].cluster, 6/29938[37].cluster, 7/29938[4].cluster
     status = rectime=1357906856,varattr=,jobs=29938[30].cluster,state=free,netload=360080362410,gres=,loadave=1.01,ncpus=8,physmem=66180608kb,availmem=128250632kb,totmem=133289468kb,idletime=411887,nusers=1,nsessions=1,sessions=9653,uname=Linux c-22 3.7.0.20121211 #4 SMP Tue Dec 11 20:52:45 CET 2012 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

Regards,
Jörg Blank

From dbeer at adaptivecomputing.com Fri Jan 11 10:17:42 2013
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 11 Jan 2013 10:17:42 -0700
Subject: [torqueusers] Torque 4.1.4: Running jobs discrepancy
In-Reply-To:
References:
Message-ID:

Joerg,

There are several possible reasons for this. For example, when jobs are started on a node, the server immediately records the jobs as being there, but the mom's status update is only sent every 45 seconds by default (some clusters increase this interval). It is therefore entirely possible for there to be a period in which the server has not yet received an update from the mom since the jobs were started.

If you don't think this is due to the scenario described above, can you provide some more details of what happens to get into this state? How long does this state persist? Does it get cleaned up? Do you have messages about rejected job obituaries in the server logs?

David

On Fri, Jan 11, 2013 at 5:31 AM, Joerg Blank wrote:

> Hello everyone,
>
> I recently upgraded to Torque 4.1.4 and got the following problem.
There > seems to be a mix up regarding running jobs: > > Please note the discrepancy between "jobs" and "status/jobs" > > c-22 > state = job-exclusive > np = 8 > properties = barcelona,bigmem > ntype = cluster > jobs = 0/29938[34].cluster, 1/29938[36].cluster, > 2/29938[30].cluster, 3/29938[38].cluster, 4/29938[2].cluster, > 5/29938[3].cluster, 6/29938[37].cluster, 7/29938[4].cluster > status = > > rectime=1357906856,varattr=,jobs=29938[30].cluster,state=free,netload=360080362410,gres=,loadave=1.01,ncpus=8,physmem=66180608kb,availmem=128250632kb,totmem=133289468kb,idletime=411887,nusers=1,nsessions=1,sessions=9653,uname=Linux > c-22 3.7.0.20121211 #4 SMP Tue Dec 11 20:52:45 CET 2012 x86_64,opsys=linux > mom_service_port = 15002 > mom_manager_port = 15003 > gpus = 0 > > Regards, > J?rg Blank > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130111/1d6c49d5/attachment-0001.html From j.blank at fz-juelich.de Fri Jan 11 11:49:43 2013 From: j.blank at fz-juelich.de (Joerg Blank) Date: Fri, 11 Jan 2013 19:49:43 +0100 Subject: [torqueusers] Torque 4.1.4: Running jobs discrepancy In-Reply-To: References: Message-ID: Am 11.01.2013 18:17, schrieb David Beer: > If you don't think this is due to the above-described scenario, can you > provide some more details of what happens to get into this state? How > long does this state persist? Does it get cleaned up? Do you have > messages about rejected job obituaries in the server logs? I think I cleaned that up by fixing the job numbers in serverdb leftover bugged from before the 2.5.x upgrade. 
This happened on an array job with the max-run parameter "-t 0-100%20".

Those jobs did not run (I checked on the node), but they used a slot in some functions (and did not show up in others). There were 20 jobs not on hold, but whenever one was scheduled, it could not start because of the 20-job limit. This created another ghost job.

Regards,
Jörg Blank

From j.blank at fz-juelich.de Fri Jan 11 11:54:10 2013
From: j.blank at fz-juelich.de (Joerg Blank)
Date: Fri, 11 Jan 2013 19:54:10 +0100
Subject: [torqueusers] Torque 4.1.4: Array job dependencies
Message-ID:

Hello everyone,

One of my users tried to use this script:

i=$(qsub sleeper1.sh)
qsub -t 0-1 -W depend=afterany:$i sleeper2.sh

This fails (silently, from the user's point of view) after sleeper1.sh has finished, with the log message:

1/11/2013 19:43:50;0080;PBS_Server.697;Req;req_reject;Reject reply code=15004(Invalid request MSG=Arrays may only be given array dependencies), aux=0, type=RegisterDependency, from @cluster

I do not see why an array job should not depend on a single job, as this worked fine on 2.5.x.

Regards,
Jörg Blank

From dbeer at adaptivecomputing.com Mon Jan 14 12:27:50 2013
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 14 Jan 2013 12:27:50 -0700
Subject: [torqueusers] Different configure lines
Message-ID:

All,

As you know, we have been in the process of switching TORQUE to compile with g++ instead of gcc. As part of this, we're trying to test all of the different configure options that people use. If you have a minute, can you please send in your configure line? We just want to make sure we aren't breaking things.

Thanks,

--
David Beer | Senior Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130114/16eb265e/attachment.html

From tbaer at utk.edu Mon Jan 14 12:42:35 2013
From: tbaer at utk.edu (Troy Baer)
Date: Mon, 14 Jan 2013 14:42:35 -0500
Subject: [torqueusers] Different configure lines
In-Reply-To:
References:
Message-ID: <1358192555.4494.9.camel@browncoat.jics.utk.edu>

On Mon, 2013-01-14 at 12:27 -0700, David Beer wrote:
> As you know we have been in the process of switching TORQUE to compile
> with g++ instead of gcc. As part of this, we're trying to test all of
> the different configure options that people use. If you have a minute,
> can you please send in your configure line? We just want to make sure
> we aren't breaking things.

At NICS, it varies quite a bit based on TORQUE version and system architecture. Here are our configure lines for recent builds of TORQUE 4.1.x:

# GPU cluster
export VERSION=4.1.2
./configure --prefix=/opt/torque/$VERSION \
    --includedir=/opt/torque/$VERSION/include \
    --disable-gcc-warnings \
    --with-server-home=/var/spool/torque \
    --enable-docs \
    --disable-rpp \
    --disable-shell-pipe \
    --enable-shell-use-argv \
    --enable-acct-x \
    --with-sched=no \
    --x-libraries=/usr/X11R6/lib64 \
    --enable-nvidia-gpus \
    --with-tcp-retry-limit=2 \
    --with-pam=/lib64/security \
    --enable-munge-auth \
    --enable-drmaa \
    --with-debug

# Cray XE/XK
export VERSION=4.1.3
export CFLAGS="-DTXT"
./configure --prefix=/opt/torque/$VERSION \
    --with-server-home=/var/spool/torque \
    --enable-docs \
    --disable-rpp \
    --with-sched=no \
    --enable-acct-x \
    --disable-drmaa \
    --with-tcp-retry-limit=2 \
    --enable-syslog \
    --enable-maxdefault \
    --disable-shell-pipe \
    --enable-shell-use-argv \
    --enable-munge-auth \
    --with-job_create \
    --disable-gcc-warnings \
    --includedir=/opt/torque/$VERSION/include \
    --with-debug

# SGI UV10/UV1000 (NUMA)
export VERSION=4.1.2
export CFLAGS="-I/opt/hwloc/default/include -DTXT"
export LDFLAGS="-L/opt/hwloc/default/lib -lhwloc"
export
LD_LIBRARY_PATH="/opt/hwloc/default/lib:$LD_LIBRARY_PATH"
./configure --prefix=/opt/torque/$VERSION \
    --disable-gcc-warnings \
    --with-server-home=/var/spool/torque \
    --enable-munge-auth \
    --disable-rpp \
    --enable-cpuset \
    --enable-numa-support \
    --enable-docs \
    --disable-shell-pipe --enable-shell-use-argv \
    --enable-acct-x \
    --disable-blcr \
    --disable-cpa \
    --disable-csa \
    --enable-job-create \
    --with-sched=no \
    --with-hwloc-path=/opt/hwloc/default \
    --enable-server \
    --enable-mom \
    --enable-clients \
    --enable-drmaa \
    --x-libraries=/usr/X11R6/lib64

	--Troy
--
Troy Baer, Senior HPC System Administrator
National Institute for Computational Sciences, University of Tennessee
http://www.nics.tennessee.edu/
Phone: 865-241-4233

From samuel at unimelb.edu.au Mon Jan 14 22:31:38 2013
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 15 Jan 2013 16:31:38 +1100
Subject: [torqueusers] Different configure lines
In-Reply-To:
References:
Message-ID: <50F4E9BA.8020705@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 15/01/13 06:27, David Beer wrote:

> As you know we have been in the process of switching TORQUE to compile
> with g++ instead of gcc. As part of this, we're trying to test all of
> the different configure options that people use. If you have a minute,
> can you please send in your configure line? We just want to make sure we
> aren't breaking things.
We use a script for easy automation: #!/bin/bash export SRCDIR="$(pwd)" export PREFIX="/usr/local/$(basename $(pwd) | sed 's#-#/#')" # --enable-cpuset - turns on Linux cpuset support # --with-pam - build the Torque PAM module to see if user has job on node # --with-sched=no - we use Moab, no need to build pbs_sched ./configure --enable-cpuset --prefix=${PREFIX} --with-pam --with-sched=no - -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlD06boACgkQO2KABBYQAh/ONQCghwQltXxAfTtREnC6afEOCTCs kMkAn1XQaChuN5lkhZUDkuQRkexUrkoa =npTw -----END PGP SIGNATURE----- From samuel at unimelb.edu.au Mon Jan 14 23:10:28 2013 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 15 Jan 2013 17:10:28 +1100 Subject: [torqueusers] NHC check for mcelog errors (Christopher Samuel) In-Reply-To: <50ED2394.5010106@fysik.dtu.dk> References: <50ED1468.5090105@unimelb.edu.au> <50ED2394.5010106@fysik.dtu.dk> Message-ID: <50F4F2D4.9060001@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 09/01/13 19:00, Ole Holm Nielsen wrote: > Christopher Samuel wrote: > >> We're running RHEL 5 so it doesn't do that. It's worth checking >> though if it's actually consuming CPU or just sitting in a >> device wait. If it's just in a device wait then it won't have >> any impact. > > We see the extra CPU load on both CentOS5 and CentOS6 nodes. Very odd. We don't see that here on RHEL5. memlogd does add a load of 1 to a node, but it's not using any CPU. > How does one determine if a CPU load is "harmless"? "top" is your friend, test on a completely idle node and use 'i' to tell it to not show idle tasks. Things that are in 'D' are in a device wait. 
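
The 'D' state check described above can also be scripted. The following is only a minimal sketch, not part of anyone's actual setup: the function names are made up for illustration, and it assumes a procps-style ps that accepts the state, pid and comm output keywords. It relies on the fact that the Linux load average counts tasks in uninterruptible sleep as well as runnable ones, so a task parked in 'D' raises the load figure without consuming CPU.

```shell
#!/bin/sh
# Hypothetical helper: read "STATE PID COMMAND" lines on stdin and keep
# only tasks in uninterruptible sleep (state "D"), printing "PID COMMAND".
filter_d_state() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Hypothetical helper: list the D-state tasks on the local node.
# Each "-o keyword=" prints that column without a header line.
list_d_state() {
    ps -e -o state= -o pid= -o comm= | filter_d_state
}
```

Running list_d_state on an otherwise idle node and comparing the number of lines with the load average gives a rough idea of how much of the load is due to device waits rather than CPU use.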
> We're also seeing extra CPU loads of about 0.5 on our new nodes
> with QDR Infiniband adapters, perhaps that's also due to device
> waits?

It is possible.

> We check the compute nodes' CPU load in order to identify badly
> behaving jobs, so even "harmless" CPU loads disturb our
> monitoring.

We use cpusets to restrict jobs to just the cores they were allocated, and Open-MPI with TM support to make sure MPI jobs only launch where they are supposed to. If a job uses more cores than it's meant to inside its cpuset then it only harms itself.

[IPMI]

>> I think that depends on how often you run them, we run ours every
>> 15 minutes and they write to a file in /dev/shm. The check that
>> pbs_mom runs just cats that file (or produces an error if it's
>> not present).
>
> Interesting! Perhaps you could share your scripts and setup with
> the Torque community?

Could do, would need to find a good way to put them up though.

> Do you see any job performance impacts due to running IPMI
> commands on busy nodes?

We've not had any complaints, but then we're catering to life sciences where it's not uncommon to see people running Java, R, Perl and Python as their HPC codes.. :-(

> Yes, but our (older) BMCs log a lot of irrelevant stuff which do
> not warrant the offlining of the nodes. How would one identify
> genuinely broken hardware from the BMC error logs?

At the moment we just look for errors we've seen in the past, and check that the log isn't full. Not perfect, but it's worked OK so far. :-)

cheers!
Chris - -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlD08tQACgkQO2KABBYQAh969ACfT0mPJcwNY4iPlobsEeYtOPJs OLAAnizeaXwIjbBJescOzD6DnsqXWBrY =/AnR -----END PGP SIGNATURE----- From bdandrus at nps.edu Tue Jan 15 11:23:40 2013 From: bdandrus at nps.edu (Andrus, Brian Contractor) Date: Tue, 15 Jan 2013 18:23:40 +0000 Subject: [torqueusers] Max interactive jobs? Message-ID: All, Is there a way in torque or moab to limit the total number of interactive jobs per user? I am not seeing one. Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 From ezellma at ornl.gov Tue Jan 15 15:10:40 2013 From: ezellma at ornl.gov (Ezell, Matthew A.) Date: Tue, 15 Jan 2013 17:10:40 -0500 Subject: [torqueusers] Different configure lines In-Reply-To: Message-ID: On 1/14/13 2:27 PM, "David Beer" wrote: >If you have a minute, can you please send in your configure line? On our Crays it looks something like: ./configure --prefix=/opt/torque/4.1.4 \ --with-server-home=/var/spool/torque \ --with-default-server=my.server.name \ --with-debug \ --with-job-create \ --with-tcp-retry-limit=200 \ CFLAGS="-DCRAY_MOAB_PASSTHRU -DTXT -DRERUNUSAGE -DUSESAVEDRESOURCES -DQSUBHOSTNAME" --- Matt Ezell HPC Systems Administrator Oak Ridge National Laboratory From carlos.borroto at gmail.com Tue Jan 15 15:22:50 2013 From: carlos.borroto at gmail.com (Carlos Borroto) Date: Tue, 15 Jan 2013 17:22:50 -0500 Subject: [torqueusers] Bad UID for job execution MSG=ruserok failed validating ... In-Reply-To: References: Message-ID: On Wed, Nov 7, 2012 at 5:21 PM, Carlos Borroto wrote: > rsh is not installed in any of the systems, I will like to keep it > like that. 
The system users are from LDAP, in case this is relevant.
> The exact same configuration but using torque 2.5 packages from EPEL
> which require munge, does work. I'm ok using munge, but haven't
> figured out why with the packages I'm compiling do not work. Can't
> even tell if munge is being query at all by torque.

OK, it took me some time, but I figured out why my rpm packages didn't support munge. I didn't realize '--with munge' needed to be set in the Makefile. I thought that by doing './configure --enable-munge-auth; make rpm' I was enabling munge authentication. This is not the case.

In summary, changing this line in the Makefile:
RPM_AC_OPTS = --with syslog --with scp --without pam --with drmaa

to:
RPM_AC_OPTS = --with syslog --with scp --without pam --with drmaa --with munge

will do the trick. I still don't know why the other two options for avoiding 'ruserok' are not working. Either way, using munge works for me, so I'm a happy camper.

Regards,
Carlos

From mej at lbl.gov Tue Jan 15 18:33:48 2013
From: mej at lbl.gov (Michael Jennings)
Date: Tue, 15 Jan 2013 17:33:48 -0800
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To: <50EA6348.6060101@vpac.org>
References: <20121210192541.GM8827@lbl.gov> <20121210213613.GN8827@lbl.gov> <9BB8F931-7D31-4C37-B68C-04191B237265@anl.gov> <50EA6348.6060101@vpac.org>
Message-ID: <20130116013347.GP8827@lbl.gov>

On Monday, 07 January 2013, at 16:55:20 (+1100), Craig West wrote:

> > Tracing "bash -x ./configure", I found that the script needs to be
> > told to actually *use* pkg-config, otherwise on gets numerous
> > occasions of "pkg_failed=untried". Thus:
> >
> > PKG_CONFIG=pkg-config ./configure ?
>
> I tested your method above and it was able to run the configure
> (followed by a make) without the need for the autogen.
>
> This seems strange, and something that was not happening with torque
> 4.1.3. Has anyone raised a bug about this? Does Adaptive have an
> internal bug?
I'm still unable to reproduce this behavior, but I think I've found the cause. Try this patch:

-------------------------------
diff --git a/configure.ac b/configure.ac
index 22a3878..224c8be 100644
--- a/configure.ac
+++ b/configure.ac
@@ -78,7 +78,7 @@ else
 AC_DEFINE_UNQUOTED(SVN_VERSION, ["unknown"], [repository svn version])
 fi
-
+PKG_PROG_PKG_CONFIG
 dnl
 dnl ######################################################################
--------------------------------

Let me know if that works. If so, I'll submit it upstream to Adaptive.

It's poorly documented, but if PKG_CHECK_MODULES() is invoked for the first time inside a conditional, PKG_PROG_PKG_CONFIG must be invoked unconditionally prior to any PKG_CHECK_MODULES() calls.

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615

From samuel at unimelb.edu.au Tue Jan 15 22:02:17 2013
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Wed, 16 Jan 2013 16:02:17 +1100
Subject: [torqueusers] Different configure lines
In-Reply-To: <50F4E9BA.8020705@unimelb.edu.au>
References: <50F4E9BA.8020705@unimelb.edu.au>
Message-ID: <50F63459.9000707@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 15/01/13 16:31, Christopher Samuel wrote:

> ./configure --enable-cpuset --prefix=${PREFIX} --with-pam
> --with-sched=no

You can add --enable-drmaa onto that - needed today for some of our folks who are doing bioinformatics pipeline development.
cheers, Chris - -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlD2NFkACgkQO2KABBYQAh92qQCeNLc5jltVvB/cz1kevKhgBUy3 sqkAn0b4xbFzqftURjWdDWkQWZBds2cT =DpAs -----END PGP SIGNATURE----- From andre.gemuend at scai.fraunhofer.de Wed Jan 16 03:02:56 2013 From: andre.gemuend at scai.fraunhofer.de (=?utf-8?Q?Andr=C3=A9_Gem=C3=BCnd?=) Date: Wed, 16 Jan 2013 11:02:56 +0100 (CET) Subject: [torqueusers] Different configure lines In-Reply-To: Message-ID: <727886812.1045040.1358330576991.JavaMail.root@scai.fraunhofer.de> We are using rpmbuild -ta --define 'acflags --enable-cpuset --prefix=/usr' --with munge --with pam --define 'torque_server foo' torque-2.5.12.tar.gz here. Greetings ----- Urspr?ngliche Mail ----- > > All, > > > As you know we have been in the process of switching TORQUE to > compile with g++ instead of gcc. As part of this, we're trying to > test all of the different configure options that people use. If you > have a minute, can you please send in your configure line? We just > want to make sure we aren't breaking things. > > > Thanks, > > > -- > > David Beer | Senior Software Engineer > Adaptive Computing > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Andr? Gem?nd Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemuend at scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend From knielson at adaptivecomputing.com Wed Jan 16 09:38:05 2013 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Wed, 16 Jan 2013 09:38:05 -0700 Subject: [torqueusers] Max interactive jobs? 
In-Reply-To:
References:
Message-ID:

You could set max_user_queuable, but that would limit everyone.

Ken

On Tue, Jan 15, 2013 at 11:23 AM, Andrus, Brian Contractor wrote:

> All,
>
> Is there a way in torque or moab to limit the total number of interactive
> jobs per user? I am not seeing one.
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238

From mamonski at man.poznan.pl Wed Jan 16 09:50:41 2013
From: mamonski at man.poznan.pl (Mariusz Mamoński)
Date: Wed, 16 Jan 2013 17:50:41 +0100
Subject: [torqueusers] Max interactive jobs?
In-Reply-To:
References:
Message-ID:

What about creating a new queue, e.g. interactive, setting the max_user... limit on it as Ken suggests, and setting the following in all other queues:

set queue long disallowed_types = interactive

?

On 15 January 2013 19:23, Andrus, Brian Contractor wrote:
> All,
>
> Is there a way in torque or moab to limit the total number of interactive jobs per user? I am not seeing one.
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238

--
Mariusz

From mej at lbl.gov Wed Jan 16 11:39:16 2013
From: mej at lbl.gov (Michael Jennings)
Date: Wed, 16 Jan 2013 10:39:16 -0800
Subject: [torqueusers] Bad UID for job execution MSG=ruserok failed validating ...
In-Reply-To: References: Message-ID: <20130116183915.GX8827@lbl.gov> On Tuesday, 15 January 2013, at 17:22:50 (-0500), Carlos Borroto wrote: > Ok, it took me some time, but I figured why my rpm packages didn't > support munge. I didn't realize '--with munge' needed to be set in the > Makefile. I thought by doing './configure --enable-munge-auth; make > rpm' I was enabling munge authentication. This is not the case. > > In summary changing this line in the Makefile: > RPM_AC_OPTS = --with syslog --with scp --without pam --with drmaa > > to: > RPM_AC_OPTS = --with syslog --with scp --without pam --with drmaa --with munge > > Will do the trick. I still don't know why the other two options to not > use 'ruserok' are not working. Either way, using munge works for me, > so I'm a happy camper. This is the wrong way to do this. Your changes will be lost next time ./configure gets run. One way would be to make a similar change to Makefile.am and re-run autogen.sh and ./configure. However, the most correct way would be to modify configure.ac to support adding "--with munge" to RPM_AC_OPTS when munge support is activated. Here's an example done with syslog support: AC_ARG_ENABLE(syslog, [ --enable-syslog enable (default) the use of syslog for error reporting], [case "${enableval}" in yes) SYSLOG=1 ; RPM_AC_OPTS="$RPM_AC_OPTS --with syslog" ;; no) SYSLOG=0 ; RPM_AC_OPTS="$RPM_AC_OPTS --without syslog" ;; *) AC_MSG_ERROR(--enable-syslog cannot take a value.) ;; esac],[SYSLOG=1 ; RPM_AC_OPTS="$RPM_AC_OPTS --with syslog"])dnl AC_DEFINE_UNQUOTED(SYSLOG, ${SYSLOG}, [Define to enable syslog]) AM_CONDITIONAL(USING_SYSLOG, [test "$SYSLOG" = "1"]) The above example shows how the appropriate --with (or --without) option is appended to RPM_AC_OPTS based on whether or not syslog support is enabled by ./configure options. Make a similar change for munge support and e-mail the patch to the list. I can take it from there. 
:-) The same applies to any particular feature for which rpmbuild support is present (i.e., --with/--without ) but activating support on the ./configure command line (e.g., --enable-foo or --with-foo) does not affect the rpmbuild command invoked via "make rpm." Thanks! Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From mej at lbl.gov Wed Jan 16 11:43:28 2013 From: mej at lbl.gov (Michael Jennings) Date: Wed, 16 Jan 2013 10:43:28 -0800 Subject: [torqueusers] Different configure lines In-Reply-To: <727886812.1045040.1358330576991.JavaMail.root@scai.fraunhofer.de> References: <727886812.1045040.1358330576991.JavaMail.root@scai.fraunhofer.de> Message-ID: <20130116184327.GY8827@lbl.gov> On Wednesday, 16 January 2013, at 11:02:56 (+0100), Andr? Gem?nd wrote: > We are using > > rpmbuild -ta --define 'acflags --enable-cpuset --prefix=/usr' --with munge --with pam --define 'torque_server foo' torque-2.5.12.tar.gz Just as an FYI, the "cleanest" and "most correct" way to do this would be: rpmbuild -ta --define '_prefix /usr' --define 'torque_server foo' --with cpuset --with munge --with pam torque-2.5.12.tar.gz HTH, Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From andre.gemuend at scai.fraunhofer.de Wed Jan 16 12:51:05 2013 From: andre.gemuend at scai.fraunhofer.de (=?utf-8?Q?Andr=C3=A9_Gem=C3=BCnd?=) Date: Wed, 16 Jan 2013 20:51:05 +0100 (CET) Subject: [torqueusers] Different configure lines In-Reply-To: <20130116184327.GY8827@lbl.gov> Message-ID: <1595584213.1057804.1358365865506.JavaMail.root@scai.fraunhofer.de> Last time I tried that the required macros were not available for prefix and cpuset, but it might be different right now... 
----- Urspr?ngliche Mail ----- > On Wednesday, 16 January 2013, at 11:02:56 (+0100), > Andr? Gem?nd wrote: > > > We are using > > > > rpmbuild -ta --define 'acflags --enable-cpuset --prefix=/usr' > > --with munge --with pam --define 'torque_server foo' > > torque-2.5.12.tar.gz > > Just as an FYI, the "cleanest" and "most correct" way to do this > would be: > > rpmbuild -ta --define '_prefix /usr' --define 'torque_server foo' > --with cpuset --with munge --with pam torque-2.5.12.tar.gz > > HTH, > Michael > > -- > Michael Jennings > Senior HPC Systems Engineer > High-Performance Computing Services > Lawrence Berkeley National Laboratory > Bldg 50B-3209E W: 510-495-2687 > MS 050B-3209 F: 510-486-8615 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Andr? Gem?nd Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemuend at scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend From mej at lbl.gov Wed Jan 16 13:14:05 2013 From: mej at lbl.gov (Michael Jennings) Date: Wed, 16 Jan 2013 12:14:05 -0800 Subject: [torqueusers] Different configure lines In-Reply-To: <1595584213.1057804.1358365865506.JavaMail.root@scai.fraunhofer.de> References: <20130116184327.GY8827@lbl.gov> <1595584213.1057804.1358365865506.JavaMail.root@scai.fraunhofer.de> Message-ID: <20130116201404.GZ8827@lbl.gov> On Wednesday, 16 January 2013, at 20:51:05 (+0100), Andr? Gem?nd wrote: > Last time I tried that the required macros were not available for > prefix and cpuset, but it might be different right now... Manipulating _prefix has always been supported, but you're correct that support for "--with cpuset" was not present in older spec files. I checked the "2.5.12" branch in Git, and it seems to be there, so it should work fine. 
:-) Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From thomas.zeiser at rrze.uni-erlangen.de Thu Jan 17 01:43:30 2013 From: thomas.zeiser at rrze.uni-erlangen.de (thomas.zeiser at rrze.uni-erlangen.de) Date: Thu, 17 Jan 2013 09:43:30 +0100 Subject: [torqueusers] Problems with torque/maui mailing lists: missing PTR for supercluster.org IP Message-ID: <20130117084330.GA15862@rrze.uni-erlangen.de> Hello, since probably beginning of December, I have severe problems receiving emails from the torque and maui mailing lists. Our mail server refuses to accept any emails from supercluster.org because of a missing PTR Resource Record in DNS. Because of the missing PTR, reverse DNS lookups do not work for supercluster.org's IP. Jan 16 11:01:58 bolte82 postfix-rz-in/smtpd[26316]: NOQUEUE: reject: RCPT from unknown[68.178.11.139]: 550 5.7.1 Client host rejected: cannot find your reverse hostname, [68.178.11.139]; from= to= proto=SMTP helo= $ host supercluster.org supercluster.org has address 68.178.11.139 supercluster.org mail is handled by 0 supercluster.org. supercluster.org mail is handled by 10 clusterresources.com. $ nslookup 68.178.11.139 ** server can't find 139.11.178.68.in-addr.arpa.: NXDOMAIN Google's Public DNS servers tells exactly the same: $ nslookup 68.178.11.139 - 8.8.8.8 Server: 8.8.8.8 Address: 8.8.8.8#53 ** server can't find 139.11.178.68.in-addr.arpa.: NXDOMAIN Thus, it does not seem to be a local issue at our side but something Adaptive Computing (or who ever owns supercluster.org) has to fix. Kind regards, Thomas Zeiser From roberto.nunnari at supsi.ch Thu Jan 17 07:44:13 2013 From: roberto.nunnari at supsi.ch (Roberto Nunnari) Date: Thu, 17 Jan 2013 15:44:13 +0100 Subject: [torqueusers] building a department GPU cluster Message-ID: <50F80E3D.1050302@supsi.ch> Hi all. 
I'm writing to you to ask for advice or a hint to the right direction. In our department, more and more researchers ask us (IT administrators) to assemble (or to buy) GPGPU powered workstations to do parallel computing. As I already manage a small CPU cluster (resources managed using SGE), with my boss we talked about building a new GPU cluster. The problem is that I have no experience at all with GPU clusters. Apart from the already running GPU workstations, we already have some new HW that looks promising to me as a starting point for temporary building and testing a GPU cluster. - 1x Dell PowerEdge R720 - 1x Dell PowerEdge C410x - 1x NVIDIA M2090 PCIe x16 - 1x NVIDIA iPASS Cable Kit I'd be grateful if you could kindly give me some advice and/or hint to the right direction. In particular I'm interested on your opinion on: 1) is the above HW suitable for a small (2 to 4/6 GPUs) GPU cluster? 2) is torque suitable (or what should we use?) as a queuing and resource management system? We would like the cluster to be usable by many users at once in a way that no user has to worry about resources, just like we do on the CPU cluster with SGE. 3) What distribution of linux would be more appropriate? 4) necessary stack of sw? (cuda, torque, hadoop?, other?) Thank you very much for your valuable insight! Best regards. Robi From dbeer at adaptivecomputing.com Thu Jan 17 11:56:10 2013 From: dbeer at adaptivecomputing.com (David Beer) Date: Thu, 17 Jan 2013 11:56:10 -0700 Subject: [torqueusers] Problems with torque/maui mailing lists: missing PTR for supercluster.org IP In-Reply-To: <20130117084330.GA15862@rrze.uni-erlangen.de> References: <20130117084330.GA15862@rrze.uni-erlangen.de> Message-ID: I have passed this information along to our IT department and I'll let you know once it is fixed. David On Thu, Jan 17, 2013 at 1:43 AM, wrote: > Hello, > > since probably beginning of December, I have severe problems > receiving emails from the torque and maui mailing lists. 
Our mail > server refuses to accept any emails from supercluster.org because > of a missing PTR Resource Record in DNS. Because of the missing > PTR, reverse DNS lookups do not work for supercluster.org's IP. > > Jan 16 11:01:58 bolte82 postfix-rz-in/smtpd[26316]: NOQUEUE: reject: > RCPT from unknown[68.178.11.139]: 550 5.7.1 Client host rejected: > cannot find your reverse hostname, [68.178.11.139]; > from= > to= proto=SMTP > helo= > > > $ host supercluster.org > supercluster.org has address 68.178.11.139 > supercluster.org mail is handled by 0 supercluster.org. > supercluster.org mail is handled by 10 clusterresources.com. > > > $ nslookup 68.178.11.139 > ** server can't find 139.11.178.68.in-addr.arpa.: NXDOMAIN > > > Google's Public DNS servers tells exactly the same: > > $ nslookup 68.178.11.139 - 8.8.8.8 > Server: 8.8.8.8 > Address: 8.8.8.8#53 > ** server can't find 139.11.178.68.in-addr.arpa.: NXDOMAIN > > > Thus, it does not seem to be a local issue at our side but > something Adaptive Computing (or who ever owns supercluster.org) > has to fix. > > > Kind regards, > > Thomas Zeiser > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130117/40a5b6b2/attachment.html From roberto.nunnari at supsi.ch Fri Jan 18 00:59:03 2013 From: roberto.nunnari at supsi.ch (Roberto Nunnari) Date: Fri, 18 Jan 2013 08:59:03 +0100 Subject: [torqueusers] building a department GPU cluster In-Reply-To: <50F80E3D.1050302@supsi.ch> References: <50F80E3D.1050302@supsi.ch> Message-ID: <50F900C7.60609@supsi.ch> Roberto Nunnari wrote: > Hi all. > > I'm writing to you to ask for advice or a hint to the right direction. 
Anybody on this, please?
Robi

From brockp at umich.edu Fri Jan 18 14:15:17 2013
From: brockp at umich.edu (Brock Palen)
Date: Fri, 18 Jan 2013 16:15:17 -0500
Subject: [torqueusers] building a department GPU cluster
In-Reply-To: <50F900C7.60609@supsi.ch>
References: <50F80E3D.1050302@supsi.ch> <50F900C7.60609@supsi.ch>
Message-ID: <43594EA6-66E2-47A7-B590-D236EE9B4665@umich.edu>

> Roberto Nunnari wrote:
>> Hi all.
>>
>> I'm writing to you to ask for advice or a hint to the right direction.
>>
>> In our department, more and more researchers ask us (IT administrators)
>> to assemble (or to buy) GPGPU powered workstations to do parallel computing.
>>
>> As I already manage a small CPU cluster (resources managed using SGE),
>> with my boss we talked about building a new GPU cluster. The problem is
>> that I have no experience at all with GPU clusters.

I think SGE (OGE) understands allocating GPUs; I would check with them.
If they have problems, Torque does support GPU allocation; you can see an
example in our docs:
http://cac.engin.umich.edu/resources/software/cuda.html

>>
>> Apart from the already running GPU workstations, we already have some
>> new HW that looks promising to me as a starting point for temporary
>> building and testing a GPU cluster.
>>
>> - 1x Dell PowerEdge R720
>> - 1x Dell PowerEdge C410x
>> - 1x NVIDIA M2090 PCIe x16
>> - 1x NVIDIA iPASS Cable Kit

This is really outside the scope of this list, which is for the Torque
resource manager. You probably want to get on a generic HPC list or ping
me offline to keep the list on topic.

I have all that gear and I would call it all last generation.

>>
>> I'd be grateful if you could kindly give me some advice and/or hint to
>> the right direction.
>>
>> In particular I'm interested on your opinion on:
>> 1) is the above HW suitable for a small (2 to 4/6 GPUs) GPU cluster?

Yes, sub-optimally, but I wouldn't buy M20xx cards; I would only get
K20-based cards now that they are available.

>> 2) is torque suitable (or what should we use?) as a queuing and resource
>> management system? We would like the cluster to be usable by many users
>> at once in a way that no user has to worry about resources, just like we
>> do on the CPU cluster with SGE.

Torque does work well and supports very large systems (ours is 1200 nodes,
14,000 CPU cores and a few hundred GPUs); we pair it with Moab for
scheduling.

>> 3) What distribution of linux would be more appropriate?

Doesn't really matter, but stick with the main ones: RedHat, CentOS, Debian.
To use the GPUs you have to install Nvidia's binary drivers; they won't
test on every distribution, and developers for the others won't have the
source access to make it work.

>> 4) necessary stack of sw? (cuda, torque, hadoop?, other?)

To just have GPUs, only CUDA and the Nvidia driver are required (read
about nvidia-smi heavily), plus Torque (or SGE in your case) if you want
to use a batch queue system.

Hadoop doesn't even belong in this discussion currently.

>>
>> Thank you very much for your valuable insight!
>>
>> Best regards.
>> Robi
>
> Anybody on this, please?
> Robi

From hector.ohhm at gmail.com Sat Jan 19 07:22:02 2013
From: hector.ohhm at gmail.com (Hector Oliver)
Date: Sat, 19 Jan 2013 08:22:02 -0600
Subject: [torqueusers] building a department GPU cluster
In-Reply-To: <50F80E3D.1050302@supsi.ch>
References: <50F80E3D.1050302@supsi.ch>
Message-ID:

Hello, I'm Hector Oliver. I can suggest some configurations, but the most
important question is: which GPUs do you plan to use?

1) is the above HW suitable for a small (2 to 4/6 GPUs) GPU cluster?

Which GPUs do you plan to use? If you use Tesla Fermi cards, do not use
more than 2 per workstation. If you use Kepler, you can use more than 2
per workstation. I suggest at least 4 GB of memory per CPU core.

2) is torque suitable (or what should we use?) as a queuing and resource
management system? We would like the cluster to be usable by many users
at once in a way that no user has to worry about resources, just like we
do on the CPU cluster with SGE.

Torque is nice, but the latest releases

3) What distribution of linux would be more appropriate?

Whichever you prefer (CentOS, etc.)

4) necessary stack of sw? (cuda, torque, hadoop?, other?)

Yes, the CUDA Toolkit.

Best regards in advance.

From jjc at iastate.edu Mon Jan 21 10:07:09 2013
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Mon, 21 Jan 2013 17:07:09 +0000
Subject: [torqueusers] resources_available.mem not able to go above 2TB
Message-ID: <242421BFAF465844BE24EB90BB97E2210FA34CF5@ITSDAG1D.its.iastate.edu>

With new compute nodes of 256GB, I have tried setting
resources_available.mem above 2TB (2048GB), but it seems to wrap at that
point. I assume that the problem is something like this being saved as an
int in units of KB so that 2 billion KB is the max.

Is resources_available.mem needed? (Can it be unset or set to -1 or
something so that it is not enforced?)

Is this problem addressed in a newer version of torque? (I am currently
running 2.5.4 and setting up version 4 for a newer cluster coming in.)

Thanks,

Jim Coyle
Research Computing Group
115 Durham Center
http://jjc.public.iastate.edu/
Iowa State Univ.
Ames Iowa 50011

From bdandrus at nps.edu Mon Jan 21 13:53:13 2013
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Mon, 21 Jan 2013 20:53:13 +0000
Subject: [torqueusers] Max interactive jobs?
In-Reply-To:
References:
Message-ID:

I had thought about that, but we use a routing queue based upon walltime.
Hmm. I'll have to see if I can have a queue that only accepts interactive
(so non-interactive cannot specify the interactive queue) and then it could
work.

Brian

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Mariusz Mamonski
Sent: Wednesday, January 16, 2013 8:51 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Max interactive jobs?

what about creating a new queue, e.g.
interactive, setting the max_user... as Ken suggests, and setting in all
other queues:

set queue long disallowed_types = interactive ?

On 15 January 2013 19:23, Andrus, Brian Contractor wrote:
> All,
>
> Is there a way in torque or moab to limit the total number of interactive jobs per user? I am not seeing one.
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238

--
Mariusz

From thomas.exner at uni-konstanz.de Mon Jan 21 14:04:03 2013
From: thomas.exner at uni-konstanz.de (Thomas Exner)
Date: Mon, 21 Jan 2013 22:04:03 +0100
Subject: [torqueusers] torque server on a virtual machine
Message-ID: <50FDAD43.2020306@uni-konstanz.de>

Dear torque users and developers:

I have a very strange problem. I set up a kernel-based virtual machine
(KVM) on an OpenSuse 12.1 system. This virtual machine should be used as
the torque server for a small cluster. Installation went fine, but when
the first client tries to connect, I get thousands of these error
messages:

PBS_Server;Svr;PBS_Server;LOG_ERROR::is_request, bad attempt to connect from 10.0.2.2:1023 (address not trusted - check entry in server_priv/nodes)

The funny thing is that the IP address is not the IP of the client
(10.0.2.11) but of the host running the virtual machine (10.0.2.2). Thus
it seems that torque thinks that the request is coming from the KVM-server
host and not the torque client. But there is definitely no pbs_mom running
on 10.0.2.2. If 10.0.2.2 is added to the nodes file, pbsnodes shows this
machine but with the data of 10.0.2.11.

This definitely has something to do with the network setup needed to get
KVM to work.
For this a network bridge is needed. Does anyone know if this bridge is
the problem and if I can configure it to send the original IP?
Unfortunately, my knowledge about such a setting is very limited. But
everything except torque seems to run nicely. Any other idea?

Thank you very much.
Thomas

--
________________________________________________________________________________
Dr. Thomas E. Exner
Theoretische Pharmazeutische Chemie & Biophysik
Lehrstuhl Pharmazeutische Chemie
Pharmazeutisches Institut
Eberhard Karls Universität Tübingen
Auf der Morgenstelle 8 (Haus B)
72076 Tübingen
Germany
Tel.: +49-(0)7071-2976969
Mobil: +49-(0)171-3807485
Fax: +49-(0)7071-295637
E-Mail: Thomas.Exner[at]uni-tuebingen.de

Fachbereich Chemie und Zukunftskolleg
Universität Konstanz
78457 Konstanz
Germany
Tel.: +49-(0)7531-882015
Fax: +49-(0)7531-883587
E-Mail: Thomas.Exner[at]uni-konstanz.de
________________________________________________________________________________

From siegert at sfu.ca Mon Jan 21 14:12:08 2013
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 21 Jan 2013 13:12:08 -0800
Subject: [torqueusers] Max interactive jobs?
In-Reply-To:
References:
Message-ID: <20130121211208.GA7272@stikine.sfu.ca>

Hi Brian,

On Mon, Jan 21, 2013 at 08:53:13PM +0000, Andrus, Brian Contractor wrote:
>
> I had thought about that, but we use a routing queue based upon walltime.
> Hmm. I'll have to see if I can have a queue that only accepts interactive
> (so non-interactive cannot specify the interactive queue) and then it could
> work.
>
> Brian

we have a routing queue that routes to several queues based on procct
and to an interactive queue, basically:

create queue default
set queue default queue_type = Route
set queue default route_destinations = q1
set queue default route_destinations += qs
set queue default route_destinations += ql
set queue default route_destinations += interactive
...
create queue qs
set queue qs queue_type = Execution
set queue qs from_route_only = True
set queue qs resources_max.procct = 128
set queue qs resources_min.procct = 2
set queue qs disallowed_types = interactive
...
create queue interactive
set queue interactive queue_type = Execution
set queue interactive disallowed_types = batch
set queue interactive max_user_queuable = 2
...

and in the nodes file we have reserved nodes for interactive jobs with
the interactive feature:

...
b399 np=12 x5650 batch
b400 np=12 x5650 batch
b402 np=24 interactive
b403 np=24 interactive

whereas all other nodes have the batch feature set.
This way users only need to do "qsub -I submitscript.pbs" and the job
will end up on one of the interactive nodes.

Cheers,
Martin

--
Martin Siegert
Simon Fraser University
Burnaby, British Columbia

From samuel at unimelb.edu.au Mon Jan 21 16:54:22 2013
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 22 Jan 2013 10:54:22 +1100
Subject: [torqueusers] Max interactive jobs?
In-Reply-To:
References:
Message-ID: <50FDD52E.5020104@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 16/01/13 05:23, Andrus, Brian Contractor wrote:

> Is there a way in torque or moab to limit the total number of
> interactive jobs per user? I am not seeing one.

Might be worth investigating if Moab lets you map interactive jobs to
a class and then limit via that?

Worth asking on the moabusers list I'd suggest as it's really a
scheduling decision rather than a RM one.

cheers,
Chris
- -- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlD91S4ACgkQO2KABBYQAh8EdACcCcJMHGpDuJ04070MG+qFBTbW
7bwAn3MTj039v1jXbKaulo8+57YTTm8/
=5iQS
-----END PGP SIGNATURE-----

From knielson at adaptivecomputing.com Mon Jan 21 19:24:02 2013
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Mon, 21 Jan 2013 19:24:02 -0700
Subject: [torqueusers] TORQUE 4.2.0 on GitHub but not ready for GA
Message-ID:

Hi all,

Those of you who are familiar with TORQUE on GitHub may have noticed a
4.2.0 branch was added this evening. This is not yet ready for release. We
still need to do some testing and bug fixes. You are welcome to try it in
test environments and let us know of any problems you find. But please do
not deploy this into production.

Thanks

Ken Nielson
Adaptive Computing

From roberto.nunnari at supsi.ch Tue Jan 22 06:27:01 2013
From: roberto.nunnari at supsi.ch (Roberto Nunnari)
Date: Tue, 22 Jan 2013 14:27:01 +0100
Subject: [torqueusers] building a department GPU cluster
In-Reply-To: <43594EA6-66E2-47A7-B590-D236EE9B4665@umich.edu>
References: <50F80E3D.1050302@supsi.ch> <50F900C7.60609@supsi.ch> <43594EA6-66E2-47A7-B590-D236EE9B4665@umich.edu>
Message-ID: <50FE93A5.10605@supsi.ch>

Hi all.

Thank you very much for your time and help.

Best regards.
Robi

From msbritt at umich.edu Tue Jan 22 12:22:07 2013
From: msbritt at umich.edu (Matthew Britt)
Date: Tue, 22 Jan 2013 14:22:07 -0500
Subject: [torqueusers] PBS environmental variables and -V
Message-ID: <6E440804-F6BF-4609-86C8-662DAADEDA82@umich.edu>

Hello everyone. What is the expected precedence between the PBS
environment variables and -V when a job is submitted from within an
interactive job? With torque 4.1.4 (and possibly earlier), the variables
from the environment of the interactive shell are passed along with the
newly submitted job (either interactive or batch). We've seen both
PBS_O_HOST and PBS_O_WORKDIR be set to the values of the first job rather
than the attributes of the second job.

As an example:

[msbritt at nyx ~]$ cd bin
[msbritt at nyx bin]$ pwd
/home/msbritt/bin
[msbritt at nyx bin]$ qsub -I -l nodes=1,walltime=5:00 -q flux -A msbritt_flux -V
qsub: waiting for job 9445802.nyx.engin.umich.edu to start
qsub: job 9445802.nyx.engin.umich.edu ready

[msbritt at nyx5515 ~]$ echo $PBS_O_WORKDIR
/home/msbritt/bin
[msbritt at nyx5515 ~]$ echo $PBS_O_HOST
nyx.engin.umich.edu
[msbritt at nyx5515 ~]$ pwd
/home/msbritt

[msbritt at nyx5515 ~]$ qsub -I -l nodes=1,walltime=5:00 -q flux -A msbritt_flux -V
qsub: waiting for job 9445813.nyx.engin.umich.edu to start
qsub: job 9445813.nyx.engin.umich.edu ready

[msbritt at nyx5623 ~]$ pwd
/home/msbritt
[msbritt at nyx5623 ~]$ echo $PBS_O_WORKDIR
/home/msbritt/bin (arguably should be /home/msbritt)
[msbritt at nyx5623 ~]$ echo $PBS_O_HOST
nyx.engin.umich.edu (arguably should be nyx5515)

Should -V not read the PBS_O_* variables on job submission or at least be
overwritten and correctly set in the next job, or should -V trump?

Thanks,
- Matt

--------------------------------------------
Matthew Britt
CAEN HPC Group - College of Engineering
msbritt at umich.edu

From glen.beane at gmail.com Tue Jan 22 13:42:02 2013
From: glen.beane at gmail.com (Glen Beane)
Date: Tue, 22 Jan 2013 15:42:02 -0500
Subject: [torqueusers] PBS environmental variables and -V
In-Reply-To: <6E440804-F6BF-4609-86C8-662DAADEDA82@umich.edu>
References: <6E440804-F6BF-4609-86C8-662DAADEDA82@umich.edu>
Message-ID:

On Tue, Jan 22, 2013 at 2:22 PM, Matthew Britt wrote:
> Hello everyone. What is the expected behavior in precedence between PBS environmental variables and -V when a job is submitted from an interactive job. With torque 4.1.4 (and possibly earlier), the variables from the environment of the interactive shell are passed along w/ the newly submitted job (either interactive or batch). We've seen both PBS_O_HOST and PBS_O_WORKDIR be set to values of the first job rather than the attributes of the second job.
>
> As an example:
>
> [msbritt at nyx ~]$ cd bin
> [msbritt at nyx bin]$ pwd
> /home/msbritt/bin
> [msbritt at nyx bin]$ qsub -I -l nodes=1,walltime=5:00 -q flux -A msbritt_flux -V
> qsub: waiting for job 9445802.nyx.engin.umich.edu to start
> qsub: job 9445802.nyx.engin.umich.edu ready
>
> [msbritt at nyx5515 ~]$ echo $PBS_O_WORKDIR
> /home/msbritt/bin
> [msbritt at nyx5515 ~]$ echo $PBS_O_HOST
> nyx.engin.umich.edu
> [msbritt at nyx5515 ~]$ pwd
> /home/msbritt
>
> [msbritt at nyx5515 ~]$ qsub -I -l nodes=1,walltime=5:00 -q flux -A msbritt_flux -V
> qsub: waiting for job 9445813.nyx.engin.umich.edu to start
> qsub: job 9445813.nyx.engin.umich.edu ready
>
> [msbritt at nyx5623 ~]$ pwd
> /home/msbritt
> [msbritt at nyx5623 ~]$ echo $PBS_O_WORKDIR
> /home/msbritt/bin (arguably should be /home/msbritt)
> [msbritt at nyx5623 ~]$ echo $PBS_O_HOST
> nyx.engin.umich.edu (arguably should be nyx5515)
>

this seems like a bug to me -- it is not the behavior I would expect.

> Should -V not read the PBS_O_* variables on job submission or at least be overwritten and correctly set in the next job, or should -V trump ?

I think all PBS_ variables should be correctly set by TORQUE even if
they are already present in the environment passed by the -V argument

From pregier at ittc.ku.edu Tue Jan 22 14:33:24 2013
From: pregier at ittc.ku.edu (Phil Regier)
Date: Tue, 22 Jan 2013 15:33:24 -0600 (CST)
Subject: [torqueusers] Troubleshooting blcr checkpoint/restart
In-Reply-To: <171387839.10552.1358888560193.JavaMail.root@ittc.ku.edu>
Message-ID: <1544394918.10651.1358890404583.JavaMail.root@ittc.ku.edu>

Hi, all;

I've got a strange issue here which might be a simple misconfig or may be
a deeper issue with our setup or perhaps even some torque code, but I'm
having a difficult time figuring out where to look.

We're trying to implement blcr checkpointing for an existing cluster and
testing on a temporary test cluster; blcr is operational locally on the
one node I'm using to test qhold/qrls, and I've found that jobs often
suspend and resume just fine using a minimally modified (just enough to
work, with a few added log entries) version of the scripts on the website:

http://www.clusterresources.com/torquedocs21/2.6jobcheckpoint.shtml

Occasionally, however, a job will be dropped from the queue after a
restart, perhaps (but not always) leaving its newly restarted process
running unmonitored. Once in a great while, the MOM will crash altogether,
though this is very infrequent.

I scripted a qhold/qrls loop (against a prime sieve and now even
/bin/sleep as my executable) to get a rough idea of an expected time to
live, and was quite surprised to find that depending on the delay between
qhold and qrls invocations, this failure can be very predictable; for
example, with 10 seconds between commands, the job will suspend and resume
successfully for exactly 5 minutes (i.e., exactly 15 total
checkpoint/restart cycles) until the job disappears. This pattern varies
consistently (though not monotonically) with delay time; I can forward the
script and sample measurements if that might be useful.

More to the point, where should I be looking for a root cause when blcr
checkpoint/restart fails intermittently for even the most trivial jobs?
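
For anyone who wants to reproduce this kind of test, a hold/release loop like the one described above could be sketched roughly as follows. This is not the script used for the measurements in this message; the function name hold_release_loop, its arguments, and the max_cycles safety cap are assumptions made for illustration, and it assumes that qstat exits non-zero once the job is no longer known to the server (which may not hold if completed jobs are kept in the queue).

```shell
#!/bin/bash
# Hypothetical sketch: repeatedly hold and release a job, counting how many
# full qhold/qrls cycles succeed before the job disappears from the queue.
# Usage: hold_release_loop <jobid> [delay_seconds] [max_cycles]
hold_release_loop() {
    local jobid="$1"
    local delay="${2:-10}"        # seconds to wait after each command
    local max_cycles="${3:-100}"  # safety cap so the loop always terminates
    local cycles=0

    # Keep cycling while the job is still visible to qstat.
    while (( cycles < max_cycles )) && qstat "$jobid" >/dev/null 2>&1; do
        qhold "$jobid" || break   # hold the job (checkpoint for checkpointable jobs)
        sleep "$delay"
        qrls "$jobid" || break    # release the hold so the job can restart
        sleep "$delay"
        cycles=$((cycles + 1))
    done

    # Report the number of completed cycles.
    echo "$cycles"
}
```

With a delay of 10 seconds each cycle takes about 20 seconds, so 15 cycles corresponds to the 5 minutes mentioned above.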

The server logs appear normal for these tests, and the MOM logs are quite
cryptic; the closest I've found to an indicator would be

01/22/2013 15:19:46;0080; pbs_mom.28548;Job;36320.fusion-devel.local;checking job w/subtask pid=0 (child pid=1601)
01/22/2013 15:19:46;0008; pbs_mom.28548;Job;scan_for_terminated;pid 1601 not tracked, statloc=0, exitval=0
01/22/2013 15:19:46;0080; pbs_mom.28548;Req;dis_request_read;decoding command DeleteJob from PBS_Server
01/22/2013 15:19:46;0100; pbs_mom.28548;Req;;Type DeleteJob request received from PBS_Server at fusion-devel.local, sock=8
01/22/2013 15:19:46;0008; pbs_mom.28548;Job;mom_process_request;request type DeleteJob from host fusion-devel.local received
01/22/2013 15:19:46;0008; pbs_mom.28548;Job;mom_process_request;request type DeleteJob from host fusion-devel.local allowed
01/22/2013 15:19:46;0008; pbs_mom.28548;Job;mom_dispatch_request;dispatching request DeleteJob on sd=8
01/22/2013 15:19:46;0008; pbs_mom.28548;Job;36320.fusion-devel.local;deleting job

...but this doesn't seem to be much to go on (though I don't understand
MOM logs very well). What could possibly cause this behavior, and/or
where would I look for further clues?

Many thanks in advance for any guidance.

Phil Regier
pregier at ittc.ku.edu

From mej at lbl.gov Tue Jan 22 18:23:53 2013
From: mej at lbl.gov (Michael Jennings)
Date: Tue, 22 Jan 2013 17:23:53 -0800
Subject: [torqueusers] Warewulf NHC 1.2.2 Release
Message-ID: <20130123012352.GQ8827@lbl.gov>

Due primarily to a couple pretty important bug fixes, I've gone ahead
and released version 1.2.2 of Warewulf NHC. I'm hoping to get at
least one more release after this one out the door before MoabCon
2013. :-)

Warewulf Node Health Check, for anyone not familiar with it, is an
effort to create a framework and implementation for the node health
check scripts often used by resource managers and schedulers as well
as for periodic independent node sanity checks.
More information and complete documentation may be found at:

http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check

If you want to skip right to the packages, here's the direct link to
the tarball. RPMs for RHEL5 and RHEL6, as well as yum repositories
for each, are available as well.

http://warewulf.lbl.gov/downloads/releases/warewulf-nhc/warewulf-nhc-1.2.2.tar.gz
http://warewulf.lbl.gov/downloads/repo/rhel5/
http://warewulf.lbl.gov/downloads/repo/rhel6/


CHANGES:

- The watchdog timer is more reliable now and should work with all
  supported versions of bash.
- The signal handlers were cleaned up and a couple minor issues
  fixed.
- NEW CHECK: check_hw_mcelog() was implemented as requested and
  discussed on the torqueusers mailing list.
- NEW FEATURE: pdsh-style node range support has been added thanks
  to a patch from John Hanks. You can now
  specify one or more comma-separated node range expressions for
  matching nodes. The whole thing must be surrounded by braces to
  avoid conflicts with globbing syntax. A couple examples:
     {node[00-99]} || check_hw_cpuinfo 2 12 24
     {n0,n3,n[5-8]} || check_hw_cpuinfo 2 8 8
     {n[0-123].cchem} || check_ps_daemon sshd root
  Note that you can't have commas *inside* the brackets, nor can you
  have more than one bracketed range per subexpression.
- NEW FEATURE: LDAP/NIS/NIS+/etc. support. If direct passwd file
  lookup for a userid or UID fails, NHC will now fall back to using
  "getent" to do the resolution. Parsing sources other than
  /etc/passwd is also supported. See the online docs for more
  details.

Please note that the 2 new features listed above are EXPERIMENTAL and
have not been thoroughly tested. Any issues should be reported via
the mailing list or warewulf.lbl.gov bug tracker.

Any/all feedback welcome! Enjoy!
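
To make the node range syntax from the CHANGES list concrete, here is a minimal bash sketch of how such an expression might be expanded into individual host names. It is only an illustration under the restrictions stated above (no commas inside brackets, at most one bracketed range per subexpression) and is not NHC's actual implementation; the function names expand_node_range and expand_node_list are made up for this example.

```shell
#!/bin/bash
# Hypothetical sketch: expand one subexpression such as "n[5-8]" or
# "node[00-99]" into host names, one per line. Names without a bracketed
# range are printed unchanged.
expand_node_range() {
    local expr="$1"
    local re='^([^[]*)\[([0-9]+)-([0-9]+)\](.*)$'
    if [[ $expr =~ $re ]]; then
        local prefix="${BASH_REMATCH[1]}"
        local lo="${BASH_REMATCH[2]}"
        local hi="${BASH_REMATCH[3]}"
        local suffix="${BASH_REMATCH[4]}"
        local width=${#lo}   # keep zero padding, e.g. 00..99
        local i
        # 10# forces base 10 so that values like "08" are not read as octal.
        for (( i = 10#$lo; i <= 10#$hi; i++ )); do
            printf '%s%0*d%s\n' "$prefix" "$width" "$i" "$suffix"
        done
    else
        printf '%s\n' "$expr"
    fi
}

# Expand a brace-wrapped, comma-separated list such as "{n0,n3,n[5-8]}".
expand_node_list() {
    local list="${1#\{}"   # strip the leading brace
    list="${list%\}}"      # strip the trailing brace
    local parts part
    IFS=',' read -r -a parts <<< "$list"
    for part in "${parts[@]}"; do
        expand_node_range "$part"
    done
}
```

For example, `expand_node_list '{n0,n3,n[5-8]}'` prints n0, n3, n5, n6, n7 and n8, one per line; a check would then apply to a node if its host name appears in that expansion.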
:-)

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From msbritt at umich.edu Wed Jan 23 09:37:18 2013
From: msbritt at umich.edu (Matthew Britt)
Date: Wed, 23 Jan 2013 11:37:18 -0500
Subject: [torqueusers] Warewulf NHC 1.2.2 Release
In-Reply-To: <20130123012352.GQ8827@lbl.gov>
References: <20130123012352.GQ8827@lbl.gov>
Message-ID:

Thanks Michael. I've upgraded from 1.2.1 to 1.2.2 and tested on three
different nodes, and the time to run went from an average of .45 seconds
to 3.5-5 seconds (same config). How do you enable debug mode and does it
give timing for each test being done so I can see where the time is going?

Thanks,
- Matt

--------------------------------------------
Matthew Britt
CAEN HPC Group - College of Engineering
msbritt at umich.edu

:-) > > Michael > > -- > Michael Jennings > Senior HPC Systems Engineer > High-Performance Computing Services > Lawrence Berkeley National Laboratory > Bldg 50B-3209E W: 510-495-2687 > MS 050B-3209 F: 510-486-8615 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From shamgpbhu at gmail.com Tue Jan 22 22:58:53 2013 From: shamgpbhu at gmail.com (Radheyshyam Yadav) Date: Wed, 23 Jan 2013 11:28:53 +0530 Subject: [torqueusers] help Message-ID: can any one to help to me for reproduce the result in paper which is i attached if any one help to me then i am highly obliged to you. thanks Radheshyam yadav G & M Division CSIR-National Geophysical Research Institute, Uppal Road, Hyderabad-500 606 (A.P.), INDIA Web: http://www.ngri.org.in/ Mo.+91-7569632501,+91-9696762096 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130123/b26f844e/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: rehelp.zip Type: application/zip Size: 305583 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20130123/b26f844e/attachment-0001.zip From mej at lbl.gov Wed Jan 23 09:59:00 2013 From: mej at lbl.gov (Michael Jennings) Date: Wed, 23 Jan 2013 08:59:00 -0800 Subject: [torqueusers] Warewulf NHC 1.2.2 Release In-Reply-To: References: <20130123012352.GQ8827@lbl.gov> Message-ID: <20130123165859.GW8827@lbl.gov> On Wednesday, 23 January 2013, at 11:37:18 (-0500), Matthew Britt wrote: > Thanks Michael. I've upgraded from 1.2.1 to 1.2.2 and tested on > three different nodes, and the time to run went from an average of > .45 seconds to 3.5-5 seconds (same config). How do you enable debug > mode and does it give timing for each test being done so I can see > where the time is going? 

Do you use LDAP/NIS/etc. or have any userids or UIDs missing from your
passwd file?  Those are the most likely culprits.  While previously
failed lookups would be ignored, it's now using "getent" to try to
resolve them.

Debug mode can be enabled from the config, /etc/sysconfig/nhc, or the
runtime environment by setting DEBUG=1.  It won't give timing info,
though; for that, you might want to use "bash -x /usr/sbin/nhc"
instead.  That should allow you to see where the blocking is
happening.

If it turns out to be the fallback code, it's easy enough to make it
disable-able.  :-)

Thanks for the feedback!

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From bdandrus at nps.edu  Wed Jan 23 12:54:55 2013
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Wed, 23 Jan 2013 19:54:55 +0000
Subject: [torqueusers] PBS environmental variables and -V
In-Reply-To: <6E440804-F6BF-4609-86C8-662DAADEDA82@umich.edu>
References: <6E440804-F6BF-4609-86C8-662DAADEDA82@umich.edu>
Message-ID:

Seems to me that would be by design and you need to be aware of it.
-V basically just takes the output of 'env' and sets anything that is set.
Since you already have PBS_* variables, I would expect them to be set already.

This could be handled by pbs_mom if it were to first set the -V stuff and then set the PBS_* stuff.
You can do this by iterating through and 'unset' all the PBS_* variables before doing the qsub.

I tend to render this moot by highly discouraging the use of -V.
Its use makes it difficult to troubleshoot when folks use things like "./a.out" to run their programs.
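
The workaround of unsetting the PBS_* variables before resubmitting can
be sketched as follows.  This is only an illustration: the function name
clear_pbs_env and the script name job.sh are hypothetical, and the sketch
assumes the variables of interest are exported, since those are the ones
that appear in the output of 'env'.

```shell
# Hypothetical helper: remove every exported PBS_* variable from the
# current shell so that a nested "qsub -V" does not inherit them.
clear_pbs_env() {
    # List the names of exported variables that start with PBS_ ...
    for name in $(env | sed -n 's/^\(PBS_[A-Za-z0-9_]*\)=.*/\1/p'); do
        # ... and drop each one from this shell's environment.
        unset "$name"
    done
}

# Possible use inside an interactive job, before submitting another one
# (run in a subshell so the interactive shell keeps its own PBS_* values):
#   ( clear_pbs_env; qsub -V job.sh )
```

Running it in a subshell, as in the comment, is a design choice: anything
started later in the interactive shell can still see PBS_NODEFILE and
friends.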

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Matthew Britt
> Sent: Tuesday, January 22, 2013 11:22 AM
> To: Torque Users Mailing List
> Subject: [torqueusers] PBS environmental variables and -V
>
> Hello everyone.  What is the expected behavior in precedence between
> PBS environmental variables and -V when a job is submitted from an
> interactive job.  With torque 4.1.4 (and possibly earlier), the variables from
> the environment of the interactive shell are passed along w/ the newly
> submitted job (either interactive or batch).  We've seen both PBS_O_HOST
> and PBS_O_WORKDIR be set to values of the first job rather than the
> attributes of the second job.
>
> As an example:
>
> [msbritt@nyx ~]$ cd bin
> [msbritt@nyx bin]$ pwd
> /home/msbritt/bin
> [msbritt@nyx bin]$ qsub -I -l nodes=1,walltime=5:00 -q flux -A msbritt_flux -V
> qsub: waiting for job 9445802.nyx.engin.umich.edu to start
> qsub: job 9445802.nyx.engin.umich.edu ready
>
> [msbritt@nyx5515 ~]$ echo $PBS_O_WORKDIR
> /home/msbritt/bin
> [msbritt@nyx5515 ~]$ echo $PBS_O_HOST
> nyx.engin.umich.edu
> [msbritt@nyx5515 ~]$ pwd
> /home/msbritt
>
> [msbritt@nyx5515 ~]$ qsub -I -l nodes=1,walltime=5:00 -q flux -A msbritt_flux -V
> qsub: waiting for job 9445813.nyx.engin.umich.edu to start
> qsub: job 9445813.nyx.engin.umich.edu ready
>
> [msbritt@nyx5623 ~]$ pwd
> /home/msbritt
> [msbritt@nyx5623 ~]$ echo $PBS_O_WORKDIR
> /home/msbritt/bin       (arguably should be /home/msbritt)
> [msbritt@nyx5623 ~]$ echo $PBS_O_HOST
> nyx.engin.umich.edu     (arguably should be nyx5515)
>
>
> Should -V not read the PBS_O_* variables on job submission or at least be
> overwritten and correctly set in the next job, or should -V trump ?
> > Thanks, > - Matt > > -------------------------------------------- > Matthew Britt > CAEN HPC Group - College of Engineering > msbritt at umich.edu > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From knielson at adaptivecomputing.com Wed Jan 23 13:40:28 2013 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Wed, 23 Jan 2013 13:40:28 -0700 Subject: [torqueusers] PBS environmental variables and -V In-Reply-To: References: <6E440804-F6BF-4609-86C8-662DAADEDA82@umich.edu> Message-ID: On Wed, Jan 23, 2013 at 12:54 PM, Andrus, Brian Contractor wrote: > Seems to me that would be by design and you need to be aware of it. > -V basically just takes the output of 'env' and sets anything that is set. > Since you already have PBS_* variables, I would expect them to be set > already. > > This could be handled by pbs_mom if it were to first sent the -V stuff and > then set the PBS_* stuff. > You can do this by iterating through and 'unset' all the PBS_* variables > before doing the qsub. > > I tend to rend this moot by highly discouraging the use of -V > It's use makes it difficult to troubleshoot when folks use things like > "./a.out" to run their programs. > > > Brian Andrus > ITACS/Research Computing > Naval Postgraduate School > Monterey, California > voice: 831-656-6238 > > > > > -----Original Message----- > > From: torqueusers-bounces at supercluster.org [mailto:torqueusers- > > bounces at supercluster.org] On Behalf Of Matthew Britt > > Sent: Tuesday, January 22, 2013 11:22 AM > > To: Torque Users Mailing List > > Subject: [torqueusers] PBS environmental variables and -V > > > > Hello everyone. What is the expected behavior in precedence between > > PBS environmental variables and -V when a job is submitted from an > > interactive job. 
With torque 4.1.4 (and possibly earlier), the > variables from > > the environment of the interactive shell are passed along w/ the newly > > submitted job (either interactive or batch). We've seen both PBS_O_HOST > > and PBS_O_WORKDIR be set to values of the first job rather than the > > attributes of the second job. > > > > As an example: > > > > [msbritt at nyx ~]$ cd bin > > [msbritt at nyx bin]$ pwd > > /home/msbritt/bin > > [msbritt at nyx bin]$ qsub -I -l nodes=1,walltime=5:00 -q flux -A > msbritt_flux - > > V > > qsub: waiting for job 9445802.nyx.engin.umich.edu to start > > qsub: job 9445802.nyx.engin.umich.edu ready > > > > [msbritt at nyx5515 ~]$ echo $PBS_O_WORKDIR > > /home/msbritt/bin > > [msbritt at nyx5515 ~]$ echo $PBS_O_HOST > > nyx.engin.umich.edu > > [msbritt at nyx5515 ~]$ pwd > > /home/msbritt > > > > [msbritt at nyx5515 ~]$ qsub -I -l nodes=1,walltime=5:00 -q flux -A > > msbritt_flux -V > > qsub: waiting for job 9445813.nyx.engin.umich.edu to start > > qsub: job 9445813.nyx.engin.umich.edu ready > > > > [msbritt at nyx5623 ~]$ pwd > > /home/msbritt > > [msbritt at nyx5623 ~]$ echo $PBS_O_WORKDIR > > /home/msbritt/bin (arguably should be /home/msbritt) > > [msbritt at nyx5623 ~]$ echo $PBS_O_HOST > > nyx.engin.umich.edu (arguably should be nyx5515) > > > > > > Should -V not read the PBS_O_* variables on job submission or at least be > > overwritten and correctly set in the next job, or should -V trump ? > > > > Thanks, > > - Matt > > > > -------------------------------------------- > > Matthew Britt > > CAEN HPC Group - College of Engineering > > msbritt at umich.edu > > > what happens when the user sets the PBS_O* variables in the environment to what they want and then TORQUE changes all of them? Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130123/61bf60a9/attachment.html
From msbritt at umich.edu  Wed Jan 23 15:30:21 2013
From: msbritt at umich.edu (Matthew Britt)
Date: Wed, 23 Jan 2013 17:30:21 -0500
Subject: [torqueusers] PBS environmental variables and -V
In-Reply-To:
References: <6E440804-F6BF-4609-86C8-662DAADEDA82@umich.edu>
Message-ID:

As a counter argument, the value of PBS_NODEFILE isn't getting set to the
same value for the second job.  Like PBS_NODEFILE, I would expect the
PBS_O variables get set correctly for the submitted job and not the
environment passed into it.  It isn't consistent w/ PBS_O variables
either, as PBS_O_QUEUE is overwritten w/ the correct value (I exported a
different value into PBS_O_QUEUE to check).

fwiw, we're using Environmental Modules, so we set several variables for
software packages, like license servers, process launchers (like hydra),
etc.  The users might not be aware that these variables are necessary, so
we have the users load appropriate module(s) and submit with -V.

If the PBS_O_ variables are treated differently than PBS_ variables,
that's fine; I was curious if it was by design or was a bug.

 - Matt

--------------------------------------------
Matthew Britt
CAEN HPC Group - College of Engineering
msbritt at umich.edu


On Jan 23, 2013, at 2:54 PM, "Andrus, Brian Contractor" wrote:

> Seems to me that would be by design and you need to be aware of it.
> -V basically just takes the output of 'env' and sets anything that is set.
> Since you already have PBS_* variables, I would expect them to be set already.
>
> This could be handled by pbs_mom if it were to first sent the -V stuff and then set the PBS_* stuff.
> You can do this by iterating through and 'unset' all the PBS_* variables before doing the qsub.
>
> I tend to rend this moot by highly discouraging the use of -V
> It's use makes it difficult to troubleshoot when folks use things like "./a.out" to run their programs.
>> >> Thanks, >> - Matt >> >> -------------------------------------------- >> Matthew Britt >> CAEN HPC Group - College of Engineering >> msbritt at umich.edu >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From knielson at adaptivecomputing.com Wed Jan 23 16:11:27 2013 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Wed, 23 Jan 2013 16:11:27 -0700 Subject: [torqueusers] PBS environmental variables and -V In-Reply-To: References: <6E440804-F6BF-4609-86C8-662DAADEDA82@umich.edu> Message-ID: Matt, I guess you would call it a bug. I think it is simply years of maintenance without knowledge of the original intent. Rick made a ticket for this and we will see what it currently does and try to figure out what it should do and then let everyone know. Ken On Wed, Jan 23, 2013 at 3:30 PM, Matthew Britt wrote: > As a counter argument, the value of PBS_NODEFILE isn't getting set to the > same value for the second job. Like PBS_NODEFILE, I would expect the > PBS_O variables get set correctly for the submitted job and not the > environment passed into it. It isn't consistent w/ PBS_O variables either, > as PBS_O_QUEUE is overwritten w/ the correct value (I exported a different > value into PBS_O_QUEUE to check). > > fwiw, we're using Environmental Modules, so we set several variables for > software packages, like license servers, process launchers (like hydra), > etc. The users might not be aware that these variables are necessary, so > we have the users load appropriate module(s) and submit with -V. > > If the PBS_O_ variables are treated differently than PBS_ variables, > that's fine; I was curious if it was by design or was a bug. 
> >> > >> As an example: > >> > >> [msbritt at nyx ~]$ cd bin > >> [msbritt at nyx bin]$ pwd > >> /home/msbritt/bin > >> [msbritt at nyx bin]$ qsub -I -l nodes=1,walltime=5:00 -q flux -A > msbritt_flux - > >> V > >> qsub: waiting for job 9445802.nyx.engin.umich.edu to start > >> qsub: job 9445802.nyx.engin.umich.edu ready > >> > >> [msbritt at nyx5515 ~]$ echo $PBS_O_WORKDIR > >> /home/msbritt/bin > >> [msbritt at nyx5515 ~]$ echo $PBS_O_HOST > >> nyx.engin.umich.edu > >> [msbritt at nyx5515 ~]$ pwd > >> /home/msbritt > >> > >> [msbritt at nyx5515 ~]$ qsub -I -l nodes=1,walltime=5:00 -q flux -A > >> msbritt_flux -V > >> qsub: waiting for job 9445813.nyx.engin.umich.edu to start > >> qsub: job 9445813.nyx.engin.umich.edu ready > >> > >> [msbritt at nyx5623 ~]$ pwd > >> /home/msbritt > >> [msbritt at nyx5623 ~]$ echo $PBS_O_WORKDIR > >> /home/msbritt/bin (arguably should be /home/msbritt) > >> [msbritt at nyx5623 ~]$ echo $PBS_O_HOST > >> nyx.engin.umich.edu (arguably should be nyx5515) > >> > >> > >> Should -V not read the PBS_O_* variables on job submission or at least > be > >> overwritten and correctly set in the next job, or should -V trump ? > >> > >> Thanks, > >> - Matt > >> > >> -------------------------------------------- > >> Matthew Britt > >> CAEN HPC Group - College of Engineering > >> msbritt at umich.edu > >> > >> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130123/2739aed0/attachment-0001.html
From cwest at vpac.org  Wed Jan 23 22:23:24 2013
From: cwest at vpac.org (Craig West)
Date: Thu, 24 Jan 2013 16:23:24 +1100
Subject: [torqueusers] hwloc error torque 4.1.3
In-Reply-To: <20130116013347.GP8827@lbl.gov>
References: <20121210192541.GM8827@lbl.gov> <20121210213613.GN8827@lbl.gov> <9BB8F931-7D31-4C37-B68C-04191B237265@anl.gov> <50EA6348.6060101@vpac.org> <20130116013347.GP8827@lbl.gov>
Message-ID: <5100C54C.8040505@vpac.org>

Michael,

> I'm still unable to reproduce this behavior, but I think I've found
> the cause.
>
> Let me know if that works.  If so, I'll submit it upstream to Adaptive.
>
> It's poorly documented, but if PKG_CHECK_MODULES() is invoked for the
> first time inside a conditional, PKG_PROG_PKG_CONFIG must be invoked
> unconditionally prior to any PKG_CHECK_MODULES() calls.

Testing this is not simple. Your patch doesn't modify the "configure"
script, which means I need to run autogen.sh. But I already know that if
I run autogen.sh then I won't have the problem.

It is worth noting that when you run the configure script, if it fails
to run fully due to hwloc, you get the following message:

===== QUOTE =====
checking for HWLOC... configure: error: cpuset support requires the hwloc package
This can be solved by configuring with --with-hwloc-path=.
This path should be the path to the directory containing the lib/ and
include/ directories for your version of hwloc.
Another option is adding the directory containing 'hwloc.pc' to the
PKG_CONFIG_PATH environment variable.
If you have done these and still get this error, try running ./autogen.sh
and then configuring again.
===== QUOTE =====

I am specifying the --with-hwloc-path, and if I use the same configure
script after running autogen.sh, then it configures successfully.
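
The recovery steps named in that configure message can be sketched as a
small shell function.  This is an illustration under assumptions, not
part of TORQUE: the function name configure_with_hwloc is hypothetical,
the hwloc prefix is supplied by the caller, and looking for hwloc.pc
under lib/pkgconfig is a common convention rather than something the
message itself states.

```shell
# Hypothetical wrapper, meant to be run from the top of the TORQUE
# source tree.  $1 is the hwloc install prefix (the directory holding
# lib/ and include/); remaining arguments are passed to ./configure.
configure_with_hwloc() {
    hwloc_prefix=$1
    shift

    # Assumption: hwloc.pc lives in <prefix>/lib/pkgconfig.
    PKG_CONFIG_PATH="$hwloc_prefix/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
    export PKG_CONFIG_PATH

    # First attempt with the configure script as shipped.
    if ./configure --with-hwloc-path="$hwloc_prefix" "$@"; then
        return 0
    fi

    # Last resort named in the error text: regenerate configure, retry.
    echo "configure failed; running ./autogen.sh and retrying" >&2
    ./autogen.sh && ./configure --with-hwloc-path="$hwloc_prefix" "$@"
}
```

Note that the function leaves PKG_CONFIG_PATH exported in the calling
shell, which may or may not be what you want for later builds.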

Is it possible that the configure script provided with torque 4.1.4 is
not up to date, and that is what is causing the issue?

Thanks,
Craig.

From glen.beane at gmail.com  Thu Jan 24 09:36:57 2013
From: glen.beane at gmail.com (Glen Beane)
Date: Thu, 24 Jan 2013 11:36:57 -0500
Subject: [torqueusers] remote submit hosts, job naming issues
Message-ID:

Hi everyone,

We have a few virtual machines that run some applications that submit
jobs to our cluster. These applications are using the pbs_python
library, which works fine. However, I've noticed some annoying behavior
using the TORQUE clients to manage jobs from that submit host.

gbeane@submit_host:~> echo "sleep 600" | qsub -l nodes=1:ppn=1,walltime=00:10:00
251428.cluster_name
gbeane@submit_host::~> qdel 251428
qdel: Unknown Job Id 251428.cluster_name.mydomain.tld
gbeane@submit_host::~> qdel 251428.cluster_name
qdel: Unknown Job Id 251428.cluster_name.mydomain.tld
gbeane@submit_host::~> qdel 251428.cluster_name@cluster_name
gbeane@submit_host::~>

When the clients were compiled, the hostname (short name) of our cluster
head node was set as the default server.

Does anyone know of a way to get around this? Ideally I would like
"qdel 251428" to work, or at least "qdel 251428.cluster_name".

From deepak_shingan at persistent.co.in  Thu Jan 24 04:04:41 2013
From: deepak_shingan at persistent.co.in (Deepak Shingan)
Date: Thu, 24 Jan 2013 11:04:41 +0000
Subject: [torqueusers] As part of POC can we use torque 4.2.0 for single machine acting as both server and node
Message-ID: <2F92A153181B8740B9ADB294A1A2918E5DA562A7@HJ-MBX2.persistent.co.in>

Hi,

I am new to torque and want to do a POC around a requirement, but I
don't have a cluster set up. Is it possible to run torque 4.2.0 on a
single machine with multiple cores, acting as both server and compute
node? If yes, what changes do I need to make for the setup?
Regards, -Deepak DISCLAIMER ========== This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130124/108424e9/attachment.html From biozit.listas at gmail.com Thu Jan 24 05:45:30 2013 From: biozit.listas at gmail.com (Fabio Andrijauskas) Date: Thu, 24 Jan 2013 10:45:30 -0200 Subject: [torqueusers] Job can't find the execubtale Message-ID: Hi all, I am trying to submit a job to my new cluster with the script: #!/bin/bash #PBS -l walltime=00:5:00 #PBS -l nice=19 #PBS -q linux-spool #PBS -e 2.err #PBS -o 1.out runrun the 2.err: /var/spool/torque/mom_priv/ jobs/77.server.SC: line 8: runrun: command not found But, to be successful in the process is necessary to copy the software (runrun) to the nodes manually . Is mandatory to use NFS with torque ? -- Fabio Andrijauskas http://www.marvinproject.org http://fabio.andrijauskas.com.br/ C?us limpos...... -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130124/d6a63e7f/attachment.html From dbeer at adaptivecomputing.com Thu Jan 24 10:18:57 2013 From: dbeer at adaptivecomputing.com (David Beer) Date: Thu, 24 Jan 2013 10:18:57 -0700 Subject: [torqueusers] As part of POC can we use torque 4.2.0 for single machine acting as both server and node In-Reply-To: <2F92A153181B8740B9ADB294A1A2918E5DA562A7@HJ-MBX2.persistent.co.in> References: <2F92A153181B8740B9ADB294A1A2918E5DA562A7@HJ-MBX2.persistent.co.in> Message-ID: It is possible. You only need to have the mom_priv directory and the server_priv directory set up properly, and then list that host in the nodes file. David On Thu, Jan 24, 2013 at 4:04 AM, Deepak Shingan < deepak_shingan at persistent.co.in> wrote: > Hi, > > I am new to torque and wanted to do some POC around some requirement but I > dont have cluster setup. So is it possible to run torque 4.2.0 on single > machine with multiple core acting as both server and compute node. If yes > what are the changes I need to do for setup? > > Regards, > -Deepak > > DISCLAIMER ========== This e-mail may contain privileged and confidential > information which is the property of Persistent Systems Ltd. It is intended > only for the use of the individual or entity to which it is addressed. If > you are not the intended recipient, you are not authorized to read, retain, > copy, print, distribute or use this message. If you have received this > communication in error, please notify the sender and delete all copies of > this message. Persistent Systems Ltd. does not accept any liability for > virus infected mails. > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130124/0692d0f7/attachment.html From knielson at adaptivecomputing.com Thu Jan 24 10:21:43 2013 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 24 Jan 2013 10:21:43 -0700 Subject: [torqueusers] As part of POC can we use torque 4.2.0 for single machine acting as both server and node In-Reply-To: References: <2F92A153181B8740B9ADB294A1A2918E5DA562A7@HJ-MBX2.persistent.co.in> Message-ID: On Thu, Jan 24, 2013 at 10:18 AM, David Beer wrote: > It is possible. You only need to have the mom_priv directory and the > server_priv directory set up properly, and then list that host in the nodes > file. > > David > > On Thu, Jan 24, 2013 at 4:04 AM, Deepak Shingan < > deepak_shingan at persistent.co.in> wrote: > >> Hi, >> >> I am new to torque and wanted to do some POC around some requirement but >> I dont have cluster setup. So is it possible to run torque 4.2.0 on single >> machine with multiple core acting as both server and compute node. If yes >> what are the changes I need to do for setup? >> >> Regards, >> -Deepak >> >> DISCLAIMER ========== This e-mail may contain privileged and confidential >> information which is the property of Persistent Systems Ltd. It is intended >> only for the use of the individual or entity to which it is addressed. If >> you are not the intended recipient, you are not authorized to read, retain, >> copy, print, distribute or use this message. If you have received this >> communication in error, please notify the sender and delete all copies of >> this message. Persistent Systems Ltd. does not accept any liability for >> virus infected mails. >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > Deepak, Please note that 4.2.0 is not yet ready for general release. 
You should not have any problems with your POC, but there are still some changes going into this code base. Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130124/14123bd1/attachment.html From tegner at renget.se Thu Jan 24 10:24:11 2013 From: tegner at renget.se (Jon Tegner) Date: Thu, 24 Jan 2013 18:24:11 +0100 Subject: [torqueusers] As part of POC can we use torque 4.2.0 for single machine acting as both server and node In-Reply-To: <2F92A153181B8740B9ADB294A1A2918E5DA562A7@HJ-MBX2.persistent.co.in> References: <2F92A153181B8740B9ADB294A1A2918E5DA562A7@HJ-MBX2.persistent.co.in> Message-ID: <51016E3B.7040801@renget.se> Hi, why not setting up a few virtual boxes? That way you can get a more "cluster like" setup. Regards, /jon On 01/24/2013 12:04 PM, Deepak Shingan wrote: > Hi, > > I am new to torque and wanted to do some POC around some requirement > but I dont have cluster setup. So is it possible to run torque 4.2.0 > on single machine with multiple core acting as both server and compute > node. If yes what are the changes I need to do for setup? > > Regards, > -Deepak > > DISCLAIMER ========== This e-mail may contain privileged and > confidential information which is the property of Persistent Systems > Ltd. It is intended only for the use of the individual or entity to > which it is addressed. If you are not the intended recipient, you are > not authorized to read, retain, copy, print, distribute or use this > message. If you have received this communication in error, please > notify the sender and delete all copies of this message. Persistent > Systems Ltd. does not accept any liability for virus infected mails. > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130124/3acec736/attachment-0001.html
From jjc at iastate.edu  Thu Jan 24 12:20:04 2013
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Thu, 24 Jan 2013 19:20:04 +0000
Subject: [torqueusers] Job can't find the execubtale
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E2210FA357D3@ITSDAG1D.its.iastate.edu>

Fabio,

How else is it going to find your executable?

It is very common to use a common filesystem, such as NFS or Lustre.
You can copy the executable, but you also need all data, etc.

I'd also put

    cd $PBS_O_WORKDIR

in the script before runrun, and probably also use
./runrun in case ./ is not in your path.


James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/


From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Fabio Andrijauskas
Sent: Thursday, January 24, 2013 6:46 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] Job can't find the execubtale

Hi all,

I am trying to submit a job to my new cluster with the script:

#!/bin/bash
#PBS -l walltime=00:5:00
#PBS -l nice=19
#PBS -q linux-spool
#PBS -e 2.err
#PBS -o 1.out

runrun

the 2.err: /var/spool/torque/mom_priv/jobs/77.server.SC: line 8: runrun: command not found

But, to be successful in the process is necessary to copy the software
(runrun) to the nodes manually. Is mandatory to use NFS with torque ?

--
Fabio Andrijauskas
http://www.marvinproject.org
http://fabio.andrijauskas.com.br/
Céus limpos......
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130124/fd06d4e2/attachment.html

From biozit.listas at gmail.com Thu Jan 24 12:38:55 2013
From: biozit.listas at gmail.com (Fabio Andrijauskas)
Date: Thu, 24 Jan 2013 17:38:55 -0200
Subject: [torqueusers] Job can't find the execubtale
In-Reply-To: <242421BFAF465844BE24EB90BB97E2210FA357D3@ITSDAG1D.its.iastate.edu>
References: <242421BFAF465844BE24EB90BB97E2210FA357D3@ITSDAG1D.its.iastate.edu>
Message-ID:

Thank you very much

2013/1/24 Coyle, James J [ITACD]

> Fabio,
>
> How else is the job going to find your executable?
>
> It is very common to use a shared filesystem, such as NFS or Lustre.
> You can copy the executable to the nodes instead, but you also need all the data, etc.
>
> I'd also put
>
> cd $PBS_O_WORKDIR
>
> in the script before runrun, and probably also use
> ./runrun in case ./ is not in your path.
>
> James Coyle, PhD
> High Performance Computing Group
> Iowa State Univ.
> web: http://jjc.public.iastate.edu/
>
> From: torqueusers-bounces at supercluster.org [mailto:
> torqueusers-bounces at supercluster.org] On Behalf Of Fabio Andrijauskas
> Sent: Thursday, January 24, 2013 6:46 AM
> To: torqueusers at supercluster.org
> Subject: [torqueusers] Job can't find the execubtale
>
> Hi all,
>
> I am trying to submit a job to my new cluster with the script:
>
> #!/bin/bash
> #PBS -l walltime=00:5:00
> #PBS -l nice=19
> #PBS -q linux-spool
> #PBS -e 2.err
> #PBS -o 1.out
>
> runrun
>
> the 2.err:
> /var/spool/torque/mom_priv/jobs/77.server.SC: line 8: runrun: command not found
>
> But for the job to succeed, I have to copy the software
> (runrun) to the nodes manually. Is it mandatory to use NFS with torque?
>
> --
> Fabio Andrijauskas
> http://www.marvinproject.org
> http://fabio.andrijauskas.com.br/
> Clear skies......

--
Fabio Andrijauskas
http://www.marvinproject.org
http://fabio.andrijauskas.com.br/
Clear skies......

From msbritt at umich.edu Thu Jan 24 13:26:14 2013
From: msbritt at umich.edu (Matt Britt)
Date: Thu, 24 Jan 2013 15:26:14 -0500
Subject: [torqueusers] Warewulf NHC 1.2.2 Release
In-Reply-To: <20130123165859.GW8827@lbl.gov>
References: <20130123012352.GQ8827@lbl.gov> <20130123165859.GW8827@lbl.gov>
Message-ID:

Thanks Michael - that got me pointed in the right direction. We're just using /etc/passwd, and it should be up to date. The function using the time was 'check_ps_daemon sshd root':

[root at nyx5506 msbritt]# time nhc (with check_ps_daemon)

real 0m5.785s
user 0m5.565s
sys 0m0.101s
[root at nyx5506 msbritt]# !vim
vim /etc/nhc/nhc.conf
[root at nyx5506 msbritt]# time nhc (without check_ps_daemon)

real 0m0.185s
user 0m0.109s
sys 0m0.055s

Thanks again,
- Matt

On Wed, Jan 23, 2013 at 11:59 AM, Michael Jennings wrote:

> On Wednesday, 23 January 2013, at 11:37:18 (-0500),
> Matthew Britt wrote:
>
> > Thanks Michael. I've upgraded from 1.2.1 to 1.2.2 and tested on
> > three different nodes, and the time to run went from an average of
> > .45 seconds to 3.5-5 seconds (same config). How do you enable debug
> > mode and does it give timing for each test being done so I can see
> > where the time is going?
>
> Do you use LDAP/NIS/etc. or have any userids or UIDs missing from your
> passwd file? Those are the most likely culprit.
While previously > failed lookups would be ignored, it's now using "getent" to try to > resolve them. > > Debug mode can be enabled from the config, /etc/sysconfig/nhc, or the > runtime environment by setting DEBUG=1. It won't give timing info, > though; for that, you might want to use "bash -x /usr/sbin/nhc" > instead. That should allow you to see where the blocking is > happening. > > If it turns out to be the fallback code, it's easy enough to make it > disable-able. :-) > > Thanks for the feedback! > > Michael > > -- > Michael Jennings > Senior HPC Systems Engineer > High-Performance Computing Services > Lawrence Berkeley National Laboratory > Bldg 50B-3209E W: 510-495-2687 > MS 050B-3209 F: 510-486-8615 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130124/a30eeb2b/attachment-0001.html From mej at lbl.gov Thu Jan 24 13:51:22 2013 From: mej at lbl.gov (Michael Jennings) Date: Thu, 24 Jan 2013 12:51:22 -0800 Subject: [torqueusers] Warewulf NHC 1.2.2 Release In-Reply-To: References: <20130123012352.GQ8827@lbl.gov> <20130123165859.GW8827@lbl.gov> Message-ID: <20130124205121.GK8827@lbl.gov> On Thursday, 24 January 2013, at 15:26:14 (-0500), Matt Britt wrote: > Thanks Michael - that got me pointed in the right direction. We're > just using /etc/passwd, and it should be up to date. The function > using the time was 'check_ps_daemon sshd root': > > [root at nyx5506 msbritt]# time nhc (with check_ps_daemon) > > > > > > > real 0m5.785s > user 0m5.565s > sys 0m0.101s > [root at nyx5506 msbritt]# !vim > vim /etc/nhc/nhc.conf > [root at nyx5506 msbritt]# time nhc (without check_ps_daemon) > > real 0m0.185s > user 0m0.109s > sys 0m0.055s Wow, that's quite a difference. 
:-)

Is that the only check_ps_* check in your configuration? I'm guessing it is based on the time delay.

What happens is this: the first time you use one of the process-based checks, NHC will run the "ps" command to gather information on all your system processes. This can, as you're seeing, take quite a bit of time on a heavily-loaded compute node. However, it only needs to do this once; if you use one ps-based check, you can use as many as you want because you've already "taken the hit" of the subprocess overhead. Subsequent checks will use the cached data instead of launching "ps" again.

Glad you found the culprit! NHC tries to be as efficient as possible in everything it does, but it's up to each site to determine how they want to balance the tradeoffs between longer/shorter execution time for NHC and more/less comprehensive assessments of node health. I tried to make it as easy as possible to measure and evaluate those tradeoffs; hopefully I succeeded. :-)

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From msbritt at umich.edu Thu Jan 24 13:56:18 2013
From: msbritt at umich.edu (Matt Britt)
Date: Thu, 24 Jan 2013 15:56:18 -0500
Subject: [torqueusers] Warewulf NHC 1.2.2 Release
In-Reply-To: <20130124205121.GK8827@lbl.gov>
References: <20130123012352.GQ8827@lbl.gov> <20130123165859.GW8827@lbl.gov> <20130124205121.GK8827@lbl.gov>
Message-ID:

It is the only check_ps we're using, but after your explanation, I'm going to stick more in :)

Thanks again,
- Matt

On Thu, Jan 24, 2013 at 3:51 PM, Michael Jennings wrote:

> On Thursday, 24 January 2013, at 15:26:14 (-0500),
> Matt Britt wrote:
>
> > Thanks Michael - that got me pointed in the right direction. We're
> > just using /etc/passwd, and it should be up to date.
The function > > using the time was 'check_ps_daemon sshd root': > > > > [root at nyx5506 msbritt]# time nhc (with check_ps_daemon) > > > > > > > > > > > > > > real 0m5.785s > > user 0m5.565s > > sys 0m0.101s > > [root at nyx5506 msbritt]# !vim > > vim /etc/nhc/nhc.conf > > [root at nyx5506 msbritt]# time nhc (without check_ps_daemon) > > > > real 0m0.185s > > user 0m0.109s > > sys 0m0.055s > > Wow, that's quite a difference. :-) > > Is that the only check_ps_* check in your configuration? I'm guessing > it is based on the time delay. > > What happens is this: the first time you use one of the process-based > checks, NHC will run the "ps" command to gather information on all > your system processes. This can, as you're seeing, take quite a bit > of time on a heavily-loaded compute node. However, it only needs to > do this once; if you use one ps-based check, you can use as many as > you want because you've already "taken the hit" of the subprocess > overhead. Subsequent checks will used the cached data instead of > launching "ps" again. > > Glad you found the culprit! NHC tries to be as efficient as possible > in everything it does, but it's up to each site to determine how they > want to balance the tradeoffs between longer/shorter execution time > for NHC and more/less comprehensive assessments of node health. I > tried to make it as easy as possible to measure and evaluate those > tradeoffs; hopefully I succeeded. :-) > > Michael > > -- > Michael Jennings > Senior HPC Systems Engineer > High-Performance Computing Services > Lawrence Berkeley National Laboratory > Bldg 50B-3209E W: 510-495-2687 > MS 050B-3209 F: 510-486-8615 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130124/ed736365/attachment.html From dbeer at adaptivecomputing.com Thu Jan 24 15:08:39 2013 From: dbeer at adaptivecomputing.com (David Beer) Date: Thu, 24 Jan 2013 15:08:39 -0700 Subject: [torqueusers] remote submit hosts, job naming issues In-Reply-To: References: Message-ID: Glen, Which version of TORQUE is this? This seems like a bug that was fixed at some point. David On Thu, Jan 24, 2013 at 9:36 AM, Glen Beane wrote: > Hi everyone, > > We have a few virtual machines that run some applications that submit > jobs to our cluster. These applications are using the pbs_python > library, which works fine. However, I've noticed some annoying > behavior using the TORQUE clients to manage jobs from that submit > host. > > > gbeane at submit_host:~> echo "sleep 600" | qsub -l > nodes=1:ppn=1,walltime=00:10:00 > 251428.cluster_name > gbeane at submit_host::~> qdel 251428 > qdel: Unknown Job Id 251428.cluster_name.mydomain.tld > gbeane at submit_host::~> qdel 251428.cluster_name > qdel: Unknown Job Id 251428.cluster_name.mydomain.tld > gbeane at submit_host::~> qdel 251428.cluster_name at cluster_name > gbeane at submit_host::~> > > > When the clients were compiled the hostname (short name) of our > cluster head node was set as the default server. Does anyone know of > a way to get around this? Ideally I would like "qdel 251428" to work, > or at least "qdel 251428.cluster_name". > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130124/1c12f355/attachment.html

From l.flis at cyf-kr.edu.pl Thu Jan 24 15:57:07 2013
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Thu, 24 Jan 2013 23:57:07 +0100
Subject: [torqueusers] torque: 2.5.x and blocking sockets
Message-ID: <5101BC43.8070809@cyf-kr.edu.pl>

Hi All,

This is a question for experienced torque developers, before I break something by fixing it wrong.

We are facing pbs_server lockups from time to time. In such a case a server restart is required to make it work again.

We are running torque 2.5.12.

strace of the locked pbs_server process shows an unfinished write() syscall which waits this way forever.

gdb backtrace tells more:

warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fffcf371000
0x0000003b300c6860 in __write_nocancel () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003b300c6860 in __write_nocancel () from /lib64/libc.so.6
#1  0x00000033dcc13f67 in write_nonblocking_socket (fd=165, buf=0x283a9ac4, count=10205738) at ../Libifl/nonblock.c:31
#2  0x00000033dcc1f7ef in DIS_tcp_wflush (fd=165) at ../Libifl/tcp_dis.c:377
#3  0x000000000042869a in dis_reply_write (sfds=165, preply=0x134e3748) at reply_send.c:188
#4  0x0000000000428815 in reply_send (request=0x134e32d0) at reply_send.c:283
#5  0x0000000000443aea in req_stat_job_step2 (cntl=0x1089a940) at req_stat.c:725
#6  0x0000000000442ebd in req_stat_job (preq=0x134e32d0) at req_stat.c:308
#7  0x00000000004272f5 in dispatch_request (sfds=165, request=0x134e32d0) at process_request.c:984
#8  0x000000000042701e in process_request (sfds=165) at process_request.c:730
#9  0x00000033dcc2cdca in wait_request (waittime=1, SState=0x747418) at ../Libnet/net_server.c:508
#10 0x0000000000423b6c in main_loop () at pbsd_main.c:1203
#11 0x0000000000424b07 in main (argc=3, argv=0x7fffcf288e58) at pbsd_main.c:1802
(gdb)

To me it looks like the write_nonblocking_socket function is sometimes handed a blocking socket for some reason and then waits on write forever.

In such a case the timeout check doesn't work, because it is built on the assumption that write never blocks here.

This way a misbehaving client with an unstable network or firewall problems can cause pbs_server to stop serving requests.

I see no checks in this function for whether the socket is blocking or not. On the other hand, read_nonblocking_socket does such checks and turns O_NONBLOCK on if necessary.

As write_nonblocking_socket is a quite frequently used piece of code, I would like to ask whether changing the socket's blocking mode inside the function may break some functionality or slow things down significantly?

Can anyone confirm that such pbs_server hangs have been seen around the globe?

PS> I would like to remind you of the unofficial #torque IRC channel on the freenode network, where you can sometimes get quick help :)

Best Regards
--
Lukasz Flis

From glen.beane at gmail.com Thu Jan 24 16:44:01 2013
From: glen.beane at gmail.com (glen.beane at gmail.com)
Date: Thu, 24 Jan 2013 18:44:01 -0500
Subject: [torqueusers] remote submit hosts, job naming issues
In-Reply-To:
References:
Message-ID:

This is 2.5. Glad to know it may be fixed, our sysadmin is upgrading to torque 4 next week.

Sent from my iPhone

On Jan 24, 2013, at 5:08 PM, David Beer wrote:

> Glen,
>
> Which version of TORQUE is this? This seems like a bug that was fixed at some point.
>
> David
>
> On Thu, Jan 24, 2013 at 9:36 AM, Glen Beane wrote:
>> Hi everyone,
>>
>> We have a few virtual machines that run some applications that submit
>> jobs to our cluster. These applications are using the pbs_python
>> library, which works fine. However, I've noticed some annoying
>> behavior using the TORQUE clients to manage jobs from that submit
>> host.
>> >> >> gbeane at submit_host:~> echo "sleep 600" | qsub -l nodes=1:ppn=1,walltime=00:10:00 >> 251428.cluster_name >> gbeane at submit_host::~> qdel 251428 >> qdel: Unknown Job Id 251428.cluster_name.mydomain.tld >> gbeane at submit_host::~> qdel 251428.cluster_name >> qdel: Unknown Job Id 251428.cluster_name.mydomain.tld >> gbeane at submit_host::~> qdel 251428.cluster_name at cluster_name >> gbeane at submit_host::~> >> >> >> When the clients were compiled the hostname (short name) of our >> cluster head node was set as the default server. Does anyone know of >> a way to get around this? Ideally I would like "qdel 251428" to work, >> or at least "qdel 251428.cluster_name". >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > > > -- > David Beer | Senior Software Engineer > Adaptive Computing > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130124/db514c78/attachment-0001.html From dbeer at adaptivecomputing.com Thu Jan 24 16:49:46 2013 From: dbeer at adaptivecomputing.com (David Beer) Date: Thu, 24 Jan 2013 16:49:46 -0700 Subject: [torqueusers] remote submit hosts, job naming issues In-Reply-To: References: Message-ID: Glen, Which 2.5? It may be fixed there. David On Thu, Jan 24, 2013 at 4:44 PM, wrote: > This is 2.5. Glad to know it may be fixed, our sysadmin is upgrading to > torque 4 next week. > > Sent from my iPhone > > On Jan 24, 2013, at 5:08 PM, David Beer > wrote: > > Glen, > > Which version of TORQUE is this? This seems like a bug that was fixed at > some point. 
> > David > > On Thu, Jan 24, 2013 at 9:36 AM, Glen Beane wrote: > >> Hi everyone, >> >> We have a few virtual machines that run some applications that submit >> jobs to our cluster. These applications are using the pbs_python >> library, which works fine. However, I've noticed some annoying >> behavior using the TORQUE clients to manage jobs from that submit >> host. >> >> >> gbeane at submit_host:~> echo "sleep 600" | qsub -l >> nodes=1:ppn=1,walltime=00:10:00 >> 251428.cluster_name >> gbeane at submit_host::~> qdel 251428 >> qdel: Unknown Job Id 251428.cluster_name.mydomain.tld >> gbeane at submit_host::~> qdel 251428.cluster_name >> qdel: Unknown Job Id 251428.cluster_name.mydomain.tld >> gbeane at submit_host::~> qdel 251428.cluster_name at cluster_name >> gbeane at submit_host::~> >> >> >> When the clients were compiled the hostname (short name) of our >> cluster head node was set as the default server. Does anyone know of >> a way to get around this? Ideally I would like "qdel 251428" to work, >> or at least "qdel 251428.cluster_name". >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > > > -- > David Beer | Senior Software Engineer > Adaptive Computing > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Senior Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130124/ccf57b6c/attachment.html From mej at lbl.gov Thu Jan 24 18:39:15 2013 From: mej at lbl.gov (Michael Jennings) Date: Thu, 24 Jan 2013 17:39:15 -0800 Subject: [torqueusers] hwloc error torque 4.1.3 In-Reply-To: <5100C54C.8040505@vpac.org> References: <20121210192541.GM8827@lbl.gov> <20121210213613.GN8827@lbl.gov> <9BB8F931-7D31-4C37-B68C-04191B237265@anl.gov> <50EA6348.6060101@vpac.org> <20130116013347.GP8827@lbl.gov> <5100C54C.8040505@vpac.org> Message-ID: <20130125013914.GO8827@lbl.gov> On Thursday, 24 January 2013, at 16:23:24 (+1100), Craig West wrote: > Testing this is not simple. Your patch doesn't modify the > "configure" script, which means I need to run autogen.sh. But I > already know that if I run autogen.sh then I won't have the problem. Yep, I'm basically in the same boat. I can't test it either. :-/ But according to the documentation, it's the correct fix, so I've pushed it upstream (at least for 4.2.0). Hopefully it won't come up in future releases. Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From deepak_shingan at persistent.co.in Thu Jan 24 21:00:32 2013 From: deepak_shingan at persistent.co.in (Deepak Shingan) Date: Fri, 25 Jan 2013 04:00:32 +0000 Subject: [torqueusers] Can we use torque 4.2.0 for single machine acting as both server and node Message-ID: <2F92A153181B8740B9ADB294A1A2918E5DA56A50@HJ-MBX2.persistent.co.in> Hi, I am new to torque and wanted to do some proof of concept around some requirement but I don't have cluster setup. So is it possible to run torque 4.2.0 on single machine with multiple core acting as both server and compute node. If yes what are the changes I need to do for setup? 
Regards,
-Deepak

From paradisehit at gmail.com Fri Jan 25 02:19:31 2013
From: paradisehit at gmail.com (shixing)
Date: Fri, 25 Jan 2013 17:19:31 +0800
Subject: [torqueusers] How to set queue's max nodes?
Message-ID:

Hi, all:

Recently I set up a cluster with 200 nodes. The cluster is meant to serve several teams, and I want to split the nodes among some queues.
I have set the queue attribute like this:

set queue team1 resources_available.nodect = 3

But when I submit a job requesting more than 3 nodes, it still runs successfully. The submit command is:

echo "sleep 100" | qsub -l nodes=4 -q team1

So how can I set the maximum number of nodes for a queue?

--
Best wishes!
My Friend~

From paradisehit at gmail.com Fri Jan 25 03:13:17 2013
From: paradisehit at gmail.com (shixing)
Date: Fri, 25 Jan 2013 18:13:17 +0800
Subject: [torqueusers] some queue does not contain any compute node to run job
In-Reply-To:
References:
Message-ID:

Hi,

I found that setting acl_hosts without also setting acl_host_enable causes this problem. Thanks.
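
For readers hitting the same symptom, here is a minimal sketch of how the two queue attributes could be inspected and adjusted with qmgr. It assumes the queue is named batch2 (the name used in the original report) and that the commands run on the pbs_server host with manager rights. The message above does not say which value acl_host_enable ended up with, so the value shown is only a placeholder to adapt.

```shell
# Show the queue's current attributes, including any acl_hosts /
# acl_host_enable settings. "batch2" is the queue from the report.
qmgr -c "list queue batch2"

# Either set acl_host_enable explicitly (True here is an assumption;
# choose the value that matches what the host list is meant to do) ...
qmgr -c "set queue batch2 acl_host_enable = True"

# ... or remove the host list if it was never meant to restrict the queue.
qmgr -c "unset queue batch2 acl_hosts"
```

After a change like this the scheduler may need a moment (or a restart) before it picks up the new queue settings.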
On Tue, Nov 13, 2012 at 9:11 PM, shixing wrote:

> Hi, folks:
> Today I restart maui, I found that Jon in queue "batch2" can't sched
> to run, when I find there are such log in maui.log:
>
> 11/13 20:39:35 INFO: job 11260 rejected, partition DEFAULT
> (classes not supported '[batch2 1:0]')
>
> While the whole cluster's nodes are 199*24cpu. But queue batch2 does
> not contain any nodes.
>
> My torque is 2.5.11, and maui is 3.3.1.
>
> The configure are in the attachment.
> --
> Best wishes!
> My Friend~
>

--
Best wishes!
My Friend~

From andre.gemuend at scai.fraunhofer.de Fri Jan 25 03:44:35 2013
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Fri, 25 Jan 2013 11:44:35 +0100 (CET)
Subject: [torqueusers] How to set queue's max nodes?
In-Reply-To:
Message-ID: <547764224.1285948.1359110675563.JavaMail.root@scai.fraunhofer.de>

I don't think that setting resources_available.nodect has any effect.
Have you tried resources_max.nodect?

Greetings

----- Original Message -----
>
> Hi, all:
> Recently I have set up a cluster with 200 nodes. And this cluster is
> designed for serving several team. And I want split the nodes to
> some queues.
> I have set the queue attr like this:
> set queue team1 resources_available.nodect = 3
>
> But when I submit a job applying nodes >3, it will also run
> successfully. The submit command is :
> echo "sleep 100" | qsub -l nodes=4 -q team1
>
> So how can I set the max nodes for the queues?
> --
> Best wishes!
> My Friend~
>

--
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

From gus at ldeo.columbia.edu Fri Jan 25 10:51:48 2013
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 25 Jan 2013 12:51:48 -0500
Subject: [torqueusers] How to set queue's max nodes?
In-Reply-To: <547764224.1285948.1359110675563.JavaMail.root@scai.fraunhofer.de>
References: <547764224.1285948.1359110675563.JavaMail.root@scai.fraunhofer.de>
Message-ID: <5102C634.40803@ldeo.columbia.edu>

Hi Shixing

Have you tried:

set queue myqueue resources_max.nodes = 10

More info:
http://docs.adaptivecomputing.com/torque/4-1-3/help.htm#topics/4-serverPolicies/queueAttributes.htm
under "Assigning queue resource limits".

I hope this helps,
Gus

On 01/25/2013 05:44 AM, André Gemünd wrote:
> I don't think that setting resources_available.nodect has any effect.
> Have you tried resources_max.nodect?
>
> Greetings
>
> ----- Original Message -----
>>
>> Hi, all:
>> Recently I have set up a cluster with 200 nodes. And this cluster is
>> designed for serving several team. And I want split the nodes to
>> some queues.
>> I have set the queue attr like this:
>> set queue team1 resources_available.nodect = 3
>>
>> But when I submit a job applying nodes>3, it will also run
>> successfully. The submit command is :
>> echo "sleep 100" | qsub -l nodes=4 -q team1
>>
>> So how can I set the max nodes for the queues?
>> --
>> Best wishes!
>> My Friend~

From toth at fi.muni.cz Fri Jan 25 15:15:37 2013
From: toth at fi.muni.cz ("Mgr. Šimon Tóth")
Date: Fri, 25 Jan 2013 23:15:37 +0100
Subject: [torqueusers] [torquedev] torque: 2.5.x and blocking sockets
In-Reply-To: <5101BC43.8070809@cyf-kr.edu.pl>
References: <5101BC43.8070809@cyf-kr.edu.pl>
Message-ID: <51030409.1070802@fi.muni.cz>

> Can anyone confirm that such pbs_server hangs have been seen around the globe?

Yup, we had this issue. Trivial to fix though. Just add this around the write in the code:

#define WFLUSH_TIMEOUT 60

alarm(WFLUSH_TIMEOUT);
i = write(fd, pb, ct);
alarm(0);

You may need to modify the code slightly, as we are using a heavily patched fork of Torque.

--
Mgr. Simon Toth

From knielson at adaptivecomputing.com Fri Jan 25 16:04:28 2013
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 25 Jan 2013 16:04:28 -0700
Subject: [torqueusers] [torquedev] torque: 2.5.x and blocking sockets
In-Reply-To: <5101BC43.8070809@cyf-kr.edu.pl>
References: <5101BC43.8070809@cyf-kr.edu.pl>
Message-ID:

On Thu, Jan 24, 2013 at 3:57 PM, Lukasz Flis wrote:

> Hi All,
>
> This is a question for experienced torque developers, before I
> break something by fixing it wrong.
>
> We are facing pbs_server lockups from time to time. In such a case a server
> restart is required to make it work again.
>
> We are running torque 2.5.12.
>
> strace of the locked pbs_server process shows an unfinished write() syscall
> which waits this way forever.
> > gdb backtrace tells more: > warning: no loadable sections found in added symbol-file system-supplied > DSO at 0x7fffcf371000 > 0x0000003b300c6860 in __write_nocancel () from /lib64/libc.so.6 > (gdb) bt > #0 0x0000003b300c6860 in __write_nocancel () from /lib64/libc.so.6 > #1 0x00000033dcc13f67 in write_nonblocking_socket (fd=165, > buf=0x283a9ac4, count=10205738) > at ../Libifl/nonblock.c:31 > #2 0x00000033dcc1f7ef in DIS_tcp_wflush (fd=165) at > ../Libifl/tcp_dis.c:377 > #3 0x000000000042869a in dis_reply_write (sfds=165, preply=0x134e3748) > at reply_send.c:188 > #4 0x0000000000428815 in reply_send (request=0x134e32d0) at > reply_send.c:283 > #5 0x0000000000443aea in req_stat_job_step2 (cntl=0x1089a940) at > req_stat.c:725 > #6 0x0000000000442ebd in req_stat_job (preq=0x134e32d0) at req_stat.c:308 > #7 0x00000000004272f5 in dispatch_request (sfds=165, > request=0x134e32d0) at process_request.c:984 > #8 0x000000000042701e in process_request (sfds=165) at > process_request.c:730 > #9 0x00000033dcc2cdca in wait_request (waittime=1, SState=0x747418) at > ../Libnet/net_server.c:508 > #10 0x0000000000423b6c in main_loop () at pbsd_main.c:1203 > #11 0x0000000000424b07 in main (argc=3, argv=0x7fffcf288e58) at > pbsd_main.c:1802 > (gdb) > > > For me it looks like write_nonblocking_socket function sometimes is > getting blocking socket for some reason and then waits on write forever > > In such case timeout check doesn't work because it is constructed with > assumption that write never blocks here > > This way some misbehaving client with unstable network of firewall > problems can cause pbs_server to stop serving requests > > I see no checks whether socket is blocking or not in this function. 
> On the other hand, read_nonblocking_socket does such checks and turns
> O_NONBLOCK on if necessary.
>
> As write_nonblocking_socket is a quite frequently used piece of code, I
> would like to ask whether changing the socket's blocking mode inside
> the function may break some functionality or slow things down
> significantly?
>
> Can anyone confirm that such pbs_server hangs have been seen around the globe?
>
> PS> I would like to remind you of the unofficial #torque IRC channel on
> the freenode network, where you can sometimes get quick help :)
>
> Best Regards
> --
> Lukasz Flis
>

Lukasz,

From what you are describing and the backtrace, I would say you are correct. It seems we are getting stuck on a blocking write. Ensuring this is non-blocking is the right thing to do.

Something else to think about: if a write gets blocked, it is because the receiving side is not reading the data.

Regards

Ken

From bdandrus at nps.edu Fri Jan 25 18:27:04 2013
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Sat, 26 Jan 2013 01:27:04 +0000
Subject: [torqueusers] Can we use torque 4.2.0 for single machine acting as both server and node
In-Reply-To: <2F92A153181B8740B9ADB294A1A2918E5DA56A50@HJ-MBX2.persistent.co.in>
References: <2F92A153181B8740B9ADB294A1A2918E5DA56A50@HJ-MBX2.persistent.co.in>
Message-ID:

Deepak,

Yes, it is possible. If you install pbs_mom and pbs_server along with a scheduler (e.g. pbs_sched), you should be just fine.

Caveats:
This in and of itself is not practical. If you are the only one on the box, you should be aware of the resources and not need a scheduler at all. If others are also logging on, well, pbs will have no idea what resources they are using, so your box could be brought to its knees.
I am assuming you are looking at this environment as a learning tool, not a production environment.
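
As an illustration of the single-machine setup discussed in this thread, the sketch below prints a nodes-file entry for the local host that advertises one core fewer than the machine has. The helper name nodes_entry is hypothetical (it is not part of TORQUE), and the sketch assumes Linux (getconf _NPROCESSORS_ONLN) and the default TORQUE home of /var/spool/torque, where the nodes file would be server_priv/nodes.

```shell
#!/bin/sh
# Hypothetical helper, not part of TORQUE: print a nodes-file entry
# ("<host> np=<count>") that leaves one core free for non-TORQUE work.
nodes_entry() {
    host="$1"
    cores="$2"
    if [ "$cores" -gt 1 ]; then
        np=$((cores - 1))
    else
        np=1    # never advertise zero cores
    fi
    printf '%s np=%s\n' "$host" "$np"
}

# Assumption: Linux host, default TORQUE home. Review the output and then
# place it in /var/spool/torque/server_priv/nodes by hand.
nodes_entry "$(hostname)" "$(getconf _NPROCESSORS_ONLN)"
```

The entry is only printed, not written, so it can be checked before pbs_server is restarted to pick up the nodes file.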
That said, you may want to set your nodes file to specify 1 less core than there are on the box. This will at least ensure your jobs don't walk all over some of the other things you will have running. Same with memory, you may want to tell torque to manage less than there really is to prevent non-torque stuff from being walked on by torque stuff :) Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Deepak Shingan Sent: Thursday, January 24, 2013 8:01 PM To: torqueusers at supercluster.org Subject: [torqueusers] Can we use torque 4.2.0 for single machine acting as both server and node Hi, I am new to torque and wanted to do some proof of concept around some requirement but I don't have cluster setup. So is it possible to run torque 4.2.0 on single machine with multiple core acting as both server and compute node. If yes what are the changes I need to do for setup? Regards, -Deepak DISCLAIMER ========== This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130126/75c12a4e/attachment.html

From bdandrus at nps.edu Fri Jan 25 19:37:14 2013
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Sat, 26 Jan 2013 02:37:14 +0000
Subject: [torqueusers] PBS environmental variables and -V
In-Reply-To: 
References: <6E440804-F6BF-4609-86C8-662DAADEDA82@umich.edu>
Message-ID: 

We use environment modules too. I have folks put those in their scripts.

    #PBS -l procs=10
    module load compile/pgi mpi/openmpi
    mpirun myprog

Now when they say "my program is not running", I can easily replicate their environment because it is all in the script AND they don't shoot themselves in the foot if they are submitting a job from within a job.

What could work if you are using -V is something like:

    # remember the file name first, because PBS_JOBID is unset in the loop
    saved=/tmp/$PBS_JOBID.sh
    for val in $(env | awk -F= '/^PBS/{print $1}'); do
        # append (>>) one line per variable instead of overwriting the file
        printf 'export %s=%q\n' "$val" "${!val}" >> "$saved"
        unset "$val"
    done
    qsub -V somescript.sh
    # restore the PBS_* variables of the current job
    source "$saved"
    rm "$saved"

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson
Sent: Wednesday, January 23, 2013 3:11 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] PBS environmental variables and -V

Matt,

I guess you would call it a bug. I think it is simply years of maintenance without knowledge of the original intent. Rick made a ticket for this and we will see what it currently does and try to figure out what it should do and then let everyone know.

Ken

On Wed, Jan 23, 2013 at 3:30 PM, Matthew Britt wrote:

As a counter argument, the value of PBS_NODEFILE isn't getting set to the same value for the second job. Like PBS_NODEFILE, I would expect the PBS_O variables to get set correctly for the submitted job and not taken from the environment passed into it.
It isn't consistent w/ PBS_O variables either, as PBS_O_QUEUE is overwritten w/ the correct value (I exported a different value into PBS_O_QUEUE to check).

fwiw, we're using Environment Modules, so we set several variables for software packages, like license servers, process launchers (like hydra), etc. The users might not be aware that these variables are necessary, so we have the users load the appropriate module(s) and submit with -V.

If the PBS_O_ variables are treated differently than PBS_ variables, that's fine; I was curious whether it was by design or a bug.

- Matt
--------------------------------------------
Matthew Britt
CAEN HPC Group - College of Engineering
msbritt at umich.edu

On Jan 23, 2013, at 2:54 PM, "Andrus, Brian Contractor" wrote:

> Seems to me that would be by design and you need to be aware of it.
> -V basically just takes the output of 'env' and sets anything that is set.
> Since you already have PBS_* variables, I would expect them to be set already.
>
> This could be handled by pbs_mom if it were to first set the -V stuff and then set the PBS_* stuff.
> You can work around this by iterating through and unsetting all the PBS_* variables before doing the qsub.
>
> I tend to render this moot by strongly discouraging the use of -V.
> Its use makes it difficult to troubleshoot when folks use things like "./a.out" to run their programs.
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
>
>
>> -----Original Message-----
>> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Matthew Britt
>> Sent: Tuesday, January 22, 2013 11:22 AM
>> To: Torque Users Mailing List
>> Subject: [torqueusers] PBS environmental variables and -V
>>
>> Hello everyone. What is the expected behavior in precedence between
>> PBS environmental variables and -V when a job is submitted from an
>> interactive job?
With torque 4.1.4 (and possibly earlier), the variables from
>> the environment of the interactive shell are passed along w/ the newly
>> submitted job (either interactive or batch). We've seen both PBS_O_HOST
>> and PBS_O_WORKDIR be set to values of the first job rather than the
>> attributes of the second job.
>>
>> As an example:
>>
>> [msbritt at nyx ~]$ cd bin
>> [msbritt at nyx bin]$ pwd
>> /home/msbritt/bin
>> [msbritt at nyx bin]$ qsub -I -l nodes=1,walltime=5:00 -q flux -A msbritt_flux -V
>> qsub: waiting for job 9445802.nyx.engin.umich.edu to start
>> qsub: job 9445802.nyx.engin.umich.edu ready
>>
>> [msbritt at nyx5515 ~]$ echo $PBS_O_WORKDIR
>> /home/msbritt/bin
>> [msbritt at nyx5515 ~]$ echo $PBS_O_HOST
>> nyx.engin.umich.edu
>> [msbritt at nyx5515 ~]$ pwd
>> /home/msbritt
>>
>> [msbritt at nyx5515 ~]$ qsub -I -l nodes=1,walltime=5:00 -q flux -A msbritt_flux -V
>> qsub: waiting for job 9445813.nyx.engin.umich.edu to start
>> qsub: job 9445813.nyx.engin.umich.edu ready
>>
>> [msbritt at nyx5623 ~]$ pwd
>> /home/msbritt
>> [msbritt at nyx5623 ~]$ echo $PBS_O_WORKDIR
>> /home/msbritt/bin (arguably should be /home/msbritt)
>> [msbritt at nyx5623 ~]$ echo $PBS_O_HOST
>> nyx.engin.umich.edu (arguably should be nyx5515)
>>
>>
>> Should -V not read the PBS_O_* variables on job submission or at least be
>> overwritten and correctly set in the next job, or should -V trump?
>>
>> Thanks,
>> - Matt
>>
>> --------------------------------------------
>> Matthew Britt
>> CAEN HPC Group - College of Engineering
>> msbritt at umich.edu
>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130126/7571dc27/attachment-0001.html

From andre.gemuend at scai.fraunhofer.de Sat Jan 26 04:32:16 2013
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Sat, 26 Jan 2013 12:32:16 +0100 (CET)
Subject: [torqueusers] How to set queue's max nodes?
In-Reply-To: <5102C634.40803@ldeo.columbia.edu>
Message-ID: <948752182.1298775.1359199936363.JavaMail.root@scai.fraunhofer.de>

Hi,

I'm a bit surprised by that. Since when is nodes an integer in Torque? It used to be a string in earlier versions.

Greetings
André

----- Original Message -----
> Hi Shixing
>
> Have you tried:
>
> set queue myqueue resources_max.nodes = 10
>
> More info:
>
> http://docs.adaptivecomputing.com/torque/4-1-3/help.htm#topics/4-serverPolicies/queueAttributes.htm
>
> under "Assigning queue resource limits".
>
> I hope this helps,
> Gus
>
> On 01/25/2013 05:44 AM, André Gemünd wrote:
> > I don't think that setting resources_available.nodect has any
> > effect.
> > Have you tried resources_max.nodect?
> >
> > Greetings
> >
> > ----- Original Message -----
> >>
> >> Hi, all:
> >> Recently I have set up a cluster with 200 nodes.
And this cluster > >> is > >> designed for serving several team. And I want split the nodes to > >> some queues. > >> I have set the queue attr like this: > >> set queue team1 resources_available.nodect = 3 > >> > >> > >> But when I submit a job applying nodes>3, it will also run > >> successfully. The submit command is : > >> echo "sleep 100" | qsub -l nodes=4 -q team1 > >> > >> > >> So how can I set the max nodes for the queues? > >> -- > >> Best wishes! > >> My Friend~ > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Andr? Gem?nd Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemuend at scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend From bunk at physik.hu-berlin.de Sat Jan 26 06:31:26 2013 From: bunk at physik.hu-berlin.de (Burkhard Bunk) Date: Sat, 26 Jan 2013 14:31:26 +0100 (CET) Subject: [torqueusers] How to set queue's max nodes? In-Reply-To: <948752182.1298775.1359199936363.JavaMail.root@scai.fraunhofer.de> References: <948752182.1298775.1359199936363.JavaMail.root@scai.fraunhofer.de> Message-ID: Hi, I would support that. The correct form, IMHO, is set queue myqueue resources_max.nodect = 10 Regards, Burkhard Bunk. ---------------------------------------------------------------------- bunk at physik.hu-berlin.de Physics Institute, Humboldt University fax: ++49-30 2093 7628 Newtonstr. 15 phone: ++49-30 2093 7980 12489 Berlin, Germany ---------------------------------------------------------------------- On Sat, 26 Jan 2013, Andr? Gem?nd wrote: > Hi, > > I'm a bit surprised by that. Since when is nodes an integer in Torque? It used to be a string in earlier versions. 
> > Greetings > Andr? > > ----- Urspr?ngliche Mail ----- >> Hi Shixing >> >> Have you tried: >> >> set queue myqueue resources_max.nodes = 10 >> >> More info: >> >> http://docs.adaptivecomputing.com/torque/4-1-3/help.htm#topics/4-serverPolicies/queueAttributes.htm >> >> under "Assigning queue resource limits". >> >> I hope this helps, >> Gus >> >> On 01/25/2013 05:44 AM, Andr? Gem?nd wrote: >> > I don't think that setting resources_available.nodect has any >> > effect. >> > Have you tried resources_max.nodect? >> > >> > Greetings >> > >> > ----- Urspr?ngliche Mail ----- >> >> >> >> Hi, all: >> >> Recently I have set up a cluster with 200 nodes. And this cluster >> >> is >> >> designed for serving several team. And I want split the nodes to >> >> some queues. >> >> I have set the queue attr like this: >> >> set queue team1 resources_available.nodect = 3 >> >> >> >> >> >> But when I submit a job applying nodes>3, it will also run >> >> successfully. The submit command is : >> >> echo "sleep 100" | qsub -l nodes=4 -q team1 >> >> >> >> >> >> So how can I set the max nodes for the queues? >> >> -- >> >> Best wishes! >> >> My Friend~ >> >> _______________________________________________ >> >> torqueusers mailing list >> >> torqueusers at supercluster.org >> >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> > >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > -- > Andr? 
Gemünd
> Fraunhofer-Institute for Algorithms and Scientific Computing
> andre.gemuend at scai.fraunhofer.de
> Tel: +49 2241 14-2193
> /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From gus at ldeo.columbia.edu Sat Jan 26 12:11:04 2013
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Sat, 26 Jan 2013 14:11:04 -0500
Subject: [torqueusers] How to set queue's max nodes?
In-Reply-To: 
References: <948752182.1298775.1359199936363.JavaMail.root@scai.fraunhofer.de>
Message-ID: 

I misunderstood Shixing's original question.
I thought he wanted to prevent each job from exceeding a certain number of nodes,
but what he wants is to apply the limit to the sum of all jobs in the queue, correct?

Burkhard is right.
The intent of "nodect" seems to be to limit the number of nodes used by all jobs in a specific queue,
whereas AFAIK "nodes" limits the number of nodes each job can request when submitted to
a specific queue, right?

Things may have changed in recent versions, but "nodes", with the
interpretation above, works for me in Torque 2.4.11, with Maui 3.2.6p21.

However, "nodect", with the interpretation above, doesn't work for me,
as Shixing also noted, even if I add "ppn=8" to my qsub command
to request all processors in my nodes, and try to exhaust the available resources and hit the nodect limit.
Maybe there is a way to implement what Shixing wants in Maui?

Quoting the Torque Admin Guide, section 4.1.1, "Queues attributes":

"nodes integer Specifies the number of nodes "
[Note, integer, not a string. Not in this context at least.]

"nodect integer Sets the number of nodes available. By default, TORQUE will set the number of nodes available to the number of nodes listed in the $TORQUE_HOME/server_priv/nodes file. nodect can be set to be greater than or less than that number.
Generally, it is used to set the node count higher than the number of physical nodes in the cluster." http://docs.adaptivecomputing.com/torque/4-1-3/help.htm#topics/4-serverPolicies/queueAttributes.htm Admittedly, the Guide wording is not very clear. It could include "on a per queue basis", "on a per job basis", or something the like, to clarify the context. The final sentence in "nodect" sounds a bit awkward. Does it work to set the node count *smaller* than the number of physical nodes? Does this depend on the scheduler configuration? [pbs_sched, Maui, Moab] Somebody from Adaptive could clarify. Gus Correa On Jan 26, 2013, at 8:31 AM, Burkhard Bunk wrote: > Hi, > > I would support that. > The correct form, IMHO, is > > set queue myqueue resources_max.nodect = 10 > > Regards, > Burkhard Bunk. > ---------------------------------------------------------------------- > bunk at physik.hu-berlin.de Physics Institute, Humboldt University > fax: ++49-30 2093 7628 Newtonstr. 15 > phone: ++49-30 2093 7980 12489 Berlin, Germany > ---------------------------------------------------------------------- > > On Sat, 26 Jan 2013, Andr? Gem?nd wrote: > >> Hi, >> I'm a bit surprised by that. Since when is nodes an integer in Torque? It used to be a string in earlier versions. >> >> Greetings >> Andr? >> >> ----- Urspr?ngliche Mail ----- >>> Hi Shixing >>> Have you tried: >>> set queue myqueue resources_max.nodes = 10 >>> More info: >>> http://docs.adaptivecomputing.com/torque/4-1-3/help.htm#topics/4-serverPolicies/queueAttributes.htm >>> under "Assigning queue resource limits". >>> I hope this helps, >>> Gus >>> On 01/25/2013 05:44 AM, Andr? Gem?nd wrote: >>> > I don't think that setting resources_available.nodect has any >>> > effect. >>> > Have you tried resources_max.nodect? >>> > >>> > Greetings >>> > >>> > ----- Urspr?ngliche Mail ----- >>> >> >>> >> Hi, all: >>> >> Recently I have set up a cluster with 200 nodes. 
And this cluster >>> >> is >>> >> designed for serving several team. And I want split the nodes to >>> >> some queues. >>> >> I have set the queue attr like this: >>> >> set queue team1 resources_available.nodect = 3 >>> >> >>> >> >>> >> But when I submit a job applying nodes>3, it will also run >>> >> successfully. The submit command is : >>> >> echo "sleep 100" | qsub -l nodes=4 -q team1 >>> >> >>> >> >>> >> So how can I set the max nodes for the queues? >>> >> -- >>> >> Best wishes! >>> >> My Friend~ >>> >> _______________________________________________ >>> >> torqueusers mailing list >>> >> torqueusers at supercluster.org >>> >> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >> >>> > >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> -- >> Andr? Gem?nd >> Fraunhofer-Institute for Algorithms and Scientific Computing >> andre.gemuend at scai.fraunhofer.de >> Tel: +49 2241 14-2193 >> /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From paradisehit at gmail.com Sun Jan 27 01:21:24 2013 From: paradisehit at gmail.com (shixing) Date: Sun, 27 Jan 2013 16:21:24 +0800 Subject: [torqueusers] How to set queue's max nodes? 
In-Reply-To: 
References: <948752182.1298775.1359199936363.JavaMail.root@scai.fraunhofer.de>
Message-ID: 

When I set the queue with both nodect and nodes like this:

create queue team1
set queue team1 queue_type = Execution
set queue team1 resources_max.nodect = 2
set queue team1 resources_max.nodes = 2
set queue team1 keep_completed = 2592000
set queue team1 enabled = True
set queue team1 started = True

I can still submit large jobs, and the total number of nodes used by the running jobs exceeds resources_max.nodect and resources_max.nodes (both set to 2 here).
I submit the job 4 times like this:

echo "sleep 100" | qsub -l nodes=1:ppn=4 -q team1

And qstat shows the jobs like this:

1661.vkvm161057 STDIN shubao.sx 0 R team1
1662.vkvm161057 STDIN shubao.sx 0 R team1
1663.vkvm161057 STDIN shubao.sx 0 R team1
1664.vkvm161057 STDIN shubao.sx 0 R team1

I use torque 4.1.3 and maui 3.3.1.

On Sun, Jan 27, 2013 at 3:11 AM, Gustavo Correa wrote:

> I misunderstood Shixing's original question.
> I thought he wanted to prevent each job from exceeding a certain number of nodes,
> but what he wants is to apply the limit to the sum of all jobs in the queue,
> correct?
>
> Burkhard is right.
> The intent of "nodect" seems to be to limit the number of nodes used by all jobs
> in a specific queue,
> whereas AFAIK "nodes" limits the number of nodes each job can request
> when submitted to
> a specific queue, right?
>
> Things may have changed in recent versions, but "nodes", with the
> interpretation above, works for me in Torque 2.4.11, with Maui 3.2.6p21.
>
> However, "nodect", with the interpretation above, doesn't work for me,
> as Shixing also noted, even if I add "ppn=8" to my qsub command
> to request all processors in my nodes, and try to exhaust the available
> resources and hit the nodect limit.
> Maybe there is a way to implement what Shixing wants in Maui?
> > Quoting the Torque Admin Guide, section 4.1.1, "Queues attributes": > > "nodes integer Specifies the number of nodes " > [Note, integer, not a string. Not in this context at least.] > > "nodect integer Sets the number of nodes available. By default, > TORQUE will set the number of nodes available to the number of nodes listed > in the $TORQUE_HOME/server_priv/nodes file. nodect can be set to be greater > than or less than that number. Generally, it is used to set the node count > higher than the number of physical nodes in the cluster." > > > http://docs.adaptivecomputing.com/torque/4-1-3/help.htm#topics/4-serverPolicies/queueAttributes.htm > > Admittedly, the Guide wording is not very clear. > It could include "on a per queue basis", "on a per job basis", or > something the like, > to clarify the context. > The final sentence in "nodect" sounds a bit awkward. > Does it work to set the node count *smaller* than the number > of physical nodes? > Does this depend on the scheduler configuration? [pbs_sched, Maui, Moab] > > Somebody from Adaptive could clarify. > > Gus Correa > > On Jan 26, 2013, at 8:31 AM, Burkhard Bunk wrote: > > > Hi, > > > > I would support that. > > The correct form, IMHO, is > > > > set queue myqueue resources_max.nodect = 10 > > > > Regards, > > Burkhard Bunk. > > ---------------------------------------------------------------------- > > bunk at physik.hu-berlin.de Physics Institute, Humboldt University > > fax: ++49-30 2093 7628 Newtonstr. 15 > > phone: ++49-30 2093 7980 12489 Berlin, Germany > > ---------------------------------------------------------------------- > > > > On Sat, 26 Jan 2013, Andr? Gem?nd wrote: > > > >> Hi, > >> I'm a bit surprised by that. Since when is nodes an integer in Torque? > It used to be a string in earlier versions. > >> > >> Greetings > >> Andr? 
> >> > >> ----- Urspr?ngliche Mail ----- > >>> Hi Shixing > >>> Have you tried: > >>> set queue myqueue resources_max.nodes = 10 > >>> More info: > >>> > http://docs.adaptivecomputing.com/torque/4-1-3/help.htm#topics/4-serverPolicies/queueAttributes.htm > >>> under "Assigning queue resource limits". > >>> I hope this helps, > >>> Gus > >>> On 01/25/2013 05:44 AM, Andr? Gem?nd wrote: > >>> > I don't think that setting resources_available.nodect has any > >>> > effect. > >>> > Have you tried resources_max.nodect? > >>> > > >>> > Greetings > >>> > > >>> > ----- Urspr?ngliche Mail ----- > >>> >> > >>> >> Hi, all: > >>> >> Recently I have set up a cluster with 200 nodes. And this cluster > >>> >> is > >>> >> designed for serving several team. And I want split the nodes to > >>> >> some queues. > >>> >> I have set the queue attr like this: > >>> >> set queue team1 resources_available.nodect = 3 > >>> >> > >>> >> > >>> >> But when I submit a job applying nodes>3, it will also run > >>> >> successfully. The submit command is : > >>> >> echo "sleep 100" | qsub -l nodes=4 -q team1 > >>> >> > >>> >> > >>> >> So how can I set the max nodes for the queues? > >>> >> -- > >>> >> Best wishes! > >>> >> My Friend~ > >>> >> _______________________________________________ > >>> >> torqueusers mailing list > >>> >> torqueusers at supercluster.org > >>> >> http://www.supercluster.org/mailman/listinfo/torqueusers > >>> >> > >>> > > >>> _______________________________________________ > >>> torqueusers mailing list > >>> torqueusers at supercluster.org > >>> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > >> -- > >> Andr? 
Gemünd
> >> Fraunhofer-Institute for Algorithms and Scientific Computing
> >> andre.gemuend at scai.fraunhofer.de
> >> Tel: +49 2241 14-2193
> >> /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
> >> _______________________________________________
> >> torqueusers mailing list
> >> torqueusers at supercluster.org
> >> http://www.supercluster.org/mailman/listinfo/torqueusers
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
Best wishes!
My Friend~
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130127/55e8aa8c/attachment-0001.html

From bunk at physik.hu-berlin.de Sun Jan 27 07:49:27 2013
From: bunk at physik.hu-berlin.de (Burkhard Bunk)
Date: Sun, 27 Jan 2013 15:49:27 +0100 (CET)
Subject: [torqueusers] How to set queue's max nodes?
In-Reply-To: 
References: <948752182.1298775.1359199936363.JavaMail.root@scai.fraunhofer.de>
Message-ID: 

Hi,

the "nodes" resource has always been string-valued, and this cannot
be changed without breaking current installations. (The docs must be in
error at this point.)
Setting a default is ok, as in

    set queue defq resources_default.nodes = 1:ppn=4

but "resources_min.nodes" and "resources_max.nodes" are considered invalid.

While we are at it:
The interpretation of "resources_max.nodect" seems to be tricky.
With torque alone (FIFO scheduler), it used to act on a per-queue level,
but when I introduced Maui, the interpretation changed to a per-job
limitation. I found this with torque-2.5.x and maui-3.3; I have no idea whether
it's intentional or a bug.
A more precise handling of limits should be possible with Maui's

    CLASSCFG[] ...
and a (space separated) list of settings e.g. for

    MAXJOB MAXJOBPERUSER
    MAXNODE MAXNODEPERUSER
    MAXPROC MAXPROCPERUSER

(see Manual, part 6.2.1), but I haven't tested this so far.

Regards,
Burkhard Bunk.
----------------------------------------------------------------------
bunk at physik.hu-berlin.de      Physics Institute, Humboldt University
fax:    ++49-30 2093 7628     Newtonstr. 15
phone:  ++49-30 2093 7980     12489 Berlin, Germany
----------------------------------------------------------------------

On Sun, 27 Jan 2013, shixing wrote:

> When I set the queue both with nodect and nodes like this:
> create queue team1
> set queue team1 queue_type = Execution
> set queue team1 resources_max.nodect = 2
> set queue team1 resources_max.nodes = 2
> set queue team1 keep_completed = 2592000
> set queue team1 enabled = True
> set queue team1 started = True
>
> I can also submit large jobs and all the nodes used by the running jobs exceed
> the resources_max.nodect or resources_max.nodes (here are both 2).
> I submit the jobs 4 times like this:
> echo "sleep 100" | qsub -l nodes=1:ppn=4 -q team1
>
> And qstat shows the command like this:
> 1661.vkvm161057 STDIN shubao.sx 0 R team1
> 1662.vkvm161057 STDIN shubao.sx 0 R team1
> 1663.vkvm161057 STDIN shubao.sx 0 R team1
> 1664.vkvm161057 STDIN shubao.sx 0 R team1
>
> I use torque 4.1.3 and maui 3.3.1.
>
> On Sun, Jan 27, 2013 at 3:11 AM, Gustavo Correa wrote:
> I misundertood Shixing's original question.
> I though he wanted to prevent each job to exceed a certain number
> of nodes,
> but what he wants to apply the limit to the sum of all jobs in the
> queue, correct?
>
> Burkhard is right.
> "nodect" intent seems to be to limit the number of nodes used by > all jobs in a specific queue, > whereas ?AFAIK "nodes" limits the number of nodes each job can > request when submitted to > a specific queue, right? > > Things may have changed in recent versions, but "nodes" , with the > interpretation above, works for me in Torque 2.4.11, with Maui > 3.2.6p21. > > However, "nodect", with the interpretation above, doesn't work for > me, > as Shixing also noted, even if I add "ppn=8" to my qsub command, > to request all processors in my nodes, and try to exhaust the > available resources and hit the nodect limit. > Maybe there is a way to implement what Shixing wants in Maui? > > Quoting the Torque Admin Guide, section 4.1.1, "Queues > attributes": > > "nodes ?integer Specifies the number of nodes " > [Note, integer, not a string. ?Not in this context at least.] > > "nodect integer ? ? ? ? Sets the number of nodes available. By > default, TORQUE will set the number of nodes available to the > number of nodes listed in the $TORQUE_HOME/server_priv/nodes file. > nodect can be set to be greater than or less than that number. > Generally, it is used to set the node count higher than the number > of physical nodes in the cluster." > > http://docs.adaptivecomputing.com/torque/4-1-3/help.htm#topics/4-serverPolicies/queue > Attributes.htm > > Admittedly, the Guide wording is not very clear. > It could include "on a per queue basis", "on a per job basis", or > something the like, > to clarify the context. > The final sentence in "nodect" sounds a bit awkward. > Does it work to set the node count *smaller* than the number > of physical nodes? > Does this depend on the scheduler configuration? ?[pbs_sched, > Maui, Moab] > > Somebody from Adaptive could clarify. > > Gus Correa > > On Jan 26, 2013, at 8:31 AM, Burkhard Bunk wrote: > > > Hi, > > > > I would support that. 
> > The correct form, IMHO, is > > > > set queue myqueue resources_max.nodect = 10 > > > > Regards, > > Burkhard Bunk. > > > ---------------------------------------------------------------------- > > bunk at physik.hu-berlin.de ? ? ?Physics Institute, Humboldt > University > > fax: ? ?++49-30 2093 7628 ? ? Newtonstr. 15 > > phone: ?++49-30 2093 7980 ? ? 12489 Berlin, Germany > > > ---------------------------------------------------------------------- > > > > On Sat, 26 Jan 2013, Andr? Gem?nd wrote: > > > >> Hi, > >> I'm a bit surprised by that. Since when is nodes an integer in > Torque? It used to be a string in earlier versions. > >> > >> Greetings > >> Andr? > >> > >> ----- Urspr?ngliche Mail ----- > >>> Hi Shixing > >>> Have you tried: > >>> set queue myqueue resources_max.nodes = 10 > >>> More info: > >>>http://docs.adaptivecomputing.com/torque/4-1-3/help.htm#topics/4-serverPolicies/queue > Attributes.htm > >>> under "Assigning queue resource limits". > >>> I hope this helps, > >>> Gus > >>> On 01/25/2013 05:44 AM, Andr? Gem?nd wrote: > >>> > I don't think that setting resources_available.nodect has > any > >>> > effect. > >>> > Have you tried resources_max.nodect? > >>> > > >>> > Greetings > >>> > > >>> > ----- Urspr?ngliche Mail ----- > >>> >> > >>> >> Hi, all: > >>> >> Recently I have set up a cluster with 200 nodes. And this > cluster > >>> >> is > >>> >> designed for serving several team. And I want split the > nodes to > >>> >> some queues. > >>> >> I have set the queue attr like this: > >>> >> set queue team1 resources_available.nodect = 3 > >>> >> > >>> >> > >>> >> But when I submit a job applying nodes>3, it will also run > >>> >> successfully. The submit command is : > >>> >> echo "sleep 100" | qsub -l nodes=4 -q team1 > >>> >> > >>> >> > >>> >> So how can I set the max nodes for the queues? > >>> >> -- > >>> >> Best wishes! 
> >>> >> My Friend~
> >>> >> _______________________________________________
> >>> >> torqueusers mailing list
> >>> >> torqueusers at supercluster.org
> >>> >> http://www.supercluster.org/mailman/listinfo/torqueusers
> >>> >>
> >>> >
> >>> _______________________________________________
> >>> torqueusers mailing list
> >>> torqueusers at supercluster.org
> >>> http://www.supercluster.org/mailman/listinfo/torqueusers
> >>
> >> --
> >> André Gemünd
> >> Fraunhofer-Institute for Algorithms and Scientific Computing
> >> andre.gemuend at scai.fraunhofer.de
> >> Tel: +49 2241 14-2193
> >> /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
> >> _______________________________________________
> >> torqueusers mailing list
> >> torqueusers at supercluster.org
> >> http://www.supercluster.org/mailman/listinfo/torqueusers
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
> --
> Best wishes!
> My Friend~
>
>

From paradisehit at gmail.com Sun Jan 27 19:43:23 2013
From: paradisehit at gmail.com (shixing)
Date: Mon, 28 Jan 2013 10:43:23 +0800
Subject: [torqueusers] How to set queue's max nodes?
In-Reply-To: 
References: <948752182.1298775.1359199936363.JavaMail.root@scai.fraunhofer.de>
Message-ID: 

I have tested the CLASSCFG[] ... setting:

CLASSCFG[team1] MAXNODE=3

and restarted maui to make the config take effect. But when I submit 3 jobs, each requesting 2 nodes, all three jobs are running.

1665.vkvm161057 STDIN shubao.sx 0 R team1
1666.vkvm161057 STDIN shubao.sx 0 R team1
1667.vkvm161057 STDIN shubao.sx 0 R team1

I think maybe I should have a look at torque's source code.
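
When testing limits like this, it can help to count the running jobs per queue straight from the qstat listing rather than reading it by eye. A rough sketch, assuming the default qstat column layout (job id, name, user, time use, state, queue) and job names without spaces; the function name running_in_queue is made up for illustration. Note that it counts jobs, not the nodes each job holds.

```shell
#!/bin/sh
# Count jobs in state R for one queue, reading default-format qstat
# output on stdin. running_in_queue is a hypothetical helper.
# Assumed columns: job id, name, user, time use, state, queue.
#   $1 = queue name
running_in_queue() {
    awk -v q="$1" '$5 == "R" && $6 == q { n++ } END { print n + 0 }'
}

# Typical use on a live system (not run here):
#   qstat | running_in_queue team1
```

Header and separator lines of the listing do not match the state/queue test, so they are skipped without special handling.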
On Sun, Jan 27, 2013 at 10:49 PM, Burkhard Bunk wrote: > Hi, > > the "nodes" resource has always been string-valued, and this cannot > be changed without breaking current installations. (The docs must be in > error at this point.) > Setting a default is ok, as in > > set queue defq resources_default.nodes = 1:ppn=4 > > but "resources_min.nodes" and "resources_max.nodes" are considered invalid. > > As we are on it: > The interpretation of "resources_max.nodect" seems to be tricky. > With torque alone (FIFO scheduler), it used to act on a per-queue level, > but when I introduced Maui, the interpretation changed to a per-job > limitation. I found this with torque-2.5.x and maui-3.3, no idea whether > it's intentional or a bug. > A more precise handling of limits should be possible with Maui's > > CLASSCFG[] ... > > and a (space separated) list of settings e.g. for > > MAXJOB MAXJOBPERUSER > MAXNODE MAXNODEPERUSER > MAXPROC MAXPROCPERUSER > > (see Manual, part 6.2.1), but I haven't tested this so far. > > > Regards, > Burkhard Bunk. > ------------------------------**------------------------------**---------- > bunk at physik.hu-berlin.de Physics Institute, Humboldt University > fax: ++49-30 2093 7628 Newtonstr. 15 > phone: ++49-30 2093 7980 12489 Berlin, Germany > ------------------------------**------------------------------**---------- > > On Sun, 27 Jan 2013, shixing wrote: > > When I set the queue both with nodect and nodes like this:create queue >> team1 >> set queue team1 queue_type = Execution >> set queue team1 resources_max.nodect = 2 >> set queue team1 resources_max.nodes = 2 >> set queue team1 keep_completed = 2592000 >> set queue team1 enabled = True >> set queue team1 started = True >> >> I can also submit large jobs and all the nodes used by the running jobs >> exceed >> the resources_max.nodect or resources_max.nodes (here are both 2). 
>> I submit the jobs 4 times like this:
>> echo "sleep 100" | qsub -l nodes=1:ppn=4 -q team1
>>
>> And qstat shows the jobs like this:
>> 1661.vkvm161057 STDIN shubao.sx 0 R team1
>> 1662.vkvm161057 STDIN shubao.sx 0 R team1
>> 1663.vkvm161057 STDIN shubao.sx 0 R team1
>> 1664.vkvm161057 STDIN shubao.sx 0 R team1
>>
>> I use torque 4.1.3 and maui 3.3.1.
>>
>> On Sun, Jan 27, 2013 at 3:11 AM, Gustavo Correa wrote:
>> I misunderstood Shixing's original question.
>> I thought he wanted to prevent each job from exceeding a certain number
>> of nodes, but what he wants is to apply the limit to the sum of all jobs
>> in the queue, correct?
>>
>> Burkhard is right.
>> The intent of "nodect" seems to be to limit the number of nodes used by
>> all jobs in a specific queue, whereas AFAIK "nodes" limits the number of
>> nodes each job can request when submitted to a specific queue, right?
>>
>> Things may have changed in recent versions, but "nodes", with the
>> interpretation above, works for me in Torque 2.4.11, with Maui 3.2.6p21.
>>
>> However, "nodect", with the interpretation above, doesn't work for me,
>> as Shixing also noted, even if I add "ppn=8" to my qsub command
>> to request all processors in my nodes, and try to exhaust the
>> available resources and hit the nodect limit.
>> Maybe there is a way to implement what Shixing wants in Maui?
>>
>> Quoting the Torque Admin Guide, section 4.1.1, "Queues attributes":
>>
>> "nodes integer Specifies the number of nodes"
>> [Note, integer, not a string. Not in this context at least.]
>>
>> "nodect integer Sets the number of nodes available. By
>> default, TORQUE will set the number of nodes available to the
>> number of nodes listed in the $TORQUE_HOME/server_priv/nodes file.
>> nodect can be set to be greater than or less than that number.
>> Generally, it is used to set the node count higher than the number
>> of physical nodes in the cluster."
>>
>> http://docs.adaptivecomputing.com/torque/4-1-3/help.htm#topics/4-serverPolicies/queueAttributes.htm
>>
>> Admittedly, the Guide wording is not very clear.
>> It could include "on a per queue basis", "on a per job basis", or
>> something similar, to clarify the context.
>> The final sentence in "nodect" sounds a bit awkward.
>> Does it work to set the node count *smaller* than the number
>> of physical nodes?
>> Does this depend on the scheduler configuration? [pbs_sched, Maui, Moab]
>>
>> Somebody from Adaptive could clarify.
>>
>> Gus Correa
>>
>> On Jan 26, 2013, at 8:31 AM, Burkhard Bunk wrote:
>>
>> > Hi,
>> >
>> > I would support that.
>> > The correct form, IMHO, is
>> >
>> > set queue myqueue resources_max.nodect = 10
>> >
>> > Regards,
>> > Burkhard Bunk.
>> > ----------------------------------------------------------------------
>> > bunk at physik.hu-berlin.de Physics Institute, Humboldt University
>> > fax: ++49-30 2093 7628 Newtonstr. 15
>> > phone: ++49-30 2093 7980 12489 Berlin, Germany
>> > ----------------------------------------------------------------------
>> >
>> > On Sat, 26 Jan 2013, André Gemünd wrote:
>> >
>> >> Hi,
>> >> I'm a bit surprised by that. Since when is nodes an integer in
>> >> Torque? It used to be a string in earlier versions.
>> >>
>> >> Greetings
>> >> André
>> >>
>> >> ----- Original Message -----
>> >>> Hi Shixing
>> >>> Have you tried:
>> >>> set queue myqueue resources_max.nodes = 10
>> >>> More info:
>> >>> http://docs.adaptivecomputing.com/torque/4-1-3/help.htm#topics/4-serverPolicies/queueAttributes.htm
>> >>> under "Assigning queue resource limits".
>> >>> I hope this helps,
>> >>> Gus
>> >>> On 01/25/2013 05:44 AM, André Gemünd wrote:
>> >>> > I don't think that setting resources_available.nodect has any
>> >>> > effect.
>> >>> > Have you tried resources_max.nodect?
>> >>> >
>> >>> > Greetings
>> >>> >
>> >>> > ----- Original Message -----
>> >>> >>
>> >>> >> Hi, all:
>> >>> >> Recently I have set up a cluster with 200 nodes. This cluster
>> >>> >> is designed to serve several teams, and I want to split the
>> >>> >> nodes among some queues.
>> >>> >> I have set the queue attribute like this:
>> >>> >> set queue team1 resources_available.nodect = 3
>> >>> >>
>> >>> >> But when I submit a job requesting nodes>3, it still runs
>> >>> >> successfully. The submit command is:
>> >>> >> echo "sleep 100" | qsub -l nodes=4 -q team1
>> >>> >>
>> >>> >> So how can I set the max nodes for the queues?
>> >>> >> --
>> >>> >> Best wishes!
>> >>> >> My Friend~
>> >>> >> _______________________________________________
>> >>> >> torqueusers mailing list
>> >>> >> torqueusers at supercluster.org
>> >>> >> http://www.supercluster.org/mailman/listinfo/torqueusers
>> >>
>> >> --
>> >> André Gemünd
>> >> Fraunhofer-Institute for Algorithms and Scientific Computing
>> >> andre.gemuend at scai.fraunhofer.de
>> >> Tel: +49 2241 14-2193
>> >> /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
>>
>> --
>> Best wishes!
>> My Friend~
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

--
Best wishes!
My Friend~
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130128/b195ae90/attachment-0001.html

From JMRUSHTON at qinetiq.com Mon Jan 28 04:09:45 2013
From: JMRUSHTON at qinetiq.com (Rushton Martin)
Date: Mon, 28 Jan 2013 11:09:45 -0000
Subject: [torqueusers] UC Can we use torque 4.2.0 for single machine acting as both server and node
Message-ID: <20130128113904.C4D7311C8038@http.supercluster.org>

Brian,

I'm afraid I must take issue with your caveat in the general case; it all
depends upon the size of the box. Consider our previous machine, a 64 CPU
Altix on which we used Torque to handle the batch work. Interactive logins
were constrained by cpusets to the first 8 CPUs; the remainder were assigned
to the queue. Memory usage for interactive work was small compared to the
machine as a whole; with sensible limits Torque could allocate lots of memory
to a single job when requested. I'll admit that our workload is atypical: few
users running relatively few but quite large jobs, rather than the more
typical scenario of lots of users running lots of short jobs.

Regards,
Martin

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Andrus, Brian Contractor
Sent: 26 January 2013 01:27
To: Torque Users Mailing List
Subject: Re: [torqueusers] Can we use torque 4.2.0 for single machine acting as both server and node

Deepak,

Yes it is possible. And if you install pbs_mom and pbs_server along with a
scheduler (e.g. pbs_sched) you should be just fine.

Caveats:
This in and of itself is not practical. If you are on the box, you should be
aware of the resources and not need a scheduler at all.
If others are also logging on, well, pbs will have no idea what resources
they are using, so your box could be brought to its knees. I am assuming you
are looking at this environment as a learning tool, not a production
environment.

That said, you may want to set your nodes file to specify 1 less core than
there are on the box. This will at least ensure your jobs don't walk all over
some of the other things you will have running. Same with memory: you may
want to tell torque to manage less than there really is, to prevent
non-torque stuff from being walked on by torque stuff :)

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Deepak Shingan
Sent: Thursday, January 24, 2013 8:01 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] Can we use torque 4.2.0 for single machine acting as both server and node

Hi,

I am new to torque and want to do a proof of concept around a requirement,
but I don't have a cluster set up. Is it possible to run torque 4.2.0 on a
single multi-core machine acting as both server and compute node? If yes,
what changes do I need to make for the setup?

Regards,
-Deepak

DISCLAIMER
==========
This e-mail may contain privileged and confidential information which is the
property of Persistent Systems Ltd. It is intended only for the use of the
individual or entity to which it is addressed. If you are not the intended
recipient, you are not authorized to read, retain, copy, print, distribute or
use this message. If you have received this communication in error, please
notify the sender and delete all copies of this message. Persistent Systems
Ltd. does not accept any liability for virus infected mails.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130128/3cb1fd57/attachment.html

From luiz at if.usp.br Mon Jan 28 07:22:51 2013
From: luiz at if.usp.br (Luiz Carlos dos Santos)
Date: Mon, 28 Jan 2013 12:22:51 -0200
Subject: [torqueusers] error messages
Message-ID: <009301cdfd62$f53b02c0$dfb10840$@if.usp.br>

My system puts the error messages of jobs in "/var/spool/torque/undelivered"
on the node where the job is running, instead of the local directory the job
was submitted from. How can I solve this problem?

Thanks,

Luiz Carlos dos Santos
Analista de Sistemas - IFUSP/FMT
Instituto de Física da USP
Departamento de Física dos Materiais e Mecânica
Pça. do Oceanográfico - Trav E, s/nº
Edifício Alessandro Volta, Bloco C - sala 112
CEP 05508-120 - São Paulo SP
Fone: (11) 3091-6784 / Fax: (11) 3091-6831
E-mail: luiz at if.usp.br
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130128/5748257b/attachment.html

From roman.ricardo at gmail.com Mon Jan 28 07:38:30 2013
From: roman.ricardo at gmail.com (Ricardo Román Brenes)
Date: Mon, 28 Jan 2013 08:38:30 -0600
Subject: [torqueusers] error messages
In-Reply-To: <009301cdfd62$f53b02c0$dfb10840$@if.usp.br>
References: <009301cdfd62$f53b02c0$dfb10840$@if.usp.br>
Message-ID:

Ola Luiz.
We would need to see the config file and the log, since the files in
undelivered are there for some reason that is shown there! :)

On Mon, Jan 28, 2013 at 8:22 AM, Luiz Carlos dos Santos wrote:

> My system has put error messages of jobs in the
> "/var/spool/torque/undelivered", in the node where the job is running,
> instead of the local directory where the job is installed. Please, how can
> I solve this problem.
>
> Thanks,
>
> Luiz Carlos dos Santos
> Analista de Sistemas - IFUSP/FMT
> Instituto de Física da USP
> Departamento de Física dos Materiais e Mecânica
> Pça. do Oceanográfico - Trav E, s/nº
> Edifício Alessandro Volta, Bloco C - sala 112
> CEP 05508-120 - São Paulo SP
> Fone: (11) 3091-6784 / Fax: (11) 3091-6831
> E-mail: luiz at if.usp.br
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130128/9e1b1bb8/attachment-0001.html

From paradisehit at gmail.com Mon Jan 28 08:28:12 2013
From: paradisehit at gmail.com (shixing)
Date: Mon, 28 Jan 2013 23:28:12 +0800
Subject: [torqueusers] error messages
In-Reply-To:
References: <009301cdfd62$f53b02c0$dfb10840$@if.usp.br>
Message-ID:

You can have a look at qstat -f $jobid, where $jobid is one of the job IDs
found in "/var/spool/torque/undelivered".

On Mon, Jan 28, 2013 at 10:38 PM, Ricardo Román Brenes <roman.ricardo at gmail.com> wrote:

> Ola Luiz.
>
> we would need to see the config file and the log, since the files at
> undelivered are there for some reason that is shown there!
>
> :)
>
> On Mon, Jan 28, 2013 at 8:22 AM, Luiz Carlos dos Santos wrote:
>
>> My system has put error messages of jobs in the
>> "/var/spool/torque/undelivered", in the node where the job is running,
>> instead of the local directory where the job is installed. Please, how can
>> I solve this problem.
>>
>> Thanks,
>>
>> Luiz Carlos dos Santos
>> Analista de Sistemas - IFUSP/FMT
>> Instituto de Física da USP
>> Departamento de Física dos Materiais e Mecânica
>> Pça. do Oceanográfico - Trav E, s/nº
>> Edifício Alessandro Volta, Bloco C - sala 112
>> CEP 05508-120 - São Paulo SP
>> Fone: (11) 3091-6784 / Fax: (11) 3091-6831
>> E-mail: luiz at if.usp.br
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers

--
Best wishes!
My Friend~
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130128/1fb10b19/attachment.html

From jjc at iastate.edu Mon Jan 28 13:20:36 2013
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Mon, 28 Jan 2013 20:20:36 +0000
Subject: [torqueusers] error messages
In-Reply-To: <009301cdfd62$f53b02c0$dfb10840$@if.usp.br>
References: <009301cdfd62$f53b02c0$dfb10840$@if.usp.br>
Message-ID: <27743_1359404440_1359404440_242421BFAF465844BE24EB90BB97E2210FA361A8@ITSDAG1D.its.iastate.edu>

Assuming you're the cluster admin, the easiest way is to place

$spool_as_final_name true

in /var/spool/torque/mom_priv/config on the batch nodes and restart the moms;
then you don't have any output going to the nodes. This also allows users to
look at their output while the job runs.
I do find that the output gets appended, so you might want to delete the
output files before the job starts.

This also avoids the problem of users filling up /var/spool/torque with junk.
On my old smaller system, I had users writing 2GB plus output files. It
didn't take many to fill up /var, and then subsequent "good" jobs failed due
to /var being full.

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Luiz Carlos dos Santos
Sent: Monday, January 28, 2013 8:23 AM
To: torqueusers at supercluster.org
Cc: luiz at if.usp.br
Subject: [torqueusers] error messages

My system has put error messages of jobs in the
"/var/spool/torque/undelivered", in the node where the job is running,
instead of the local directory where the job is installed. Please, how can I
solve this problem.

Thanks,

Luiz Carlos dos Santos
Analista de Sistemas - IFUSP/FMT
Instituto de Física da USP
Departamento de Física dos Materiais e Mecânica
Pça. do Oceanográfico - Trav E, s/nº
Edifício Alessandro Volta, Bloco C - sala 112
CEP 05508-120 - São Paulo SP
Fone: (11) 3091-6784 / Fax: (11) 3091-6831
E-mail: luiz at if.usp.br
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130128/fa068657/attachment.html

From wytsang at clustertech.com Mon Jan 28 22:45:56 2013
From: wytsang at clustertech.com (Clotho Tsang)
Date: Tue, 29 Jan 2013 13:45:56 +0800
Subject: [torqueusers] PBS server cannot allocate memory
In-Reply-To:
References:
Message-ID:

Problem solved. The server was limited by the maximum virtual memory size
(ulimit -v).

On 29 January 2013 10:31, Clotho Tsang wrote:

> I would like to report a problem of Torque 4.1.3.
>
> PBS server log:
> 01/29/2013 09:10:30;0001;PBS_Server.33181;Svr;PBS_Server;LOG_ERROR::Cannot allocate memory (12) in DIS_tcp_setup, calloc failure
>
> (the above message repeats several times)
>
> [root at mgmt server_logs]# free
>              total       used       free     shared    buffers     cached
> Mem:      32842192   30798872    2043320          0     347636   28189072
> -/+ buffers/cache:    2262164   30580028
> Swap:     16383992       2040   16381952

--
Clotho Tsang
Senior Software Engineer
Cluster Technology Limited
Email: clotho at clustertech.com
Tel: (852) 2655-6129
Fax: (852) 2994-2101
Website: www.clustertech.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130129/007cc4c2/attachment.html

From kit at byu.net Tue Jan 29 11:07:58 2013
From: kit at byu.net (Kit Menlove)
Date: Tue, 29 Jan 2013 12:07:58 -0600
Subject: [torqueusers] Issue copying OU file
Message-ID: <001801cdfe4b$95c461d0$c14d2570$@byu.net>

Hi all,

I'm using a cluster that uses Torque as the batch system. About half of the
time, checkpointing with DMTCP fails while copying the temporary output
buffer/file:

cp -f /var/spool/torque/spool/jobid.myserver.OU /checkpoint_dir/ckpt_myprog_52b886013bb1c112-27763-51060104_files/jobid.myserver.OU_99001

I'm using dmtcp_checkpoint (v1.2.6) with the --checkpoint-open-files option.
All I know is that the copy command fails, not why (though I know the
destination directory exists and it does work about half the time). Can
anyone explain why the OU file might not exist at the time of checkpointing,
or what else might be the cause of the failure?

Thanks,
Kit
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130129/48b87fc2/attachment-0001.html

From j.blank at fz-juelich.de Tue Jan 29 11:46:46 2013
From: j.blank at fz-juelich.de (Joerg Blank)
Date: Tue, 29 Jan 2013 19:46:46 +0100
Subject: [torqueusers] Torque 4.1.4: Array job dependencies
In-Reply-To:
References:
Message-ID:

Hello everyone,

We still have problems with this. Is there anything we can do to help
debugging?

Regards,
Joerg Blank

On 11.01.2013 19:54, Joerg Blank wrote:
> Hello everyone,
>
> One of my users tried to use this script:
>
> i=$(qsub sleeper1.sh)
> qsub -t 0-1 -W depend=afterany:$i sleeper2.sh
>
> This fails (silently to the user) after sleeper1.sh finished with the
> log message:
> 1/11/2013 19:43:50;0080;PBS_Server.697;Req;req_reject;Reject reply
> code=15004(Invalid request MSG=Arrays may only be given array
> dependencies), aux=0, type=RegisterDependency, from @cluster
>
> I do not see why an array job should not depend on a single job, as this
> worked fine on 2.5.x
>
> Regards,
> Jörg Blank
>

From dbeer at adaptivecomputing.com Tue Jan 29 11:49:44 2013
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 29 Jan 2013 11:49:44 -0700
Subject: [torqueusers] Torque 4.1.4: Array job dependencies
In-Reply-To:
References:
Message-ID:

Joerg,

Sorry for the lack of reply. Can you create a bugzilla on this? I can
reproduce your issue, so no further information is needed, just a bugzilla to
track it. We will try to get this fixed in 4.1.5.

David

On Tue, Jan 29, 2013 at 11:46 AM, Joerg Blank wrote:

> Hello everyone,
>
> We still have problems with this. Is there anything we can do to help
> debugging?
>
> Regards,
> Joerg Blank
>
> On 11.01.2013 19:54, Joerg Blank wrote:
> > Hello everyone,
> >
> > One of my users tried to use this script:
> >
> > i=$(qsub sleeper1.sh)
> > qsub -t 0-1 -W depend=afterany:$i sleeper2.sh
> >
> > This fails (silently to the user) after sleeper1.sh finished with the
> > log message:
> > 1/11/2013 19:43:50;0080;PBS_Server.697;Req;req_reject;Reject reply
> > code=15004(Invalid request MSG=Arrays may only be given array
> > dependencies), aux=0, type=RegisterDependency, from @cluster
> >
> > I do not see why an array job should not depend on a single job, as this
> > worked fine on 2.5.x
> >
> > Regards,
> > Jörg Blank
> >
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

--
David Beer | Senior Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130129/6a4c79ef/attachment.html

From wytsang at clustertech.com Tue Jan 29 23:48:45 2013
From: wytsang at clustertech.com (Clotho Tsang)
Date: Wed, 30 Jan 2013 14:48:45 +0800
Subject: [torqueusers] torque server on a virtual machine
In-Reply-To: <50FDAD43.2020306@uni-konstanz.de>
References: <50FDAD43.2020306@uni-konstanz.de>
Message-ID:

I also run Torque on KVM. It works.
Check your /etc/hosts or DNS to see whether there is a wrong mapping between
hostnames and IPs.

On 22 January 2013 05:04, Thomas Exner wrote:

> Dear torque users and developers:
>
> I have a very strange problem. I set up a kernel-based virtual machine
> (KVM) on an OpenSuse 12.1 system. This virtual machine should be used as
> the torque server for a small cluster.
Installation went fine but when
> the first client tries to connect, I get thousands of these error messages:
> PBS_Server;Svr;PBS_Server;LOG_ERROR::is_request, bad attempt to connect
> from 10.0.2.2:1023 (address not trusted - check entry in
> server_priv/nodes)
>
> The funny thing is that the IP address is not the IP of the client
> (10.0.2.11) but of the host running the virtual machine (10.0.2.2). Thus
> it seems that torque thinks that the request is coming from the
> KVM-server host and not the torque client. But there is definitely no
> pbs_mom running on 10.0.2.2. If 10.0.2.2 is added to the nodes file,
> pbsnodes shows this machine but with the data of 10.0.2.11. This has
> definitely something to do with the network setup to get KVM to work.
> For this a network bridge is needed. Does anyone know if this bridge is
> the problem and if I can configure it to send the original IP?
> Unfortunately, my knowledge about such a setting is very limited. But
> everything except torque seems to run nicely. Any other idea?
>
> Thank you very much.
> Thomas
>
> --
>
> ________________________________________________________________________________
>
> Dr. Thomas E.
Exner
> Theoretische Pharmazeutische Chemie & Biophysik
> Lehrstuhl Pharmazeutische Chemie
> Pharmazeutisches Institut
> Eberhard Karls Universität Tübingen
> Auf der Morgenstelle 8 (Haus B)
> 72076 Tübingen
> Germany
>
> Tel.: +49-(0)7071-2976969
> Mobil: +49-(0)171-3807485
> Fax: +49-(0)7071-295637
> E-Mail: Thomas.Exner[at]uni-tuebingen.de
>
> Fachbereich Chemie und Zukunftskolleg
> Universität Konstanz
> 78457 Konstanz
> Germany
>
> Tel.: +49-(0)7531-882015
> Fax: +49-(0)7531-883587
> E-Mail: Thomas.Exner[at]uni-konstanz.de
>
> ________________________________________________________________________________
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

--
Clotho Tsang
Senior Software Engineer
Cluster Technology Limited
Email: clotho at clustertech.com
Tel: (852) 2655-6129
Fax: (852) 2994-2101
Website: www.clustertech.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130130/0db38b37/attachment.html

From andre.gemuend at scai.fraunhofer.de Wed Jan 30 01:03:24 2013
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Wed, 30 Jan 2013 09:03:24 +0100 (CET)
Subject: [torqueusers] torque server on a virtual machine
In-Reply-To: <50FDAD43.2020306@uni-konstanz.de>
Message-ID: <157445097.250463.1359533004610.JavaMail.root@scai.fraunhofer.de>

Hi Thomas,

you are probably using NAT forwarding (aka virbr0). See here how to configure
the bridge:
http://wiki.libvirt.org/page/Networking#Bridged_networking_.28aka_.22shared_physical_device.22.29

Greetings
André

----- Original Message -----
> Dear torque users and developers:
>
> I have a very strange problem. I set up a kernel-based virtual machine
> (KVM) on an OpenSuse 12.1 system. This virtual machine should be used as
> the torque server for a small cluster.
Installation went fine but when
> the first client tries to connect, I get thousands of these error messages:
> PBS_Server;Svr;PBS_Server;LOG_ERROR::is_request, bad attempt to connect
> from 10.0.2.2:1023 (address not trusted - check entry in
> server_priv/nodes)
>
> The funny thing is that the IP address is not the IP of the client
> (10.0.2.11) but of the host running the virtual machine (10.0.2.2). Thus
> it seems that torque thinks that the request is coming from the
> KVM-server host and not the torque client. But there is definitely no
> pbs_mom running on 10.0.2.2. If 10.0.2.2 is added to the nodes file,
> pbsnodes shows this machine but with the data of 10.0.2.11. This has
> definitely something to do with the network setup to get KVM to work.
> For this a network bridge is needed. Does anyone know if this bridge is
> the problem and if I can configure it to send the original IP?
> Unfortunately, my knowledge about such a setting is very limited. But
> everything except torque seems to run nicely. Any other idea?
>
> Thank you very much.
> Thomas
>
> --
> ________________________________________________________________________________
>
> Dr. Thomas E. Exner
> Theoretische Pharmazeutische Chemie & Biophysik
> Lehrstuhl Pharmazeutische Chemie
> Pharmazeutisches Institut
> Eberhard Karls Universität Tübingen
> Auf der Morgenstelle 8 (Haus B)
> 72076 Tübingen
> Germany
>
> Tel.: +49-(0)7071-2976969
> Mobil: +49-(0)171-3807485
> Fax: +49-(0)7071-295637
> E-Mail: Thomas.Exner[at]uni-tuebingen.de
>
> Fachbereich Chemie und Zukunftskolleg
> Universität Konstanz
> 78457 Konstanz
> Germany
>
> Tel.: +49-(0)7531-882015
> Fax: +49-(0)7531-883587
> E-Mail: Thomas.Exner[at]uni-konstanz.de
> ________________________________________________________________________________
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
André
Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

From nt_mahmood at yahoo.com Wed Jan 30 04:41:12 2013
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Wed, 30 Jan 2013 03:41:12 -0800 (PST)
Subject: [torqueusers] searcing for the directory of running job
Message-ID: <1359546072.73420.YahooMailNeo@web163005.mail.bf1.yahoo.com>

Dear all,
How can I see the working directory of a running job on a computing node,
based on the job ID? In condor, there is a command "condor_ssh_to_job ". Is
there something similar in torque?

Regards,
Mahmood
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130130/493f7873/attachment.html

From diego.bacchin at bmr-genomics.it Wed Jan 30 05:29:06 2013
From: diego.bacchin at bmr-genomics.it (diego bacchin)
Date: Wed, 30 Jan 2013 13:29:06 +0100
Subject: [torqueusers] searcing for the directory of running job
In-Reply-To: <1359546072.73420.YahooMailNeo@web163005.mail.bf1.yahoo.com>
References: <1359546072.73420.YahooMailNeo@web163005.mail.bf1.yahoo.com>
Message-ID: <51091212.2030709@bmr-genomics.it>

Hi,

qstat -f jobid | grep init_work_dir

Bye

Diego Bacchin
IT System Administrator at
BMR Genomics srl - Via Redipuglia, 19 - PADOVA (PD) - Italy
CRIBI - University of Padova - Via U. Bassi, 58 - PADOVA (PD) - Italy
diego at bmr-genomics.it - diego.bacchin at cribi.unipd.it
366 72 97 232

On 30/01/2013 12:41, Mahmood Naderan wrote:
> Dear all,
> How can I see the working directory of the running job on a a
> computing node based on the job ID. In condor, there is a command "
> ". Is there something similar in torque?
>
> Regards,
> Mahmood
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From nt_mahmood at yahoo.com Wed Jan 30 07:16:54 2013
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Wed, 30 Jan 2013 06:16:54 -0800 (PST)
Subject: [torqueusers] searcing for the directory of running job
In-Reply-To: <51091212.2030709@bmr-genomics.it>
References: <1359546072.73420.YahooMailNeo@web163005.mail.bf1.yahoo.com> <51091212.2030709@bmr-genomics.it>
Message-ID: <1359555414.70737.YahooMailNeo@web163002.mail.bf1.yahoo.com>

There is no such variable

mahmood at orca:~$ qstat 92285
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
92285.hpclab              ...3-AG-l2pf     mahmood         01:43:31 R orcaq
mahmood at orca:~$ qstat -f 92285.hpclab | grep init_work_dir
qstat: illegally formed job identifier: 92285.hpclab
mahmood at orca:~$ qstat -f 92285 | grep init_work_dir
mahmood at orca:~$

Regards,
Mahmood

________________________________
From: diego bacchin
To: torqueusers at supercluster.org
Sent: Wednesday, January 30, 2013 3:59 PM
Subject: Re: [torqueusers] searcing for the directory of running job

Hi,

qstat -f jobid | grep init_work_dir

Bye

Diego Bacchin
IT System Administrator at
BMR Genomics srl - Via Redipuglia, 19 - PADOVA (PD) - Italy
CRIBI - University of Padova - Via U. Bassi, 58 - PADOVA (PD) - Italy
diego at bmr-genomics.it - diego.bacchin at cribi.unipd.it
366 72 97 232

On 30/01/2013 12:41, Mahmood Naderan wrote:
> Dear all,
> How can I see the working directory of the running job on a a
> computing node based on the job ID. In condor, there is a command "
> ". Is there something similar in torque?
>
> Regards,
> Mahmood
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130130/6666bc41/attachment.html

From cpaschoulas at gmail.com Wed Jan 30 07:51:37 2013
From: cpaschoulas at gmail.com (Chrysovalantis Paschoulas)
Date: Wed, 30 Jan 2013 15:51:37 +0100
Subject: [torqueusers] searcing for the directory of running job
In-Reply-To: <1359555414.70737.YahooMailNeo@web163002.mail.bf1.yahoo.com>
References: <1359546072.73420.YahooMailNeo@web163005.mail.bf1.yahoo.com> <51091212.2030709@bmr-genomics.it> <1359555414.70737.YahooMailNeo@web163002.mail.bf1.yahoo.com>
Message-ID:

try this:

$ qstat -f | grep -i workdir

or just run

$ qstat -f

and search for the work dir..
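The grep suggested here can be wrapped in a small helper. This is a sketch under assumptions, not a command from the thread: it assumes the job's `Variable_List` in `qstat -f` output contains `PBS_O_WORKDIR` (the directory `qsub` was run from, which is not necessarily where the job is working now), that folded continuation lines begin with a tab, and that the path contains no commas. The name `job_workdir` is made up for illustration.

```shell
# Print the PBS_O_WORKDIR value found in "qstat -f" style text on stdin.
# Hypothetical usage: qstat -f 92285 | job_workdir
job_workdir() {
    awk '
        # A line starting with a tab continues the previous attribute.
        /^\t/ { sub(/^\t/, ""); buf = buf $0; next }
        { if (buf != "") print buf; buf = $0 }
        END { if (buf != "") print buf }
    ' |
    tr "," "\n" |                        # one Variable_List entry per line
    sed -n "s/^.*PBS_O_WORKDIR=//p" |    # keep only the value of that entry
    head -n 1
}
```

If nothing is printed, the job record has no `PBS_O_WORKDIR` entry in the text that was piped in.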
Best regards,
chrys

On Wed, Jan 30, 2013 at 3:16 PM, Mahmood Naderan wrote:

> There is no such variable
>
> mahmood at orca:~$ qstat 92285
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 92285.hpclab              ...3-AG-l2pf     mahmood         01:43:31 R orcaq
>
> mahmood at orca:~$ qstat -f 92285.hpclab | grep init_work_dir
> qstat: illegally formed job identifier: 92285.hpclab
>
> mahmood at orca:~$ qstat -f 92285 | grep init_work_dir
> mahmood at orca:~$
>
> Regards,
> Mahmood
>
> ------------------------------
> From: diego bacchin
> To: torqueusers at supercluster.org
> Sent: Wednesday, January 30, 2013 3:59 PM
> Subject: Re: [torqueusers] searcing for the directory of running job
>
> Hi,
>     qstat -f jobid | grep init_work_dir
> Bye
>
> Diego Bacchin
> IT System Administrator at
>   BMR Genomics srl - Via Redipuglia, 19 - PADOVA (PD) - Italy
>   CRIBI - University of Padova - Via U. Bassi, 58 - PADOVA (PD) - Italy
> diego at bmr-genomics.it - diego.bacchin at cribi.unipd.it
> 366 72 97 232
>
> On 30/01/2013 12:41, Mahmood Naderan wrote:
> > Dear all,
> > How can I see the working directory of the running job on a
> > computing node based on the job ID. In condor, there is a command "
> > ". Is there something similar in torque?
> >
> > Regards,
> > Mahmood

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130130/c2910c21/attachment.html

From knielson at adaptivecomputing.com Wed Jan 30 10:23:40 2013
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Wed, 30 Jan 2013 10:23:40 -0700
Subject: [torqueusers] error messages
In-Reply-To: <009301cdfd62$f53b02c0$dfb10840$@if.usp.br>
References: <009301cdfd62$f53b02c0$dfb10840$@if.usp.br>
Message-ID:

Luiz,

The stdout and stderr files end up in the undelivered directory when the
MOM the job was running on could not successfully scp the files back to
the submit host.

Ken

On Mon, Jan 28, 2013 at 7:22 AM, Luiz Carlos dos Santos wrote:

> My system has put error messages of jobs in the
> "/var/spool/torque/undelivered", in the node where the job is running,
> instead of the local directory where the job is installed. Please, how can
> I solve this problem.
>
> Thanks,
>
> Luiz Carlos dos Santos
> Systems Analyst - IFUSP/FMT
> Instituto de Física da USP
> Departamento de Física dos Materiais e Mecânica
> Pça. do Oceanográfico - Trav E, s/nº
> Edifício Alessandro Volta, Bloco C - sala 112
> CEP 05508-120 - São Paulo SP
> Phone: (11) 3091-6784 / Fax: (11) 3091-6831
> E-mail: luiz at if.usp.br

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130130/6487359d/attachment.html

From diego.bacchin at bmr-genomics.it Wed Jan 30 14:18:03 2013
From: diego.bacchin at bmr-genomics.it (Diego Bacchin)
Date: Wed, 30 Jan 2013 22:18:03 +0100
Subject: [torqueusers] searcing for the directory of running job
In-Reply-To:
References: <1359546072.73420.YahooMailNeo@web163005.mail.bf1.yahoo.com>
 <51091212.2030709@bmr-genomics.it>
 <1359555414.70737.YahooMailNeo@web163002.mail.bf1.yahoo.com>
Message-ID:

Post us the output of
qstat -f jobid

--
Diego Bacchin

On 30 Jan 2013, at 15:51, Chrysovalantis Paschoulas wrote:

> try this:
>
> $ qstat -f | grep -i workdir
>
> or just run
>
> $ qstat -f
>
> and search for the work dir..
>
> best regards,
> chrys
>
>
> On Wed, Jan 30, 2013 at 3:16 PM, Mahmood Naderan wrote:
>> There is no such variable
>>
>> mahmood at orca:~$ qstat 92285
>> Job id                    Name             User            Time Use S Queue
>> ------------------------- ---------------- --------------- -------- - -----
>> 92285.hpclab              ...3-AG-l2pf     mahmood         01:43:31 R orcaq
>>
>> mahmood at orca:~$ qstat -f 92285.hpclab | grep init_work_dir
>> qstat: illegally formed job identifier: 92285.hpclab
>>
>> mahmood at orca:~$ qstat -f 92285 | grep init_work_dir
>> mahmood at orca:~$
>>
>> Regards,
>> Mahmood
>>
>> From: diego bacchin
>> To: torqueusers at supercluster.org
>> Sent: Wednesday, January 30, 2013 3:59 PM
>> Subject: Re: [torqueusers] searcing for the directory of running job
>>
>> Hi,
>> qstat -f jobid | grep init_work_dir
>> Bye
>>
>> Diego Bacchin
>> IT System Administrator at
>> BMR Genomics srl - Via Redipuglia, 19 - PADOVA (PD) - Italy
>> CRIBI - University of Padova - Via U.
Bassi, 58 - PADOVA (PD) - Italy
>> diego at bmr-genomics.it - diego.bacchin at cribi.unipd.it
>> 366 72 97 232
>>
>> On 30/01/2013 12:41, Mahmood Naderan wrote:
>> > Dear all,
>> > How can I see the working directory of the running job on a
>> > computing node based on the job ID. In condor, there is a command "
>> > ". Is there something similar in torque?
>> >
>> > Regards,
>> > Mahmood

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130130/39a54dd4/attachment-0001.html

From nt_mahmood at yahoo.com Thu Jan 31 00:39:07 2013
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Wed, 30 Jan 2013 23:39:07 -0800 (PST)
Subject: [torqueusers] searcing for the directory of running job
In-Reply-To:
References: <1359546072.73420.YahooMailNeo@web163005.mail.bf1.yahoo.com>
 <51091212.2030709@bmr-genomics.it>
 <1359555414.70737.YahooMailNeo@web163002.mail.bf1.yahoo.com>
Message-ID: <1359617947.86616.YahooMailNeo@web163004.mail.bf1.yahoo.com>

Below is the output of "qstat -f". Please note that I am not looking for
PBS_O_WORKDIR. That is the working directory from which I ran qsub.
What I want to find is the temporary directory on the computing node which
is running the executable. Assume:
1- I compile the program
2- qsub the program
3- while '2' is running, I modify the code
4- compile the code
5- qsub the program again

Now 2 instances of my program are running; however, they are independent. So
torque should have copied the executables somewhere on the computing nodes
to provide this independence. I want to find that location.

root at orca:/home/mahmood# qstat -f 92322
Job Id: 92322.hpclab.orca
    Job_Name = sample_machine-SET2-4core-MILC-LESLIED3D-GEMSFDTD-LBM.simics
    Job_Owner = ali at hpclab.orca
    resources_used.cput = 12:54:55
    resources_used.mem = 11203852kb
    resources_used.vmem = 11568104kb
    resources_used.walltime = 09:35:49
    job_state = R
    queue = orcaq
    server = hpclab.orca
    Checkpoint = u
    ctime = Thu Jan 31 01:25:32 2013
    Error_Path = hpclab.orca:/home/ali/lu/work_update/sample_machine-S
        ET2-4core-MILC-LESLIED3D-GEMSFDTD-LBM.simics.e92322
    exec_host = ws05/6
    exec_port = 15003
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Thu Jan 31 01:25:33 2013
    Output_Path = orca:/home/ali/lu/work_update/tor-reports/SET2-4core
        -MILC-LESLIED3D-GEMSFDTD-LBM.simics.tor.rep
    Priority = 0
    qtime = Thu Jan 31 01:25:32 2013
    Rerunable = True
    Resource_List.neednodes = ws05
    Resource_List.nodect = 1
    Resource_List.nodes = ws05
    Resource_List.walltime = 960:00:00
    session_id = 24161
    substate = 42
    Variable_List = PBS_O_QUEUE=orcaq,PBS_O_HOME=/home/ali,
        PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=ali,
        PBS_O_PATH=/home/ali/lu/msim/bin:/home/ali/lu/workdir
        :/home/ali/lu/msim/bin:/home/ali/lu/workdir:/opt/mpich
        2/bin:/opt/mpiexec/bin:/opt/mpich2/bin:/opt/mpiexec/bin:/usr/local/mau
        i/sbin:/usr/local/maui/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/u
        sr/bin:/sbin:/bin:/usr/bin/X11:/usr/games,
        PBS_O_MAIL=/var/mail/ali,PBS_O_SHELL=/bin/bash,
        PBS_O_HOST=hpclab.orca,PBS_SERVER=hpclab.orca,
        PBS_O_WORKDIR=/home/ali/lu/work_update,SHELL=/bin/bash,
        TERM=xterm,
        XDG_SESSION_COOKIE=de692068037f222a35a95f874dc6aed9-1359565750.805643
        -944328263,SSH_CLIENT=213.233.182.203 9134 22,SSH_TTY=/dev/pts/5,
        USER=ali,
        LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=
        40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:o
        w=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01
        ;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=0
        1;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31
        :*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;3
        1:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.ace=01
        ;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=
        01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tg
        a=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*
        .svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;3
        5:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=
        01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wm
        v=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.
        fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.
        yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:
        *.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;
        36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00
        ;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:,
        PATH=/home/ali/lu/msim/bin:/home/ali/lu/workdir:/home
        /ali/lu/msim/bin:/home/ali/lu/workdir:/opt/mpich2/bin:
        /opt/mpiexec/bin:/opt/mpich2/bin:/opt/mpiexec/bin:/usr/local/maui/sbin
        :/usr/local/maui/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
        :/sbin:/bin:/usr/bin/X11:/usr/games,MAIL=/var/mail/ali,
        SIMICS_INSTALL=/home/ali/gems/simics-3.0.30,
        PWD=/home/ali/lu/work_update,LANG=en_US.UTF-8,
        GEMS=/home/ali/gems,HOME=/home/ali,SHLVL=2,
        BASH_ENV=~/.profile,LOGNAME=ali,PYTHONPATH=./modules,
        SSH_CONNECTION=213.233.182.203 9134 194.225.69.105 22,
        LESSOPEN=| /usr/bin/lesspipe %s,SIMICS_EXTRA_LIB=./modules,
        LESSCLOSE=/usr/bin/lesspipe %s %s,_=/usr/local/bin/qsub
    euser = ali
    egroup = users
    hashname = 92322.hpclab.orca
    queue_rank = 27317
    queue_type = E
    etime = Thu Jan 31 01:25:32 2013
    submit_args = tor-files/SET2-4core-MILC-LESLIED3D-GEMSFDTD-LBM.simics.tor
    start_time = Thu Jan 31 01:25:33 2013
    Walltime.Remaining = 3421384
    start_count = 1
    fault_tolerant = False

Regards,
Mahmood

________________________________
From: Diego Bacchin
To: Torque Users Mailing List
Cc: Mahmood Naderan ; Torque Users Mailing List
Sent: Thursday, January 31, 2013 12:48 AM
Subject: Re: [torqueusers] searcing for the directory of running job

Post us the output of
qstat -f jobid

--
Diego Bacchin

On 30 Jan 2013, at 15:51, Chrysovalantis Paschoulas wrote:

>try this:
>
>$ qstat -f | grep -i workdir
>
>or just run
>
>$ qstat -f
>
>and search for the work dir..
>
>best regards,
>chrys
>
>
>On Wed, Jan 30, 2013 at 3:16 PM, Mahmood Naderan wrote:
>
>>There is no such variable
>>
>>mahmood at orca:~$ qstat 92285
>>Job id                    Name             User            Time Use S Queue
>>------------------------- ---------------- --------------- -------- - -----
>>92285.hpclab              ...3-AG-l2pf     mahmood
01:43:31 R orcaq
>>
>>mahmood at orca:~$ qstat -f 92285.hpclab | grep init_work_dir
>>qstat: illegally formed job identifier: 92285.hpclab
>>
>>mahmood at orca:~$ qstat -f 92285 | grep init_work_dir
>>mahmood at orca:~$
>>
>>Regards,
>>Mahmood
>>
>>________________________________
>>From: diego bacchin
>>To: torqueusers at supercluster.org
>>Sent: Wednesday, January 30, 2013 3:59 PM
>>Subject: Re: [torqueusers] searcing for the directory of running job
>>
>>Hi,
>>    qstat -f jobid | grep init_work_dir
>>Bye
>>
>>Diego Bacchin
>>IT System Administrator at
>>  BMR Genomics srl - Via Redipuglia, 19 - PADOVA (PD) - Italy
>>  CRIBI - University of Padova - Via U. Bassi, 58 - PADOVA (PD) - Italy
>>diego at bmr-genomics.it - diego.bacchin at cribi.unipd.it
>>366 72 97 232
>>
>>On 30/01/2013 12:41, Mahmood Naderan wrote:
>>> Dear all,
>>> How can I see the working directory of the running job on a
>>> computing node based on the job ID. In condor, there is a command "
>>> ". Is there something similar in torque?
>>>
>>> Regards,
>>> Mahmood

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130130/f7cfc818/attachment.html

From mej at lbl.gov Thu Jan 31 10:39:01 2013
From: mej at lbl.gov (Michael Jennings)
Date: Thu, 31 Jan 2013 09:39:01 -0800
Subject: [torqueusers] searcing for the directory of running job
In-Reply-To: <1359617947.86616.YahooMailNeo@web163004.mail.bf1.yahoo.com>
References: <1359546072.73420.YahooMailNeo@web163005.mail.bf1.yahoo.com>
 <51091212.2030709@bmr-genomics.it>
 <1359555414.70737.YahooMailNeo@web163002.mail.bf1.yahoo.com>
 <1359617947.86616.YahooMailNeo@web163004.mail.bf1.yahoo.com>
Message-ID: <20130131173900.GR8827@lbl.gov>

On Wednesday, 30 January 2013, at 23:39:07 (-0800),
Mahmood Naderan wrote:

> Below is the output of "qstat -f". Please note that I am not looking for PBS_O_WORKDIR. That is the working directory which I ran qsub. What I want to find is the temporary directory on the computing node which is running the executable. Assume,
> 1- I compile the program
> 2- qsub the program
> 3- while '2' is running, I modify the code
> 4- compile the code
> 5- qsub the program again
>
> Now 2 instances of my program are running however they are independent. So torque should have copied the executables somewhere on the computing nodes to provide this independence. I want to find that location.

TORQUE copies job scripts but does NOT copy executables. The Linux
kernel on the compute node(s) keeps the text and data segments of the
executable in memory even after it is overwritten.

lsof -p <pid> is very informative in this regard.
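
Much of the same information is also available directly under /proc on Linux. A minimal sketch, assuming a Linux compute node; the helper names proc_cwd and proc_exe are made up for illustration, and using the job's session_id (from qstat -f) as the PID on the execution host is an assumption about which process is of interest:

```shell
#!/bin/sh
# Sketch: inspect a running process via /proc (Linux only).
# The PID would typically be taken from the job's session_id as reported by
# "qstat -f <jobid>", and looked up on the node named in exec_host.

# Print the current working directory of process $1.
proc_cwd() {
    readlink "/proc/$1/cwd"
}

# Print the path of the executable that process $1 is running.
# If that file was unlinked after the process started, the kernel
# appends " (deleted)" to the path.
proc_exe() {
    readlink "/proc/$1/exe"
}

# Example: inspect the current shell itself.
echo "cwd: $(proc_cwd $$)"
echo "exe: $(proc_exe $$)"
```

For the job shown earlier in the thread this would be something like "proc_cwd 24161" on ws05, run as root or as the job owner, since other users cannot read these links.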
Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From ajdecon at ajdecon.org Wed Jan 30 17:33:29 2013
From: ajdecon at ajdecon.org (Adam DeConinck)
Date: Wed, 30 Jan 2013 16:33:29 -0800
Subject: [torqueusers] Job fails because prologue.user doesn't exist
Message-ID:

Hey all,

Running into an odd issue with Torque 4.1.0 and wondering if anyone has
ideas.

I recently built a new image for our compute nodes, with the only change
being (AFAICT) installing the newest Mellanox OFED. The Torque RPMs are
the same in both images. However, nodes which boot into this image fail to
start any jobs which run on them, for reasons that don't seem to relate to
the IB. Instead, the logs seem to indicate that they are failing because
of the lack of a mom_priv/prologue.user file.

This confuses me because (a) we have never had such a file, and (b) if I
go in and create empty prologue.user and epilogue.user scripts, it still
fails with the exact same messages!

>From mom_log:

01/30/2013 16:20:12;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.1.0, loglevel = 0
01/30/2013 16:22:48;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry (see syslog for more information)
01/30/2013 16:22:48;0001; pbs_mom;Job;7408.wwmaster.psg.cluster.zone;ALERT: job failed phase 3 start
01/30/2013 16:22:48;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 7408.wwmaster.psg.cluster.zone
01/30/2013 16:22:48;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::run_pelog, prolog/epilog failed, file: /var/spool/torque/mom_priv/epilogue.user, exit: 13, cannot stat
01/30/2013 16:22:48;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::run_epilogues, user epilog failed - interactive job
01/30/2013 16:22:48;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
01/30/2013 16:22:48;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
01/30/2013 16:22:48;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
01/30/2013 16:22:48;0080; pbs_mom;Job;7408.wwmaster.psg.cluster.zone;obit sent to server

>From syslog:

Jan 30 16:22:48 wm030 pbs_mom: LOG_ERROR::run_pelog, prolog/epilog failed, file: /var/spool/torque/mom_priv/prologue.user, exit: 13, cannot stat
Jan 30 16:22:48 wm030 pbs_mom: LOG_ERROR::handle_prologs, user prolog failed
Jan 30 16:22:48 wm030 pbs_mom: LOG_ERROR::run_pelog, prolog/epilog failed, file: /var/spool/torque/mom_priv/epilogue.user, exit: 13, cannot stat
Jan 30 16:22:48 wm030 pbs_mom: LOG_ERROR::run_epilogues, user epilog failed - interactive job
Jan 30 16:22:48 wm030 pbs_mom: LOG_ERROR::run_pelog, prolog/epilog failed, file: /var/spool/torque/mom_priv/epilogue.user, exit: 13, cannot stat
Jan 30 16:22:48 wm030 pbs_mom: LOG_ERROR::preobit_reply, user epilog failed - interactive job

It's worth noting that in the old image, the prologue.user and
epilogue.user messages are not even printed; the job just works.

Any ideas? Thank you!
Adam

From shantanugadgil at yahoo.com Thu Jan 31 00:48:45 2013
From: shantanugadgil at yahoo.com (Shantanu Gadgil)
Date: Wed, 30 Jan 2013 23:48:45 -0800 (PST)
Subject: [torqueusers] submit job as root in TORQUE 4
Message-ID: <1359618525.56178.YahooMailClassic@web141005.mail.bf1.yahoo.com>

Hi,

I have a few questions regarding job submission as root on CentOS 6 x86_64:

*) Is it possible *at all* to submit jobs as root in TORQUE 4.xx?
   I keep getting the message 'root' is not allowed ...
   The last version I checked was 4.1.x (I don't remember exactly which one).

*) Also, I think (Ken) replied that there was a known issue with how CentOS
   uses getaddrinfo (or maybe getnameinfo) which causes this problem.
   Is there a workaround for this on CentOS?

I know I have asked this before; I was just wondering if this is planned at
all, or should I just "give up" :) :) on the 'submit as root' thought!

Thanks and Regards,
Shantanu

From stucki at mi.fu-berlin.de Thu Jan 31 14:07:53 2013
From: stucki at mi.fu-berlin.de (Christoph (Stucki) von Stuckrad)
Date: Thu, 31 Jan 2013 22:07:53 +0100
Subject: [torqueusers] searcing for the directory of running job
In-Reply-To: <20130131173900.GR8827@lbl.gov>
References: <1359546072.73420.YahooMailNeo@web163005.mail.bf1.yahoo.com>
 <51091212.2030709@bmr-genomics.it>
 <1359555414.70737.YahooMailNeo@web163002.mail.bf1.yahoo.com>
 <1359617947.86616.YahooMailNeo@web163004.mail.bf1.yahoo.com>
 <20130131173900.GR8827@lbl.gov>
Message-ID: <20130131210753.GU3863@localhost.mi.fu-berlin.de>

On Thu, 31 Jan 2013, Michael Jennings wrote:

> TORQUE copies job scripts but does NOT copy executables. The Linux
> kernel on the compute node(s) keeps the text and data segments of the
> executable in memory even after it is overwritten.

This depends on how you submit things.
Nobody keeps you from explicitly sending the executables together
with the job into the job's spool area via the 'stagein=...files...' statement.
(which would be the preferred way to run a binary which must be
compiled for each job and will change while jobs run).

But if you work with a shared filesystem for all the cluster nodes,
the behavior of the nodes can be erratic if you switch binaries
on running processes. File deletes on a local disk and via NFS
have different semantics. The moment at which the node notices that
a shared binary has been changed may vary (we've seen latencies of
more than 15 minutes in Linux NFS: after the file was changed on the
server, the client did not notice the change for over 15 minutes!).
Also, the dynamic loading of shared libraries may bring different
'compiles' together (an old binary keeps running but later loads a
newer library, which may have side effects).

So we warn our users NEVER to change a binary while a job is still
running it, and instead to install different variants in different places.

Stucki

--
Christoph von Stuckrad      * * |nickname |Mail \
Freie Universitaet Berlin   |/_*|'stucki' |Tel(Mo.,Mi.):+49 30 838-75 459|
Mathematik & Informatik EDV |\ *|if online| (Di,Do,Fr):+49 30 77 39 6600|
Takustr. 9 / 14195 Berlin   * * |on IRCnet|Fax(home): +49 30 77 39 6601/

From pc at pcable.net Thu Jan 31 16:14:41 2013
From: pc at pcable.net (Patrick Cable)
Date: Thu, 31 Jan 2013 18:14:41 -0500
Subject: [torqueusers] Bug reporting location
Message-ID:

I created an issue report at
https://github.com/adaptivecomputing/torque/issues/27 on the torque github
page, as I was told (
https://twitter.com/AdaptiveMoab/status/288900453696147456) that Github was
the proper location for that kind of thing.

Should I submit it somewhere else to have someone look at it?

- Patrick

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130131/97d40603/attachment.html

From dbeer at adaptivecomputing.com Thu Jan 31 16:36:49 2013
From: dbeer at adaptivecomputing.com (David Beer)
Date: Thu, 31 Jan 2013 16:36:49 -0700
Subject: [torqueusers] Bug reporting location
In-Reply-To:
References:
Message-ID:

Patrick,

This is where we want people to report things. We would like to transition
to reporting them there instead of bugzilla. It's always a good idea to
send an email to the users' list to get someone to look at it, though.

David

On Thu, Jan 31, 2013 at 4:14 PM, Patrick Cable wrote:

> I created an issue report at
> https://github.com/adaptivecomputing/torque/issues/27 on the torque
> github page, as I was told (
> https://twitter.com/AdaptiveMoab/status/288900453696147456) that Github
> was the proper location for that kind of thing.
>
> Should I submit it somewhere else to have someone look at it?
>
> - Patrick

--
David Beer | Senior Software Engineer
Adaptive Computing

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130131/07aa0902/attachment.html

From soubari at yahoo.com Thu Jan 31 21:24:45 2013
From: soubari at yahoo.com (Sam Oubari)
Date: Thu, 31 Jan 2013 20:24:45 -0800 (PST)
Subject: [torqueusers] Broken Scheduling Puzzle
In-Reply-To: <1337436056.8485.YahooMailNeo@web110608.mail.gq1.yahoo.com>
References: <1316662124.15523.YahooMailNeo@web110606.mail.gq1.yahoo.com>
 <1316707248.49614.YahooMailNeo@web110613.mail.gq1.yahoo.com>
 <1337436056.8485.YahooMailNeo@web110608.mail.gq1.yahoo.com>
Message-ID: <1359692685.93194.YahooMailNeo@web122501.mail.ne1.yahoo.com>

Hello,

I have a single but very busy PROD PBS 2.5.11 installation on an x86-64
server using a local pbs_sched; all components and clients are running on
the same server.

Currently, if I have jobs from the day before waiting for a future exec
date, and I qsub a new job with a future time, then qmove it to another
queue and qalter it to a more distant future date, then some of the waiting
jobs move to Q status at exec time but don't run. The moved job and all
other jobs run as they should. All the jobs on this server generally
execute simple scripts. Normally, since 2.4.x, this problem has shown up
rarely and randomly, but for about a week now I have been able to reproduce
it on demand, though not on my TEST server.

Any ideas what I should try?

Sam.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20130131/0a3caade/attachment.html