From Gareth.Williams at csiro.au Sun Apr 1 19:36:12 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Mon, 2 Apr 2012 11:36:12 +1000
Subject: [torqueusers] nodes file persistent gpus setting
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102DCC742DB@exvic-mbx04.nexus.csiro.au>

Hi,

Can anyone confirm the following behavior (bug)?

If you give a node gpus like so:

    qmgr -c 'set node gpunode01 gpus = 2'

or in the nodes file

    gpunode01 np=12 gpus=2

Then the node has (logical) gpus defined and they can be scheduled as in:
http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.5nodeconfig.php
(though 1.5.3 doesn't mention specifying both np= and gpus= which I suspect
needs fixing).

This setup works fine for us until we restart the pbs_server at which time
the gpus disappear (you can see this in the output of pbsnodes). The nodes
file gets altered to remove the gpus= setting.

Note that we are using version 3.0.3-snap.xxx and NOT the integrated nvidia
gpu support.

Does anyone else see the behavior? You don't need physical gpus to test,
just a system you are prepared to mess with a little including restarting
the pbs_server.

Regards,

Gareth

From samuel at unimelb.edu.au Sun Apr 1 20:28:49 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 02 Apr 2012 12:28:49 +1000
Subject: [torqueusers] cpuset on Torque 2.4.16, CentOS 6.2, AMD Opteron Bulldozer
In-Reply-To: <4F73A74D.7000105@ldeo.columbia.edu>
References: <4F73A74D.7000105@ldeo.columbia.edu>
Message-ID: <4F790EE1.7090905@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 29/03/12 11:05, Gus Correa wrote:

> Actually, most file names there seem to have benefited from this
> wonderful prefix "cpuset." Oh, well, innovation has no bounds .

It looks like RHEL6 defaults to mounting the cgroup filesystem
(not the cpuset one) at that location.

You should be able to mount the cpuset one there instead and
symlink it to where Torque is expecting it.
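
A rough sketch of that suggestion, not taken from the thread: the helper
below lists cpuset-capable mounts from a mounts table and reports whether
each one exposes unprefixed file names, then prints (without running them)
a mount and a symlink command. The function names, the mount directory and
/dev/cpuset as the path Torque looks in are all assumptions made for
illustration; check where your own Torque build expects cpusets first.

```sh
#!/bin/sh
# List cpuset-capable mounts found in a mounts table (default: /proc/mounts).
# Prints "<mount point> noprefix" when files appear without the "cpuset."
# prefix (a cpuset-type mount, or cgroup mounted with noprefix), otherwise
# "<mount point> prefixed".
cpuset_mounts() {
    awk '
        { opts = "," $4 "," }
        $3 == "cpuset" || ($3 == "cgroup" && opts ~ /,cpuset,/) {
            if ($3 == "cpuset" || opts ~ /,noprefix,/)
                print $2, "noprefix"
            else
                print $2, "prefixed"
        }
    ' "${1:-/proc/mounts}"
}

# Print (do not execute) the commands to mount a cpuset filesystem at
# directory $1 and to symlink path $2 to it. Both paths are caller-supplied.
plan_cpuset_setup() {
    echo "mount -t cpuset none $1"
    echo "ln -s $1 $2"
}

# Example run. /dev/cpuset as the location Torque expects is an assumption.
[ -r /proc/mounts ] && cpuset_mounts
plan_cpuset_setup /sys/fs/cgroup/cpuset /dev/cpuset
```

Nothing here changes the system; the printed commands would need root, and
anything already mounted at the chosen directory would have to be unmounted
first.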
- --
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk95DuAACgkQO2KABBYQAh/KjQCfZdekXPDWwdt+Qsc8oPV4plo/
rPAAn20c2oijK/8axMGaKoYnK1eMHsx+
=MqA0
-----END PGP SIGNATURE-----

From samuel at unimelb.edu.au Sun Apr 1 20:30:21 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Mon, 02 Apr 2012 12:30:21 +1000
Subject: [torqueusers] cpuset on Torque 2.4.16, CentOS 6.2, AMD Opteron Bulldozer
In-Reply-To: <1831500498.60711294.1333146486889.JavaMail.root@zm09.stanford.edu>
References: <1831500498.60711294.1333146486889.JavaMail.root@zm09.stanford.edu>
Message-ID: <4F790F3D.5000307@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 31/03/12 09:28, David Gabriel Simas wrote:

> mount -t cgroup -o cpuset,noprefix X /sys/fs/cgroup/cpuset

Or just:

    mount -t cpuset - /sys/fs/cgroup/cpuset

> However, torque doesn't seem to understand the difference between
> cores and hyperthreads. With cpusets enabled, a one processor job
> will be bound to a core, giving it two hyperthreads.

Just disable it in your BIOS/EFI - that's what we've been doing
since HT appeared on Intel clusters (at least on PowerPC you could
disable SMT at runtime).
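
Before rebooting a node into its BIOS/EFI setup, it can help to confirm
whether hyperthreading is actually active. A minimal sketch, assuming a
Linux node that exposes thread_siblings_list files under
/sys/devices/system/cpu; the function names are made up for illustration
and nothing here is part of Torque:

```sh
#!/bin/sh
# Succeed if a thread_siblings_list value names more than one CPU,
# e.g. "0,12" or "0-1". A lone number such as "3" means that core has
# a single hardware thread.
has_smt_siblings() {
    case $1 in
        *,*|*-*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Report SMT status by scanning per-CPU topology files. The sysfs root
# is a parameter so the function can be pointed at a copy of the tree.
smt_status() {
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/topology/thread_siblings_list; do
        [ -r "$f" ] || continue
        if has_smt_siblings "$(cat "$f")"; then
            echo "SMT active"
            return 0
        fi
    done
    echo "SMT not detected"
}

smt_status
```

If this reports "SMT active", a single-core cpuset will still contain two
hardware threads, which is the situation described in the quoted text.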

cheers,
Chris
- --
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk95Dz0ACgkQO2KABBYQAh8TRgCdHXtphic58cmPShwW3Gz8CqGJ
BDsAn0iHJulxDKsvg28aBd0PT5qjWErT
=Sggl
-----END PGP SIGNATURE-----

From stevenx.a.duchene at intel.com Mon Apr 2 09:58:21 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Mon, 2 Apr 2012 15:58:21 +0000
Subject: [torqueusers] torque svn access?
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC0123@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF805ABFE4C@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805ABFED5@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805ABFF3E@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805AC0073@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805AC0123@ORSMSX106.amr.corp.intel.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC0365@ORSMSX106.amr.corp.intel.com>

Ken:
Has there been any progress on getting the torque svn server up and available again?
--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120402/fe2298b7/attachment.html

From samuel at unimelb.edu.au Mon Apr 2 19:52:53 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 03 Apr 2012 11:52:53 +1000
Subject: [torqueusers] torque svn access?
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC0365@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF805ABFE4C@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805ABFED5@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805ABFF3E@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805AC0073@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805AC0123@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805AC0365@ORSMSX106.amr.corp.intel.com>
Message-ID: <4F7A57F5.6090007@unimelb.edu.au>

On 03/04/12 01:58, DuChene, StevenX A wrote:

> Has there been any progress on getting the torque svn server up and
> available again?

It's working now (announced on torquedev list) - apparently a
firewall issue - I can get to it from here now.

cheers!
Chris
--
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

From stevenx.a.duchene at intel.com Mon Apr 2 23:06:50 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 3 Apr 2012 05:06:50 +0000
Subject: [torqueusers] maui and torque not communicating
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC86B@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805ABD086@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805ABD1FF@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805ABD497@ORSMSX106.amr.corp.intel.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC0590@ORSMSX106.amr.corp.intel.com>

This evening I was able to update all of my 264 test systems to the torque
4.0-fixes stuff and after messing around with log levels all of a sudden
maui and torque seem to be communicating and processing jobs.
Unfortunately I have no idea if the changes in torque 4.0-fixes tree did
the trick or what.
When I first brought it all up but before messing around with log levels,
it seemed to be in the same non-communication state between maui and
torque server. It is working now though as far as I can tell. I will
continue to test with the various test jobs I typically run on a new
server.
--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/68df14cd/attachment-0001.html

From stevenx.a.duchene at intel.com Mon Apr 2 23:08:50 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 3 Apr 2012 05:08:50 +0000
Subject: [torqueusers] torque svn access?
In-Reply-To: <4F7A57F5.6090007@unimelb.edu.au>
References: <560DBE57F33C4C4C9FBF11C662951AF805ABFE4C@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805ABFED5@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805ABFF3E@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805AC0073@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805AC0123@ORSMSX106.amr.corp.intel.com>
 <560DBE57F33C4C4C9FBF11C662951AF805AC0365@ORSMSX106.amr.corp.intel.com>
 <4F7A57F5.6090007@unimelb.edu.au>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC05A4@ORSMSX106.amr.corp.intel.com>

Yes, I was able to pull the new sources this afternoon and this evening I
got all of my 264 test systems updated to the new code. After fumbling
around a bit it does seem like maui and torque are communicating now and
properly processing jobs.
Yahoo! Thanks to all of the folks implementing fixes and features in the
torque-4.X tree!
--
Steven DuChene

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Christopher Samuel
Sent: Monday, April 02, 2012 6:53 PM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] torque svn access?

On 03/04/12 01:58, DuChene, StevenX A wrote:

> Has there been any progress on getting the torque svn server up and
> available again?

It's working now (announced on torquedev list) - apparently a
firewall issue - I can get to it from here now.

cheers!
Chris
--
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From martinliuhao at gmail.com Tue Apr 3 07:25:13 2012
From: martinliuhao at gmail.com (刘浩)
Date: Tue, 3 Apr 2012 21:25:13 +0800
Subject: [torqueusers] A install problem
Message-ID:

I ran into a problem while installing torque.
Can anyone help me?

I ran

    $ ./configure
    $ make install

and got

[.....]
make[3]: Entering directory `/home/martinliu/soft/torque/src/daemon_client'
/bin/sh ../../libtool --tag=CC --mode=link gcc -Wall -pthread -ldl -lrt -lssl -lcrypto -g -O2 -L../lib/Libpbs/.libs -ltorque -o trqauthd trq_auth_daemon.o trq_main.o -lpthread -lrt
libtool: link: gcc -Wall -pthread -g -O2 -o .libs/trqauthd trq_auth_daemon.o trq_main.o -ldl -lssl -lcrypto -L/home/martinliu/soft/torque/src/lib/Libpbs/.libs /home/martinliu/soft/torque/src/lib/Libpbs/.libs/libtorque.so -lpthread -lrt -pthread
/usr/bin/ld: cannot find -lssl
/usr/bin/ld: cannot find -lcrypto
collect2: ld returned 1 exit status
make[3]: *** [trqauthd] Error 1
make[3]: Leaving directory `/home/martinliu/soft/torque/src/daemon_client'
make[2]: *** [install-recursive] Error 1
make[2]: Leaving directory `/home/martinliu/soft/torque/src/daemon_client'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/home/martinliu/soft/torque/src'
make: *** [install-recursive] Error 1
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/9c2e3789/attachment.html

From stevenx.a.duchene at intel.com Tue Apr 3 08:43:18 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 3 Apr 2012 14:43:18 +0000
Subject: [torqueusers] A install problem
In-Reply-To:
References:
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC0687@ORSMSX106.amr.corp.intel.com>

I believe somewhere in the release notes it specifically states that you
need openssl-devel installed.
--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/213651c8/attachment.html

From dbeer at adaptivecomputing.com Tue Apr 3 09:00:16 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 3 Apr 2012 09:00:16 -0600
Subject: [torqueusers] A install problem
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC0687@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF805AC0687@ORSMSX106.amr.corp.intel.com>
Message-ID:

Yes, and in 4.0.1 configure will check for these required libraries and
give you semi-intelligent messages about what you should get.

David

2012/4/3 DuChene, StevenX A

> I believe somewhere in the release notes it specifically states that
> you need openssl-devel installed.

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/21886cdd/attachment-0001.html

From martinliuhao at gmail.com Tue Apr 3 09:11:29 2012
From: martinliuhao at gmail.com (Martin Liu)
Date: Tue, 03 Apr 2012 23:11:29 +0800
Subject: [torqueusers] A install problem
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC0687@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF805AC0687@ORSMSX106.amr.corp.intel.com>
Message-ID: <4F7B1321.8010104@gmail.com>

On 2012-04-03 22:43, DuChene, StevenX A wrote:

> I believe somewhere in the release notes it specifically states that
> you need openssl-devel installed.

Oh, I forgot to read the release notes and only read the Administrator's
Guide. Anyway, I do appreciate your kind help.

All the best,
Hao Liu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/08fb248d/attachment.html

From dbeer at adaptivecomputing.com Tue Apr 3 09:12:16 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 3 Apr 2012 09:12:16 -0600
Subject: [torqueusers] Can't submit job from remote submit host
In-Reply-To: <4F7693A9.9080501@ssau.ru>
References: <4F7693A9.9080501@ssau.ru>
Message-ID:

What does your /etc/hosts.equiv file look like? You may need an entry that
is something like:

    +
or


The + just allows all users. Note that this check is in addition to your
TORQUE settings so having a plus there doesn't make everyone a manager or
anything crazy like that.

David

On Fri, Mar 30, 2012 at 11:18 PM, Alexandr Baskakov wrote:

> Hi, All.
>
> I'm trying to submit a job from a submit host to a remote server running
> torque.
>
> I have 2 nodes:
> mgt1 - torque client
> mgt2 - torque server and moab.
>
> Domain: ssc
>
> On mgt2:
> [mgt2 ~]$ qmgr -c 'l s'
> Server mgt2
>     server_state = Active
>     scheduling = True
>     total_jobs = 0
>     state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
>     acl_hosts = localhost,mgt2,mgt1
>     managers = root at mgt2,torque at mgt2
>     operators = root at mgt2,torque at mgt2
>     default_queue = batch
>     log_events = 511
>     mail_from = adm
>     query_other_jobs = True
>     resources_assigned.ncpus = 0
>     resources_assigned.nodect = 0
>     scheduler_iteration = 600
>     node_check_rate = 150
>     tcp_timeout = 6
>     log_level = 7
>     mom_job_sync = True
>     pbs_version = 3.0.2
>     keep_completed = 300
>     submit_hosts = mgt1.ssc
>     next_job_number = 57
>     net_counter = 2 0 0
>
> When I try to submit a job from mgt1 with:
>
> [mgt1 ~]$ PBS_DEFAULT=mgt2 qsub
> hostname
> qsub: Bad UID for job execution MSG=ruserok failed validating avb/avb from mgt1
>
> I get an error.
>
> On mgt2, in the logfile:
> 03/26/2012 15:52:57;0080;PBS_Server;Req;dis_request_read;decoding command AuthenticateUser from avb
> 03/26/2012 15:52:57;0100;PBS_Server;Req;;Type AuthenticateUser request received from avb at mgt1.ssc, sock=14
> 03/26/2012 15:52:57;0008;PBS_Server;Job;dispatch_request;dispatching request AuthenticateUser on sd=14
> 03/26/2012 15:52:57;0008;PBS_Server;Job;reply_send;Reply sent for request type AuthenticateUser on socket 14
> 03/26/2012 15:52:57;0080;PBS_Server;Req;dis_request_read;decoding command Disconnect from PBS_Server
> 03/26/2012 15:52:57;0080;PBS_Server;Req;dis_request_read;decoding command QueueJob from avb
> 03/26/2012 15:52:57;0100;PBS_Server;Req;;Type QueueJob request received from avb at mgt1.ssc, sock=13
> 03/26/2012 15:52:57;0008;PBS_Server;Job;dispatch_request;dispatching request QueueJob on sd=13
> 03/26/2012 15:52:57;0080;PBS_Server;Job;62.mgt2;removed job file
> 03/26/2012 15:52:57;0080;PBS_Server;Req;req_reject;Reject reply code=15025(Bad UID for job execution MSG=ruserok failed validating avb/avb from mgt1), aux=0, type=QueueJob, from avb at mgt1.ssc
> 03/26/2012 15:52:57;0008;PBS_Server;Job;reply_send;Reply sent for request type QueueJob on socket 13
>
> Authentication on mgt1 and mgt2 is done via nss_ldap. Logging in to mgt2
> as user avb works fine.
>
> Can anyone help, please...
>
> --
> Alexandr Baskakov, Samara State Aerospace University
> e-mail: avb at ssau.ru

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/1247030a/attachment.html

From dbeer at adaptivecomputing.com Tue Apr 3 09:15:16 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 3 Apr 2012 09:15:16 -0600
Subject: [torqueusers] Simple Q. about controlling CPU utilization per user
In-Reply-To:
References: <007DECE986B47F4EABF823C1FBB19C620102B6D6AE45@exvic-mbx04.nexus.csiro.au>
Message-ID:

I don't know of an option to control cpu utilization, but have you looked
into the cpusets feature?

David

On Mon, Mar 26, 2012 at 3:35 PM, Ian Miller wrote:

> Hi All,
> Is there a simple switch or config edit to curb the CPU utilization per
> job submitted in torque? I'm running 3.0.3.
> Thx
>
> -I

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/afd95bb7/attachment.html

From martinliuhao at gmail.com Tue Apr 3 09:15:22 2012
From: martinliuhao at gmail.com (Martin Liu)
Date: Tue, 03 Apr 2012 23:15:22 +0800
Subject: [torqueusers] A install problem
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF805AC0687@ORSMSX106.amr.corp.intel.com>
Message-ID: <4F7B140A.6020704@gmail.com>

On 2012-04-03 23:00, David Beer wrote:

> Yes, and in 4.0.1 configure will check for these required libraries
> and give you semi-intelligent messages about what you should get.
>
> David

That sounds pretty good and will of course be very helpful.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/3cd8be28/attachment-0001.html

From stevenx.a.duchene at intel.com Tue Apr 3 09:23:24 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 3 Apr 2012 15:23:24 +0000
Subject: [torqueusers] A install problem
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF805AC0687@ORSMSX106.amr.corp.intel.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC06DB@ORSMSX106.amr.corp.intel.com>

A few warm beers semi-intelligent or several vodka shooters and gin
rickies semi-intelligent?
:)
--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/0dbb2d7f/attachment.html

From avb at ssau.ru Tue Apr 3 11:20:50 2012
From: avb at ssau.ru (Alexandr Baskakov)
Date: Tue, 03 Apr 2012 21:20:50 +0400
Subject: [torqueusers] Can't submit job from remote submit host
In-Reply-To:
References: <4F7693A9.9080501@ssau.ru>
Message-ID: <4F7B3172.6090902@ssau.ru>

I have submit_hosts in my server settings.

    ...
    submit_hosts = mgt1.ssc
    ...

Anyway, creating /etc/hosts.equiv with

    mgt1.ssc+

leads to the same result:

    qsub: Bad UID for job execution MSG=ruserok failed validating avb/avb from mgt1

On 03.04.2012 19:12, David Beer wrote:

> What does your /etc/hosts.equiv file look like?
> You may need an entry that is something like:
>
>     +
> or
>
>
> The + just allows all users. Note that this check is in addition to your
> TORQUE settings so having a plus there doesn't make everyone a manager or
> anything crazy like that.
>
> David

--
Alexandr Baskakov, Samara State Aerospace University
e-mail: avb at ssau.ru
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/69f9afbb/attachment-0001.html

From jjc at iastate.edu Tue Apr 3 13:29:06 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Tue, 3 Apr 2012 19:29:06 +0000
Subject: [torqueusers] Simple Q. about controlling CPU utilization per user
In-Reply-To:
References: <007DECE986B47F4EABF823C1FBB19C620102B6D6AE45@exvic-mbx04.nexus.csiro.au>
Message-ID: <242421BFAF465844BE24EB90BB97E22104FF56CC@ITSDAG3D.its.iastate.edu>

Ian,

Two suggestions:

1) You could try adding:

    renice +18 $$

into the script that starts the mom on the compute nodes.
This would cause all torque jobs to run with nice priority 18
(one above the minimal priority).

-----

2) One other possibility is to create the file
/var/spool/torque/mom_priv/prologue
containing:

    #!/bin/csh -f
    renice 18 -u $2

and issue

    chmod u+x /var/spool/torque/mom_priv/prologue

Do this on all your compute nodes.

This causes all processes on the node currently owned by the user who
submitted the job to run at priority 18, along with any child tasks of
those processes. You could even test for specific users, maybe exempting
the person who bought the machine if they are just letting others use it.

-----

You might use #2 (perhaps with 12 rather than 18) if you are in a research
group letting grad students use your desktop Linux machine while you use
it interactively. I actually did this, limiting to ? the machine memory
and renicing all batch jobs to 12, and exempting myself. I didn't see much
slowdown in my interactive work, and the students used 15% of all the
machine cycles over the 5 years I had the machine set up this way. The
only cost was the extra memory I needed on my machine, and one extra disk
for scratch. Then they did not all have to get large memory machines,
which would have been idle most of the time.

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/ From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ian Miller Sent: Monday, March 26, 2012 4:36 PM To: Torque Users Mailing List Subject: [torqueusers] Simple Q. about controlling CPU utilization per user Hi All, Is there a simple switch or config edit to curb the CPU utilization per job submitted in torque? I'm running the 3.0.3. Thx -I -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/0633442c/attachment.html From knielson at adaptivecomputing.com Tue Apr 3 15:43:17 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 3 Apr 2012 15:43:17 -0600 Subject: [torqueusers] TORQUE 4.0.1 release candidate available Message-ID: Hi all, There is a TORQUE 4.0.1 release candidate available for download. There are several bug fixes and code refactoring over the 4.0.0 release. Please see the CHANGELOG for a list of bug fixes and enhancements. The code can be downloaded from http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-4.0.1-snap.201204031514.tar.gz Please download this and let us know what problems you find. Regards Ken Nielson -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/c9beaf68/attachment.html From stevenx.a.duchene at intel.com Tue Apr 3 16:03:04 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Tue, 3 Apr 2012 22:03:04 +0000 Subject: [torqueusers] TORQUE 4.0.1 release candidate available In-Reply-To: References: Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC08BA@ORSMSX106.amr.corp.intel.com> Tried to use the packaged spec file in the snapshot to build rpms but it failed. It seems that a referenced include file is missing or the name is misspelled. gcc -DHAVE_CONFIG_H -I.
-I../../src/include -I../../src/include -I../../src/resmom/linux -DPBS_MOM -DDEMUX=\"/usr/sbin/pbs_demux\" -DRCP_PATH=\"/usr/bin/scp\" -DRCP_ARGS=\"-rpB\" -DPBS_SERVER_HOME=\"/var/spool/torque\" -DPBS_ENVIRON=\"/var/spool/torque/pbs_environment\" -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -c -o job_recov.o `test -f '../server/job_recov.c' || echo './'`../server/job_recov.c mom_comm.c:127:42: error: mom_job_func.h: No such file or directory mom_process_request.c:43:42: error: mom_job_func.h: No such file or directory mom_process_request.c: In function 'close_quejob': mom_process_request.c:540: warning: implicit declaration of function 'job_purge' catch_child.c:38:42: error: mom_job_func.h: No such file or directory mom_comm.c: In function 'im_join_job_as_sister': mom_comm.c:2427: warning: implicit declaration of function 'job_purge' make[3]: *** [mom_process_request.o] Error 1 make[3]: *** Waiting for unfinished jobs.... 
mom_main.c:81:42: error: mom_job_func.h: No such file or directory catch_child.c: In function 'mom_deljob': catch_child.c:2008: warning: implicit declaration of function 'job_purge' mom_req_quejob.c:106:42: error: mom_job_func.h: No such file or directory make[3]: *** [catch_child.o] Error 1 mom_req_quejob.c: In function 'req_quejob': mom_req_quejob.c:353: warning: implicit declaration of function 'job_purge' mom_main.c: In function 'rm_request': mom_main.c:4496: warning: implicit declaration of function 'job_purge' make[3]: *** [mom_req_quejob.o] Error 1 make[3]: *** [mom_comm.o] Error 1 make[3]: *** [mom_main.o] Error 1 make[3]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0.1-snap.201204031514/src/resmom' make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0.1-snap.201204031514/src/resmom' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0.1-snap.201204031514/src' make: *** [all-recursive] Error 1 error: Bad exit status from /var/tmp/rpm-tmp.9L8CSg (%build) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/046a592e/attachment-0001.html From stevenx.a.duchene at intel.com Tue Apr 3 16:06:39 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Tue, 3 Apr 2012 22:06:39 +0000 Subject: [torqueusers] TORQUE 4.0.1 release candidate available In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC08BA@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805AC08BA@ORSMSX106.amr.corp.intel.com> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC08CD@ORSMSX106.amr.corp.intel.com> I just checked my svn source tree for 4.0-fixes and the file is present there. -- Steven DuChene -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/2cdf6bfd/attachment.html From stevenx.a.duchene at intel.com Tue Apr 3 16:12:54 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Tue, 3 Apr 2012 22:12:54 +0000 Subject: [torqueusers] TORQUE 4.0.1 release candidate available In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC08CD@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805AC08BA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC08CD@ORSMSX106.amr.corp.intel.com> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC08E3@ORSMSX106.amr.corp.intel.com> I just pulled the svn tree for 4.0.1 and the file is there too so the tar ball must be incomplete. - Steven DuChene -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/80c91b88/attachment-0001.html From stevenx.a.duchene at intel.com Tue Apr 3 16:43:55 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Tue, 3 Apr 2012 22:43:55 +0000 Subject: [torqueusers] TORQUE 4.0.1 release candidate available In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC08E3@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805AC08BA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC08CD@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC08E3@ORSMSX106.amr.corp.intel.com> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC0912@ORSMSX106.amr.corp.intel.com> FYI, The svn 4.0.1 source tree does seem to build into the required rpms just fine. I am currently in the process of updating to those newer rpms now on my 264 nodes. -- Steven DuChene -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/e105f339/attachment.html From dbeer at adaptivecomputing.com Tue Apr 3 17:04:41 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Tue, 3 Apr 2012 17:04:41 -0600 Subject: [torqueusers] TORQUE 4.0.1 release candidate available In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC0912@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805AC08BA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC08CD@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC08E3@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC0912@ORSMSX106.amr.corp.intel.com> Message-ID: We apologize for this problem - it appears that make distcheck wasn't run before this was sent out. Please check the new snapshot here: http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-4.0.1-snap.201204031702.tar.gz Thanks Steve for diving into this and finding the problem. 
David -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/8ed9de38/attachment-0001.html From stevenx.a.duchene at intel.com Tue Apr 3 17:14:00 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Tue, 3 Apr 2012 23:14:00 +0000 Subject: [torqueusers] TORQUE 4.0.1 release candidate available In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC08BA@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805AC08BA@ORSMSX106.amr.corp.intel.com> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC094F@ORSMSX106.amr.corp.intel.com> I also downloaded the snapshot file again on a completely different system and the mom_job_func.h file is still not present. I tried the same build on that system from the newly downloaded snapshot file and it failed in the same spot due to the missing include file. -- Steven DuChene -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/a194439b/attachment.html From dbeer at adaptivecomputing.com Tue Apr 3 17:19:29 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Tue, 3 Apr 2012 17:19:29 -0600 Subject: [torqueusers] TORQUE 4.0.1 release candidate available In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC094F@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805AC08BA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC094F@ORSMSX106.amr.corp.intel.com> Message-ID: Are you certain you used the correct tarball? I am able to build the new one and not the old one. Also: dbeer at napali:/home/dbeer/Downloads/torque-4.0.1# ls src/resmom/| grep mom_job_func.h mom_job_func.h David -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/09348906/attachment-0001.html

From stevenx.a.duchene at intel.com Tue Apr 3 17:22:45 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 3 Apr 2012 23:22:45 +0000
Subject: [torqueusers] TORQUE 4.0.1 release candidate available
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF805AC08BA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC094F@ORSMSX106.amr.corp.intel.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC0989@ORSMSX106.amr.corp.intel.com>

Sorry, I sent the message below before I had noticed or read your follow-up message with the new snapshot. I will try that as soon as I finish downloading it to my development systems.

Thanks for the update. I will let you know how the build goes.

BTW, I tried subscribing to the torque-devel list, but I think it was during the issues with the mail server, so my subscription request never went through.

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Tuesday, April 03, 2012 4:19 PM
To: Torque Users Mailing List
Cc: Torque Developers mailing list
Subject: Re: [torqueusers] TORQUE 4.0.1 release candidate available

Are you certain you used the correct tarball? I am able to build the new one and not the old one. Also:

dbeer at napali:/home/dbeer/Downloads/torque-4.0.1# ls src/resmom/| grep mom_job_func.h
mom_job_func.h

David

On Tue, Apr 3, 2012 at 5:14 PM, DuChene, StevenX A wrote:

I also downloaded the snapshot file again on a completely different system and the mom_job_func.h file is still not present. I tried the same build on that system from the newly downloaded snapshot file and it failed in the same spot due to the missing include file.
--
Steven DuChene

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of DuChene, StevenX A
Sent: Tuesday, April 03, 2012 3:03 PM
To: Torque Users Mailing List; Torque Developers mailing list
Subject: Re: [torqueusers] TORQUE 4.0.1 release candidate available

Tried to use the packaged spec file in the snapshot to build rpms but it failed. It seems that a referenced include file is missing or the name is misspelled.

gcc -DHAVE_CONFIG_H -I. -I../../src/include -I../../src/include
-I../../src/resmom/linux -DPBS_MOM -DDEMUX=\"/usr/sbin/pbs_demux\"
-DRCP_PATH=\"/usr/bin/scp\" -DRCP_ARGS=\"-rpB\"
-DPBS_SERVER_HOME=\"/var/spool/torque\"
-DPBS_ENVIRON=\"/var/spool/torque/pbs_environment\" -O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=generic -c -o job_recov.o `test -f
'../server/job_recov.c' || echo './'`../server/job_recov.c
mom_comm.c:127:42: error: mom_job_func.h: No such file or directory
mom_process_request.c:43:42: error: mom_job_func.h: No such file or directory
mom_process_request.c: In function 'close_quejob':
mom_process_request.c:540: warning: implicit declaration of function 'job_purge'
catch_child.c:38:42: error: mom_job_func.h: No such file or directory
mom_comm.c: In function 'im_join_job_as_sister':
mom_comm.c:2427: warning: implicit declaration of function 'job_purge'
make[3]: *** [mom_process_request.o] Error 1
make[3]: *** Waiting for unfinished jobs....
mom_main.c:81:42: error: mom_job_func.h: No such file or directory
catch_child.c: In function 'mom_deljob':
catch_child.c:2008: warning: implicit declaration of function 'job_purge'
mom_req_quejob.c:106:42: error: mom_job_func.h: No such file or directory
make[3]: *** [catch_child.o] Error 1
mom_req_quejob.c: In function 'req_quejob':
mom_req_quejob.c:353: warning: implicit declaration of function 'job_purge'
mom_main.c: In function 'rm_request':
mom_main.c:4496: warning: implicit declaration of function 'job_purge'
make[3]: *** [mom_req_quejob.o] Error 1
make[3]: *** [mom_comm.o] Error 1
make[3]: *** [mom_main.o] Error 1
make[3]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0.1-snap.201204031514/src/resmom'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0.1-snap.201204031514/src/resmom'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/rpmbuild/BUILD/torque-4.0.1-snap.201204031514/src'
make: *** [all-recursive] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.9L8CSg (%build)

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson
Sent: Tuesday, April 03, 2012 2:43 PM
To: torqueusers; Torque Developers mailing list
Subject: [torqueusers] TORQUE 4.0.1 release candidate available

Hi all,

There is a TORQUE 4.0.1 release candidate available for download. There are several bug fixes and code refactoring over the 4.0.0 release. Please see the CHANGELOG for a list of bug fixes and enhancements.

The code can be downloaded from
http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-4.0.1-snap.201204031514.tar.gz

Please download this and let us know what problems you find.
Regards

Ken Nielson

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120403/f3c90f9c/attachment.html

From stevenx.a.duchene at intel.com Tue Apr 3 20:15:29 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Wed, 4 Apr 2012 02:15:29 +0000
Subject: [torqueusers] TORQUE 4.0 and hwloc
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com>

I installed the hwloc-1.4.1 and hwloc-devel-1.4.1 rpms on the server where I am building torque-4.X. Looking through the output from the configure script during the build, I do not see anywhere that the existence of hwloc is checked. In fact, grepping through the output from the whole torque rpm build process, I do not see ANY mention of hwloc at all.

I see the HWLOC_CFLAGS and HWLOC_LIBS variables mentioned in the --help output from configure, but according to the description text these are just supposed to override the pkg-config results. However, I do not see any evidence that pkg-config is being queried at all for the existence of hwloc on the build server.

Is there some step I am missing?

I thought someone mentioned that there would be better documentation of the hwloc business in the torque-4.0.1 release?

If so where is it?
--
Steven DuChene

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Monday, March 19, 2012 8:54 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] TORQUE 4.0 Officially Announced

Steve,

Hwloc is now required for running cpusets in TORQUE, and it helps out a lot both in immediate use and in groundwork for future features.

Immediately hwloc gives you a better cpuset because it gives you the next core instead of the next indexed core. For example: many eight core systems have processors 0, 2, 4, and 6 next to each other and processors 1, 3, 5, and 7 next to each other. If you're running a pre-4.0 TORQUE, and you have two jobs on the node, each with 4 cores, job 1 will have 0-3 and job 2 will have 4-7. In TORQUE 4.0, job 1 will have 0, 2, 4, and 6, and job 2 will have 1, 3, 5, and 7. This should help speed up processing times for jobs (NOTE: only if you have this kind of system and a comparable job layout, I'm not promising a general speed-up to everyone using cpusets). This should also allow us to properly handle hyperthreading for anyone that has it turned on and wishes to use it.

The last immediate feature is if you have SMT (simultaneous multi-threading) hardware. The mom config variable $use_smt was added. By default, the use of SMT is enabled, but you can tell your pbs_mom to ignore the SMT threads (not place them in the cpuset) by adding

$use_smt false

to your mom config file.

For the future, the hwloc threads make it really easy for us to handle hardware specific requests. One of the coming features for TORQUE is to allow requests roughly similar to:

socket=2:numa=2 --with-hyperthreads

which would say to spread the job over 2 sockets, and across the 2 numa nodes on each socket. This is a feature we plan to add to improve support for Magny-Cours and Opteron type processors that have multiple sockets and/or multiple numa nodes on the processor chip. Using hwloc makes it so we don't have to parse system files and map the indices to the sockets and/or numa nodes ourselves; we can simply use easy hwloc functions like hwloc_get_next_obj_inside_cpuset_by_type() that allow you to just move on to the next physical core or virtual core, or skip to the next socket or numa node as the case may be.

David

On Mon, Mar 19, 2012 at 8:47 AM, DuChene, StevenX A wrote:

Also a better (more complete) explanation of what features are enabled when hwloc is used would be helpful as well.

BTW, I built torque on my server without hwloc installed and then installed the resulting mom packages on my nodes. The mom daemons in that case did seem to start up just fine.
--
Steven DuChene

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Craig West
Sent: Sunday, March 18, 2012 10:40 PM
To: Torque Users mailing list; Torque Developers mailing list
Subject: Re: [torqueusers] TORQUE 4.0 Officially Announced

Hi Steven,

I have just begun testing Torque 4.0, as hwloc has been a long awaited feature for me.

> It is unclear from this announcement text where hwloc has to be installed.
> Is it just on the server or on the nodes only?

It needs to be available on the BUILD server and the nodes. I tried to run pbs_mom on a node without the hwloc I had installed and it failed.

Note: I am running hwloc 1.4 from a directory in /usr/local
This was not automatically found by the TORQUE configure script, but you can specify the location using HWLOC_CFLAGS & HWLOC_LIBS.
It embeds the locations that you specify in the pbs_mom (and other files) but it seems you can set the LD_LIBRARY_PATH variable if it is not in the same location on the BUILD server as the compute nodes.
For simplicity installing them in the same location makes sense.

> More documentation about this would be greatly appreciated.

I agree, clearer and more detailed documentation would be useful.

Cheers,
Craig.
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120404/2b5386ae/attachment-0001.html

From dbeer at adaptivecomputing.com Wed Apr 4 08:59:42 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 4 Apr 2012 08:59:42 -0600
Subject: [torqueusers] TORQUE 4.0 and hwloc
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com>
Message-ID:

Steven,

I was supposed to add that note and I forgot - my mistake and thanks for catching it. I have now added:

*** For admins that use cpusets in any form ***
hwloc version 1.1 or greater is now required for building TORQUE with cpusets, as pbs_mom now uses the hwloc API to create the cpusets instead of creating them manually.

to README.building_40.

As far as checking for the existence of the library, this does happen at configure time once the configure script determines that the user is going to be using cpusets in any way, which a few different configure options can trigger.
David

On Tue, Apr 3, 2012 at 8:15 PM, DuChene, StevenX A <stevenx.a.duchene at intel.com> wrote:
> I installed hwloc-1.4.1 and hwloc-devel-1.4.1 rpms on the server where I
> am building torque-4.X and in looking through the output from the configure
> script during the build I do not see anywhere that the existence of any
> hwloc stuff is checked. In fact in grepping through the output from the
> whole torque rpm build process I do not see ANY mention of hwloc at all.
>
> I see compile time flags of HWLOC_CFLAGS and HWLOC_LIBS mentioned in the
> --help output from configure but according to the description text this is
> just supposed to over-ride the pkg-config results however I do not see any
> evidence that the pkg-config system is being quizzed at all for the
> existence of hwloc on the build server.
>
> Is there some step I am missing?
>
> I thought someone mentioned that there would be better documentation of
> the hwloc business in the torque-4.0.1 release?
>
> If so where is it?
> --
> Steven DuChene

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120404/5179d804/attachment.html

From gus at ldeo.columbia.edu Wed Apr 4 09:50:43 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Wed, 04 Apr 2012 11:50:43 -0400
Subject: [torqueusers] TORQUE 4.0 and hwloc
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com>
Message-ID: <4F7C6DD3.4090509@ldeo.columbia.edu>

Hi David

Not to hijack Steven's thread ...
... but just taking a quick ride on it ... :)

Does the hwloc 1.1 requirement apply only to Torque 4.0?
How about the older Torque series [2.X.Y, 3.X.Y]
that use cpuset?
[I am in the process of building 2.4.16 with cpuset.]

Thank you,
Gus Correa

On 04/04/2012 10:59 AM, David Beer wrote:
> Steven,
>
> I was supposed to add that note and I forgot - my mistake and thanks for
> catching it.
I have now added:
>
> *** For admins that use cpusets in any form ***
> hwloc version 1.1 or greater is now required for building TORQUE with
> cpusets, as pbs_mom now uses the
> hwloc API to create the cpusets instead of creating them manually.
>
> to README.building_40.
>
> As far as checking for the existence of the library, this does happen at
> configure time once the configure script determines that the user is
> going to be using cpusets in any way, which a few different configure
> options can trigger.
>
> David

From dbeer at adaptivecomputing.com Wed Apr 4 09:52:42 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 4 Apr 2012 09:52:42 -0600
Subject: [torqueusers] TORQUE 4.0 and hwloc
In-Reply-To: <4F7C6DD3.4090509@ldeo.columbia.edu>
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com> <4F7C6DD3.4090509@ldeo.columbia.edu>
Message-ID:

On Wed, Apr 4, 2012 at 9:50 AM, Gus Correa wrote:
> Hi David
>
> Not to hijack Steven's thread ...
> ... but just taking a quick ride on it ... :)
>
> Does the hwloc 1.1 requirement apply only to Torque 4.0?
> How about the older Torque series [2.X.Y, 3.X.Y]
> that use cpuset?
> [I am in the process of building 2.4.16 with cpuset.]
>

This only applies to 4.0 and higher.
> Thank you,
> Gus Correa

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120404/63198595/attachment.html

From stevenx.a.duchene at intel.com Wed Apr 4 11:02:19 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Wed, 4 Apr 2012 17:02:19 +0000
Subject: [torqueusers] TORQUE 4.0 and hwloc
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC0F4A@ORSMSX106.amr.corp.intel.com>

Hmmm, ok, so there are certain configure options that have an effect on whether the configure script looks for hwloc.

Do those include all or only some of the following?

--enable-geometry-requests
--enable-cpuset
--enable-libcpuset
--enable-numa-support

I am trying to see if this gets correctly enabled when I build the rpms, but in looking through the torque.spec file it is a little confusing. I see the following in the spec file:

# bcond_without defaults to WITH, and vice versa.

But then I see a little further:

### Features disabled by default
%bcond_with blcr
%bcond_with cpuset

And on the line that actually calls configure from within the spec file I see:

%configure --includedir=%{_includedir}/%{name} --with-default-server=%{torque_server} \
 --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} \
 --disable-dependency-tracking %{ac_with_gui} %{ac_with_scp} %{ac_with_syslog} \
 --disable-gcc-warnings %{ac_with_munge} %{ac_with_pam} %{ac_with_drmaa} \
 --disable-qsub-keep-override %{ac_with_blcr} %{ac_with_cpuset} %{ac_with_spool} %{?acflags}

So is "%bcond_with cpuset" supposed to turn it off or on? If it is supposed to turn it on then, as I said before, it is not working.

Now I know I can just alter the spec file to hard-code it on with "--enable-cpuset" or "--enable-libcpuset" or possibly "--enable-geometry-requests", but I am trying to understand the logic of what I see someone cleverly added into the torque spec file as distributed with the torque-4.0 sources.
--
Steven DuChene

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 04, 2012 8:00 AM
To: Torque Users Mailing List
Cc: Torque Developers mailing list
Subject: Re: [torqueusers] TORQUE 4.0 and hwloc

Steven,

I was supposed to add that note and I forgot - my mistake and thanks for catching it. I have now added:

*** For admins that use cpusets in any form ***
hwloc version 1.1 or greater is now required for building TORQUE with cpusets, as pbs_mom now uses the hwloc API to create the cpusets instead of creating them manually.

to README.building_40.

As far as checking for the existence of the library, this does happen at configure time once the configure script determines that the user is going to be using cpusets in any way, which a few different configure options can trigger.

David

On Tue, Apr 3, 2012 at 8:15 PM, DuChene, StevenX A wrote:

I installed hwloc-1.4.1 and hwloc-devel-1.4.1 rpms on the server where I am building torque-4.X, and in looking through the output from the configure script during the build I do not see anywhere that the existence of any hwloc stuff is checked. In fact, in grepping through the output from the whole torque rpm build process I do not see ANY mention of hwloc at all.

I see compile-time flags of HWLOC_CFLAGS and HWLOC_LIBS mentioned in the --help output from configure, but according to the description text this is just supposed to override the pkg-config results. However, I do not see any evidence that the pkg-config system is being quizzed at all for the existence of hwloc on the build server.

Is there some step I am missing?
I thought someone mentioned that there would be better documentation of the hwloc business in the torque-4.0.1 release? If so where is it? -- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer Sent: Monday, March 19, 2012 8:54 AM To: Torque Users Mailing List Subject: Re: [torqueusers] TORQUE 4.0 Officially Announced Steve, Hwloc is now required for running cpusets in TORQUE, and it helps out a lot both in immediate use and in groundwork for future features. Immediately hwloc gives you a better cpuset because it gives you the next core instead of the next indexed core. For example: many eight core systems have processors 0, 2, 4, and 6 next to each other and processors 1, 3, 5, and 7 next to each other. If you're running a pre-4.0 TORQUE, and you have two jobs on the node, each with 4 cores, job 1 will have 0-3 and job 2 will have 4-7. In TORQUE 4.0, job 1 will have 0, 2, 4, and 6, and job 2 will have 1, 3, 5, and 7. This should help speed up processing times for jobs (NOTE: only if you have this kind of system and a comparable job layout, I'm not promising a general speed-up to everyone using cpusets). This should also allow us to properly handle hyperthreading for anyone that has it turned on and wishes to use it. The last immediate feature is if you have SMT (simultaneous multi-threading) hardware. The mom config variable $use_smt was added. By default, the use of SMT is enabled, but you can tell your pbs_mom to ignore them (not place them in the cpuset) using by adding $use_smt false to your mom config file For the future, the hwloc threads make it really easy for us to handle hardware specific requests. One of the coming features for TORQUE is to allow requests roughly similar to: socket=2:numa=2 --with-hyperthreads which would say to spread the job over 2 sockets, and across the 2 numa nodes on each socket. 
This is a feature we plan to add to improve support for Magny-Cours and Opteron type processors that have multiple sockets and or multiple numa nodes on the processor chip. Using hwloc makes it so we don't have to parse system files and map the indices to the sockets and/or numa nodes ourselves, we can simply use easy hwloc functions like hwloc_get_next_obj_inside_cpuset_by_type() that allow you to just move on to the next physical core or virtual core, or skip to the next socket or numa node as the case may be. David On Mon, Mar 19, 2012 at 8:47 AM, DuChene, StevenX A > wrote: Also a better (more complete) explanation of what features are enabled when hwloc is used would be helpful as well. BTW, I built torque on my server without hwloc installed and then installed the resulting mom packages on my nodes. The mom daemons in that case did seem to start up just fine. -- Steven DuChene -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Craig West Sent: Sunday, March 18, 2012 10:40 PM To: Torque Users mailing list; Torque Developers mailing list Subject: Re: [torqueusers] TORQUE 4.0 Officially Announced Hi Steven, I have just begun testing Torque 4.0, as hwloc has been a long awaited feature for me. > It is unclear from this announcement text where hwloc has to be installed. > Is it just on the server or on the nodes only? It needs to be available on the BUILD server and the nodes. I tried to run pbs_mom on a node without the hwloc I had installed and it failed. Note: I am running hwloc 1.4 from a directory in /usr/local This was not automatically found by the TORQUE configure script, but you can specify the location using HWLOC_CFLAGS & HWLOC_LIBS. It embeds the locations that you specify in the pbs_mom (and other files) but it seems you can set the LD_LIBRARY_PATH variable if it is not in the same location on the BUILD server as the compute nodes. 
For simplicity installing them in the same location makes sense. > More documentation about this would be greatly appreciated. I agree, clearer and more detailed documentation would be useful. Cheers, Craig. _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Software Engineer Adaptive Computing _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120404/84506d4b/attachment-0001.html From gabe at msi.umn.edu Wed Apr 4 11:09:47 2012 From: gabe at msi.umn.edu (Gabe Turner) Date: Wed, 4 Apr 2012 12:09:47 -0500 Subject: [torqueusers] TORQUE 4.0 and hwloc In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC0F4A@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC0F4A@ORSMSX106.amr.corp.intel.com> Message-ID: <20120404170947.GC30037@blackice.msi.umn.edu> On Wed, Apr 04, 2012 at 05:02:19PM +0000, DuChene, StevenX A wrote: [snip] > And on the line that actually calls configure from within the spec file I see: > > %configure --includedir=%{_includedir}/%{name} --with-default-server=%{torque_server} \ > --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} \ > --disable-dependency-tracking %{ac_with_gui} 
%{ac_with_scp} %{ac_with_syslog} \
> --disable-gcc-warnings %{ac_with_munge} %{ac_with_pam} %{ac_with_drmaa} \
> --disable-qsub-keep-override %{ac_with_blcr} %{ac_with_cpuset} %{ac_with_spool} %{?acflags}
>
> So is "%bcond_with cpuset" supposed to turn it off or on? If it is
> supposed to turn it on then as I said before it is not working.
>
> Now I know I can just alter the spec file to hard code turn it on with
> "--enable-cpuset" or "--enable-libcpuset" or possibly
> "--enable-geometry-requests" but I am trying to understand the logic of
> what I see someone cleverly added into the torque spec file as
> distributed with the torque-4.0 sources.

It looks to me like the spec file is supporting the --with option to rpmbuild. So cpuset will be enabled as a configure option if you pass '--with cpuset' to rpmbuild. Is that what you are already trying?

--
Gabe Turner gabe at msi.umn.edu
HPC Systems Administrator, University of Minnesota
Supercomputing Institute http://www.msi.umn.edu

From dbeer at adaptivecomputing.com Wed Apr 4 11:11:20 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 4 Apr 2012 11:11:20 -0600
Subject: [torqueusers] TORQUE 4.0 and hwloc
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805AC0F4A@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC0F4A@ORSMSX106.amr.corp.intel.com>
Message-ID:

On Wed, Apr 4, 2012 at 11:02 AM, DuChene, StevenX A <stevenx.a.duchene at intel.com> wrote:

> Hmmm, ok so there are certain configure options that have an effect on
> whether the configure script looks for hwloc.
>
> Do those include all or only some of the following?
>
> --enable-geometry-requests
> --enable-cpuset
> --enable-libcpuset
> --enable-numa-support

It includes all of those.

-- 
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120404/67d8cb28/attachment-0001.html From stevenx.a.duchene at intel.com Wed Apr 4 11:28:05 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Wed, 4 Apr 2012 17:28:05 +0000 Subject: [torqueusers] TORQUE 4.0 and hwloc In-Reply-To: <20120404170947.GC30037@blackice.msi.umn.edu> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC0F4A@ORSMSX106.amr.corp.intel.com> <20120404170947.GC30037@blackice.msi.umn.edu> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC0F90@ORSMSX106.amr.corp.intel.com> No, I have not tried that yet but I will now. Thanks for the hint Gabe. -- Steven DuChene -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gabe Turner Sent: Wednesday, April 04, 2012 10:10 AM To: torqueusers at supercluster.org Subject: Re: [torqueusers] TORQUE 4.0 and hwloc On Wed, Apr 04, 2012 at 05:02:19PM +0000, DuChene, StevenX A wrote: [snip] > And on the line that actually calls configure from within the spec file I see: > > %configure --includedir=%{_includedir}/%{name} --with-default-server=%{torque_server} \ > --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} \ > --disable-dependency-tracking %{ac_with_gui} %{ac_with_scp} %{ac_with_syslog} \ > --disable-gcc-warnings %{ac_with_munge} %{ac_with_pam} %{ac_with_drmaa} \ > --disable-qsub-keep-override %{ac_with_blcr} %{ac_with_cpuset} > %{ac_with_spool} %{?acflags} > > So is "%bcond_with cpuset" supposed to turn it off or on? If it is > supposed to turn it on then as I said before it is not working. 
> > Now I know I can just alter the spec file to hard code turn it on with > "-enable-cpuset" or "-enable-libcpuset" or possibly > "--enable-geometry-requests" but I am trying to understand the logic > of what I see someone cleverly added into the torque spec file as > distributed with the torque-4.0 sources. It looks to me like the spec file is supporting the --with option to rpmbuild. So cpuset will be enabled as a configure option if you pass '--with cpuset' to rpmbuild. Is that what you are already trying? -- Gabe Turner gabe at msi.umn.edu HPC Systems Administrator, University of Minnesota Supercomputing Institute http://www.msi.umn.edu _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From gabe at msi.umn.edu Wed Apr 4 11:51:02 2012 From: gabe at msi.umn.edu (Gabe Turner) Date: Wed, 4 Apr 2012 12:51:02 -0500 Subject: [torqueusers] TORQUE 4.0 and hwloc In-Reply-To: <20120404170947.GC30037@blackice.msi.umn.edu> References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC0F4A@ORSMSX106.amr.corp.intel.com> <20120404170947.GC30037@blackice.msi.umn.edu> Message-ID: <20120404175102.GD30037@blackice.msi.umn.edu> On Wed, Apr 04, 2012 at 12:09:47PM -0500, Gabe Turner wrote: > It looks to me like the spec file is supporting the --with option to > rpmbuild. So cpuset will be enabled as a configure option if you pass > '--with cpuset' to rpmbuild. Is that what you are already trying? I just did this to build the RPMs with support for cpusets. Admittedly, it is a bit cumbersome, though perhaps only because I have hwloc installed in a centralized location and not from an RPM. 
gabe@node1084 [~/torque-4.0.1] % make rpm HWLOC_CFLAGS='-I/soft/hwloc/1.4.1/include' HWLOC_LIBS='-L/soft/hwloc/1.4.1/lib -lhwloc' RPM_AC_OPTS+='--with cpuset'

gabe@node1084 [~/torque-4.0.1] % rpm -qRp ~/rpmbuild/RPMS/x86_64/torque-client-4.0.1-1.cri.x86_64.rpm
torque = 4.0.1-1.cri
.
.
.
libhwloc.so.5()(64bit)
.
.
.

--
Gabe Turner gabe at msi.umn.edu
HPC Systems Administrator, University of Minnesota
Supercomputing Institute http://www.msi.umn.edu

From stevenx.a.duchene at intel.com Wed Apr 4 13:06:44 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Wed, 4 Apr 2012 19:06:44 +0000
Subject: [torqueusers] TORQUE 4.0 and hwloc
In-Reply-To: <20120404175102.GD30037@blackice.msi.umn.edu>
References: <560DBE57F33C4C4C9FBF11C662951AF805ABC89E@ORSMSX106.amr.corp.intel.com> <4F66C6BE.3010605@vpac.org> <560DBE57F33C4C4C9FBF11C662951AF805ABD031@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC09DA@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF805AC0F4A@ORSMSX106.amr.corp.intel.com> <20120404170947.GC30037@blackice.msi.umn.edu> <20120404175102.GD30037@blackice.msi.umn.edu>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805AC1006@ORSMSX106.amr.corp.intel.com>

Yeah, when I passed "--with cpuset" to the rpmbuild command I got the following in my stdout from the actual rpm build process (from the configure part, actually):

checking whether to allow geometry requests... no
checking whether to support NUMA systems... no
checking for HWLOC... yes
checking for hwloc_bitmap_alloc in -lhwloc... yes

and then during the actual compile I see a -lhwloc being linked in.
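
The check described above (looking for the hwloc lines in the configure output) can be repeated against a saved build log. The following is only a sketch and not part of TORQUE: the function name hwloc_checked and the file name build.log are made up for illustration, and the rpmbuild invocation in the comment assumes the spec file is called torque.spec.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): report whether a saved build log
# contains the two configure checks that show hwloc was probed and found.
hwloc_checked() {
    log_file="$1"
    if grep -q 'checking for HWLOC\.\.\. yes' "$log_file" &&
       grep -q 'checking for hwloc_bitmap_alloc in -lhwloc\.\.\. yes' "$log_file"; then
        echo "hwloc detected"
    else
        echo "hwloc not detected"
    fi
}

# Assumed usage; build.log is an illustrative file name:
#   rpmbuild -ba torque.spec --with cpuset 2>&1 | tee build.log
#   hwloc_checked build.log
```

If the function reports "hwloc not detected" for a build that was meant to use cpusets, the likely first thing to check is whether '--with cpuset' actually reached rpmbuild.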
-- Steven DuChene -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gabe Turner Sent: Wednesday, April 04, 2012 10:51 AM To: torqueusers at supercluster.org Subject: Re: [torqueusers] TORQUE 4.0 and hwloc On Wed, Apr 04, 2012 at 12:09:47PM -0500, Gabe Turner wrote: > It looks to me like the spec file is supporting the --with option to > rpmbuild. So cpuset will be enabled as a configure option if you pass > '--with cpuset' to rpmbuild. Is that what you are already trying? I just did this to build the RPMs with support for cpusets. Admittedly, it is a bit cumbersome, though perhaps only because I have hwloc installed in a centralized location and not from an RPM. gabe at node1084 [~/torque-4.0.1] % make rpm HWLOC_CFLAGS='-I/soft/hwloc/1.4.1/include' HWLOC_LIBS='-L/soft/hwloc/1.4.1/lib -lhwloc' RPM_AC_OPTS+='--with cpuset' gabe at node1084 [~/torque-4.0.1] % rpm -qRp ~/rpmbuild/RPMS/x86_64/torque-client-4.0.1-1.cri.x86_64.rpm torque = 4.0.1-1.cri . . . libhwloc.so.5()(64bit) . . . -- Gabe Turner gabe at msi.umn.edu HPC Systems Administrator, University of Minnesota Supercomputing Institute http://www.msi.umn.edu _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From giggzounet at gmail.com Thu Apr 5 01:39:37 2012 From: giggzounet at gmail.com (giggzounet) Date: Thu, 05 Apr 2012 09:39:37 +0200 Subject: [torqueusers] Installation of Torque/Maui on Red Hat Santiago 6.2 Message-ID: Hi, I will install Torque/Maui on our cluster (Red Hat Santiago 6.2). I took the Torque version 2.5.9 and Maui 3.3.1. 
I have configured Torque with the following options: ./configure --prefix=/usr --mandir=/usr/share/man --libdir=/usr/lib/torque/lib64 --with-server_name=frontend.service --with-pam --disable-gui --with-scp --with-rcp=scp --with-modulefiles=/etc/modulefiles CC="gcc -m64" then generate the packages with make rpm. no error. Before I interrupt the production on the cluster. I would like to know if the configure options are ok ? Which one could also be useful ? Is the maui 3.3.1 version compatible with the Torque version 2.5.9 ? Thx a lot, Guillaume From knielson at adaptivecomputing.com Thu Apr 5 14:50:09 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 5 Apr 2012 14:50:09 -0600 Subject: [torqueusers] New 3.0.5 release candidate available Message-ID: Hi all, There is a new TORQUE 3.0.5 release candidate available. This candidate has a fix for yet another case where procct does not get deleted before getting passed to the scheduler making it so a job cannot run. See the CHANGELOG for all updates. The code can be downloaded at http://www.adaptivecomputing.com/resources/downloads/torque/snapshots/torque-3.0.5-snap.201204051313.tar.gz Please download this if you are using 3.0.x and let us know if you find any problems. Regards Ken Nielson Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120405/c1c2641e/attachment.html From sebelk at gmail.com Mon Apr 9 04:20:08 2012 From: sebelk at gmail.com (Sergio Belkin) Date: Mon, 9 Apr 2012 07:20:08 -0300 Subject: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine) Message-ID: Hi, I'm using torque-mom-3.0.3 on Fedora 16. I'm completely newbie about of torque and I'm testing a pbs_server on a virtual machine an a pbs_client on the host. 
pbs_mom complains as follows on node (client machine): pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/lib/torque/spool/270.mpimaster.mycluster.OU sergio at mpimaster.mycluster:/home/sergio/STDIN.o270' failed with status=1, giving up after 4 attempts pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.OU to sergio at mpimaster.mycluster:/home/sergio/STDIN.o270 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/lib/torque/spool/270.mpimaster.mycluster.ER sergio at mpimaster.mycluster:/home/sergio/STDIN.e270' failed with status=1, giving up after 4 attempts pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.ER to sergio at mpimaster.mycluster:/home/sergio/STDIN.e270 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/lib/torque/spool/270.mpimaster.mycluster.OU to sergio at mpimaster.mycluster:/home/sergio/STDIN.o270 *** error from copy Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password). lost connection *** end error output Output retained on that host in: /var/lib/torque/undelivered/270.mpimaster.mycluster.OU I've read documentation and google about this problem and it don't seem to be a problem of ssh/scp. So: *I've tested /usr/bin/scp -rpB somefile sergio at mpimaster.mycluster:/home/sergio and works with no problem *I've tested putting scp into crontab and works fine too Of course mpimaster.mycluster is in /home/sergio./known_hosts matches on mpinode02 (client machine with pbs mom running) with /etc/ssh/ssh_host_rsa_key.pub on mpimaster.mycluster ... (I use keychain on both cases) So, I don't know what I am doing wrong. Please could you help me to solve this problem? Thanks in advance! 
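A hedged aside on the checks above: a manual scp from a login shell and the scp that pbs_mom launches may not see the same environment. keychain exports SSH_AUTH_SOCK into shells that source its file, and a daemon started at boot would normally not have that variable. The sketch below only illustrates that difference under this assumption; check_agent_env is a hypothetical helper, not part of Torque or OpenSSH.

```shell
#!/bin/sh
# Sketch only: report whether the current environment can reach an ssh-agent.
# check_agent_env is a hypothetical helper name used for illustration.
check_agent_env() {
    if [ -n "${SSH_AUTH_SOCK:-}" ] && [ -S "$SSH_AUTH_SOCK" ]; then
        echo "agent socket visible"
    else
        echo "no agent socket"
    fi
}

# In an interactive shell where keychain has been sourced, this should
# report that the agent socket is visible.
check_agent_env

# The same check with the variable removed, which is closer to what a
# daemon started outside a login session would see.
( unset SSH_AUTH_SOCK; check_agent_env )

# To repeat the failing copy by hand under the same conditions (host and
# paths taken from the log above), something like this could be tried:
#   env -u SSH_AUTH_SOCK /usr/bin/scp -rpB somefile sergio@mpimaster.mycluster:/home/sergio
```

If the first call reports a visible socket and the second does not, a passphrase-protected key will only work in the interactive case, which would be consistent with scp succeeding by hand and failing under pbs_mom.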
--
--
Sergio Belkin http://www.sergiobelkin.com
Watch More TV http://sebelk.blogspot.com
LPIC-2 Certified - http://www.lpi.org

From sebelk at gmail.com Mon Apr 9 05:41:55 2012
From: sebelk at gmail.com (Sergio Belkin)
Date: Mon, 9 Apr 2012 08:41:55 -0300
Subject: [torqueusers] OT: Don't read
Message-ID:

This is only a test

--
--
Sergio Belkin http://www.sergiobelkin.com
Watch More TV http://sebelk.blogspot.com
LPIC-2 Certified - http://www.lpi.org

From dbeer at adaptivecomputing.com Mon Apr 9 15:34:42 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 9 Apr 2012 15:34:42 -0600
Subject: [torqueusers] Does anyone use #shared?
Message-ID:

All,

Does anyone out there use shared execution slots? This feature allows any number of jobs to be assigned to the same execution slot, as long as each job requests only a shared processor. I don't know of any customer that uses these, and I'd like to remove the supporting code from the post-4.0 TORQUE (trunk in Subversion). This would simplify a number of routines and get rid of quite a bit of spaghetti code, so it'd be great if nobody uses it. Does anyone have objections to removing this 'feature'?

--
David Beer | Software Engineer
Adaptive Computing

From shantanugadgil at yahoo.com Mon Apr 9 22:07:20 2012
From: shantanugadgil at yahoo.com (Shantanu Gadgil)
Date: Mon, 9 Apr 2012 21:07:20 -0700 (PDT)
Subject: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine)
In-Reply-To:
Message-ID: <1334030840.30437.YahooMailClassic@web120606.mail.ne1.yahoo.com>

Hi,

Let's assume the following ...
You are submitting from 'submit_node' while logged in as user 'sergio'.
The job gets scheduled for 'client_node'.
( I cant make out the client node's hostname from the logs below) The reason is that the 'root at client_node' (pbs_mom is running as root) is not able to scp the file to 'segio at submit_node'. I would use the following steps to get over this: Login into the 'client_node' as 'root'. (Repeat following steps for each client_node in the cluster) Try to ssh into the 'sergio at sumbit_node' (these could be more than one if you have allowed many machine to be submit nodes) Also, from root at client_node ssh into the 'submit_node' using the FQDN ... the FQDN is usually what the pbs_mom uses. Password less ssh should work in both cases!!! Regards, Shantanu --- On Mon, 4/9/12, Sergio Belkin wrote: > From: Sergio Belkin > Subject: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine) > To: torqueusers at supercluster.org > Date: Monday, April 9, 2012, 3:50 PM > Hi, > > I'm using torque-mom-3.0.3 on Fedora 16. I'm completely > newbie about > of torque and I'm testing a pbs_server on a virtual machine > an a > pbs_client on the host. 
> --
> --
> Sergio Belkin http://www.sergiobelkin.com
> Watch More TV http://sebelk.blogspot.com
> LPIC-2 Certified - http://www.lpi.org

From stevenx.a.duchene at intel.com Tue Apr 10 16:00:16 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 10 Apr 2012 22:00:16 +0000
Subject: [torqueusers] init script not stopping pbs_server process with 4.0-fixes
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF805FA5342@fmsmsx152.amr.corp.intel.com>

I have noticed on our development cluster, where I am running the 4.0-fixes TORQUE RPMs, that the /etc/init.d/pbs_server script does not stop the pbs_server process when I issue a stop through it. To actually get the process to stop I have to run kill -9 on the PID from the command line.

On my system running TORQUE 2.5.7 the pbs_server init script does work and kills the process, but with the 4.0-fixes code on the test cluster it does not. The two scripts are in fact slightly different. In the 2.5.7 version the killproc line in the init script reads:

killproc pbs_server

In the init script installed with the 4.0-fixes RPMs the same line reads:

killproc pbs_server -TERM

From reading through /etc/init.d/functions and looking at other init scripts on the system, this -TERM should make killproc send SIGTERM (signal 15, rather than the -9 I use by hand) to the process, but as I said, this does not seem to stop pbs_server.

--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120410/b08cce60/attachment.html From sebelk at gmail.com Thu Apr 12 12:11:34 2012 From: sebelk at gmail.com (Sergio Belkin) Date: Thu, 12 Apr 2012 15:11:34 -0300 Subject: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine) In-Reply-To: <1334030840.30437.YahooMailClassic@web120606.mail.ne1.yahoo.com> References: <1334030840.30437.YahooMailClassic@web120606.mail.ne1.yahoo.com> Message-ID: 2012/4/10 Shantanu Gadgil : > Hi, > > Lets assume the following ... > You are submitting from 'submit_node' while logged in as user 'sergio' > The job gets scheduled for 'client_node'. ( I cant make out the client node's hostname from the logs below) > > The reason is that the 'root at client_node' (pbs_mom is running as root) is not able to scp the file to 'segio at submit_node'. > > I would use the following steps to get over this: > > Login into the 'client_node' as 'root'. (Repeat following steps for each client_node in the cluster) > Try to ssh into the 'sergio at sumbit_node' (these could be more than one if you have allowed many machine to be submit nodes) > > Also, from root at client_node ssh into the 'submit_node' using the FQDN ... the FQDN is usually what the pbs_mom uses. > > Password less ssh should work in both cases!!! > > Regards, > Shantanu > > > --- On Mon, 4/9/12, Sergio Belkin wrote: > >> From: Sergio Belkin >> Subject: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine) >> To: torqueusers at supercluster.org >> Date: Monday, April 9, 2012, 3:50 PM >> Hi, >> >> I'm using torque-mom-3.0.3 on Fedora 16. I'm completely >> newbie about >> of torque and I'm testing a pbs_server on a virtual machine >> an a >> pbs_client on the host. 
Thanks Shantanu for your answer.
It is still failing. The setup:

mpinode02.mycluster is a compute node
mpimaster.mycluster is the server
sergio is the non-root user that submits jobs

I've tried creating /root/.ssh/config:

Host mpimaster.mycluster
    User sergio
    GSSAPIAuthentication no
    IdentityFile ~sergio/.ssh/id_rsa

and appending the following to /root/.bashrc:

/usr/bin/keychain --nogui ~sergio/.ssh/id_rsa
source ~sergio/.keychain/sebelk.argentina-sh

So I log in on mpinode02 as the test user (test is a non-root user), then run "su root", and I can ssh and scp to mpimaster with no problem. But when I submit a job via Torque, it fails again as in my first post.

I don't know what I am doing wrong :( Could you please help me?

Thanks in advance!

--
--
Sergio Belkin http://www.sergiobelkin.com
Watch More TV http://sebelk.blogspot.com
LPIC-2 Certified - http://www.lpi.org

From shantanugadgil at yahoo.com Thu Apr 12 12:36:44 2012
From: shantanugadgil at yahoo.com (Shantanu Gadgil)
Date: Thu, 12 Apr 2012 11:36:44 -0700 (PDT)
Subject: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine)
In-Reply-To:
Message-ID: <1334255804.35679.YahooMailClassic@web120603.mail.ne1.yahoo.com>

Hi Sergio,

Please see my comments inline ... I have a few queries and ideas ... hopefully they'll help ... :)

--- On Thu, 4/12/12, Sergio Belkin wrote:

> From: Sergio Belkin
> Subject: Re: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine)
> To: "Torque Users Mailing List"
> Date: Thursday, April 12, 2012, 11:41 PM
> 2012/4/10 Shantanu Gadgil :
> > Hi,
> >
> > Lets assume the following ...
> > You are submitting from 'submit_node' while logged in as user 'sergio'
> > The job gets scheduled for 'client_node'. ( I cant make out the client node's hostname from the logs below)
> >
> > The reason is that the 'root at client_node' (pbs_mom is running as root) is not able to scp the file to 'segio at submit_node'.
> > > > I would use the following steps to get over this: > > > > Login into the 'client_node' as 'root'. (Repeat > following steps for each client_node in the cluster) > > Try to ssh into the 'sergio at sumbit_node' (these could > be more than one if you have allowed many machine to be > submit nodes) > > > > Also, from root at client_node ssh into the 'submit_node' > using the FQDN ... the FQDN is usually what the pbs_mom > uses. > > > > Password less ssh should work in both cases!!! > > > > Regards, > > Shantanu > > > > > > --- On Mon, 4/9/12, Sergio Belkin > wrote: > > > >> From: Sergio Belkin > >> Subject: [torqueusers] Unable to copy output and > error files to the submission dir (scp works fine) > >> To: torqueusers at supercluster.org > >> Date: Monday, April 9, 2012, 3:50 PM > >> Hi, > >> > >> I'm using torque-mom-3.0.3 on Fedora 16. I'm > completely > >> newbie about > >> of torque and I'm testing a pbs_server on a virtual > machine > >> an a > >> pbs_client on the host. pbs_mom complains as > follows on node > >> (client > >> machine): > >> > >> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp > -rpB > >> /var/lib/torque/spool/270.mpimaster.mycluster.OU > >> sergio at mpimaster.mycluster:/home/sergio/STDIN.o270' > >> failed with > >> status=1, giving up after 4 attempts > >> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy > file > >> /var/lib/torque/spool/270.mpimaster.mycluster.OU > to > >> sergio at mpimaster.mycluster:/home/sergio/STDIN.o270 > >> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp > -rpB > >> /var/lib/torque/spool/270.mpimaster.mycluster.ER > >> sergio at mpimaster.mycluster:/home/sergio/STDIN.e270' > >> failed with > >> status=1, giving up after 4 attempts > >> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy > file > >> /var/lib/torque/spool/270.mpimaster.mycluster.ER > to > >> sergio at mpimaster.mycluster:/home/sergio/STDIN.e270 > >> pbs_mom: LOG_ERROR::req_cpyfile, > >> > >> Unable to copy file > >> 
/var/lib/torque/spool/270.mpimaster.mycluster.OU > >> to sergio at mpimaster.mycluster:/home/sergio/STDIN.o270 > >> *** error from copy > >> Permission denied > >> (publickey,gssapi-keyex,gssapi-with-mic,password). > >> lost connection > >> *** end error output > >> Output retained on that host in: > >> > /var/lib/torque/undelivered/270.mpimaster.mycluster.OU > >> > >> I've read documentation and google about this > problem and it > >> don't > >> seem to be a problem of ssh/scp. So: > >> > >> *I've tested /usr/bin/scp -rpB somefile > >> sergio at mpimaster.mycluster:/home/sergio???and > >> works with no problem > >> *I've tested putting scp into crontab and works > fine too > >> > >> Of course mpimaster.mycluster is in > >> /home/sergio./known_hosts matches > >> on mpinode02 (client machine with pbs mom running) > with > >> /etc/ssh/ssh_host_rsa_key.pub on > mpimaster.mycluster ... > >> > >> (I use keychain on both cases) > >> > >> So, I don't know what I am doing wrong. Please > could you > >> help me to > >> solve this problem? > >> > >> Thanks in advance! > >> -- > >> -- > > Thanks Shantanu for your answer. > > It still failing: > > mpinode02.mycluster is a computing node > mpimaster.mycluster is the server > sergio is the non-root user that submits jobs > > I've tried: > > Creating /root/.ssh/config > > Host mpimaster.mycluster > ? ? User sergio > ? ? GSSAPIAuthentication no > ? ? IdentityFile ~sergio/.ssh/id_rsa > > > And appending to /root/.bashrc the following: > > /usr/bin/keychain --nogui ~sergio/.ssh/id_rsa > source ~sergio/.keychain/sebelk.argentina-sh > > So I login on mpinode02 as test user (test is a? > non-root user), then > I run "su root" and I could do ssh and? scp to > mpimaster with no > problem, but when I submit a job via torque, failing again > as my first > post. I presume the user 'sergio' has a shared home directory on mpimaster and mpinode2 ? Which root user? mpimaster or mpinode2 ?!? I am confused a little here ... so I'll ask again ... 
* you sshed from root at mpinode2 to sergio at mpimaster ? OR * you sshed from root at mpinode2 to root at mpimaster? Did you try the fqdn while doing ssh from root at mpinode2? Can you please post the actual commandlines that you used ? Maybe pbs_mom doesn't source the startup script (.bashrc) and so never knows about the keychain ?!? Is it possible to try the above without keychain? i.e. append pubkey of root at mpinode2 to the file ~sergio/.ssh/authorized_keys (and root at mpimaster) ... and see if password less ssh works that way? Also, on a side note ... do you think the pbs_mom's "usecp" directive could help you get around this rather than allow "ssh" at all ?!? Ref: http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/a.cmomconfig.php Regards, Shantanu From Gareth.Williams at csiro.au Fri Apr 13 01:00:56 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Fri, 13 Apr 2012 17:00:56 +1000 Subject: [torqueusers] Does anyone use #shared? In-Reply-To: References: Message-ID: <007DECE986B47F4EABF823C1FBB19C620102DCC742F4@exvic-mbx04.nexus.csiro.au> I have no objection. We are keen on using cpusets and allocating/dedicating cores. We did run a custom setup for many years with a 'overload' queue which put jobs in a special shared cpuset and faked the number of cpus to make the scheduler work compatibly. The facility was not valued by the user-base. We've more-or-less abandoned that idea now that we're using the cpuset integration in torque which would not easily support such a model - and don't care much. In any case, we never used they #shared feature. Gareth From: David Beer [mailto:dbeer at adaptivecomputing.com] Sent: Tuesday, 10 April 2012 7:35 AM To: Torque Users Mailing List Subject: [torqueusers] Does anyone use #shared? All, Does anyone out there used shared execution slots? This feature allows any number of jobs to be assigned to the same execution slot, as long as the job requests only a shared processor. 
I don't know of any customer that uses these, and I'd like to remove the code to support this from the post 4.0 TORQUE (trunk in subversion). This would simplify a number of routines and get rid of quite a bit of spaghetti code, so it'd be great if nobody uses it. Does anyone have objections to removing this 'feature'? -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120413/a633067a/attachment.html From dbeer at adaptivecomputing.com Fri Apr 13 09:30:53 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Fri, 13 Apr 2012 09:30:53 -0600 Subject: [torqueusers] Does anyone use #shared? In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102DCC742F4@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C620102DCC742F4@exvic-mbx04.nexus.csiro.au> Message-ID: On Fri, Apr 13, 2012 at 1:00 AM, wrote: > I have no objection. **** > > ** ** > > We are keen on using cpusets and allocating/dedicating cores. We did run > a custom setup for many years with a ?overload? queue which put jobs in a > special shared cpuset and faked the number of cpus to make the scheduler > work compatibly. The facility was not valued by the user-base. We?ve > more-or-less abandoned that idea now that we?re using the cpuset > integration in torque which would not easily support such a model ? and > don?t care much.**** > > ** ** > > In any case, we never used they #shared feature.**** > > ** > Gareth, Thanks for your reply. It doesn't sound like being able to 'overload' things is important to you, but I will point out that this would still be possible by claiming that there are more cpus than actually exist on the system. I think that there are some sites that do this (although I'm not completely certain who). 
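As a rough illustration of the "claim more cpus than actually exist" approach mentioned above: the nodes file takes a per-node np= value, so overloading amounts to writing an np larger than the real core count. The sketch below only prints such a line. The oversubscribed_entry helper name and the factor of 2 are assumptions made for illustration, and as discussed elsewhere in this thread the approach may not combine well with cpuset support.

```shell
#!/bin/sh
# Sketch only: print a nodes-file entry that advertises more execution
# slots than the machine has cores. oversubscribed_entry is a hypothetical
# helper; the "<hostname> np=<count>" form is the usual nodes-file syntax.
oversubscribed_entry() {
    host=$1      # node name as it appears in the nodes file
    cores=$2     # real number of cores on that node
    factor=$3    # how many slots to advertise per real core
    echo "$host np=$((cores * factor))"
}

# Example: a 12-core node advertised with twice as many slots.
oversubscribed_entry gpunode01 12 2
```

The printed line is what would go into the server's nodes file in place of the node's existing entry; the scheduler then sees 24 slots on a 12-core node.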
The #shared feature isn't something that any scheduler I'm aware of can schedule, which is just another reason I see to remove the code since it would greatly simplify the code path. I'm going to wait a bit more to see what other responses we get because this would mean removing it, but judging by the lack of objections it seems that we may go ahead and remove this from the trunk. David > ** > > Gareth**** > > ** ** > > *From:* David Beer [mailto:dbeer at adaptivecomputing.com] > *Sent:* Tuesday, 10 April 2012 7:35 AM > *To:* Torque Users Mailing List > *Subject:* [torqueusers] Does anyone use #shared?**** > > ** ** > > All,**** > > ** ** > > Does anyone out there used shared execution slots? This feature allows any > number of jobs to be assigned to the same execution slot, as long as the > job requests only a shared processor. I don't know of any customer that > uses these, and I'd like to remove the code to support this from the post > 4.0 TORQUE (trunk in subversion). This would simplify a number of routines > and get rid of quite a bit of spaghetti code, so it'd be great if nobody > uses it. Does anyone have objections to removing this 'feature'? > **** > > ** ** > > -- **** > > David Beer | Software Engineer**** > > Adaptive Computing**** > > ** ** > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120413/c2024c2c/attachment.html From sebelk at gmail.com Mon Apr 16 04:25:53 2012 From: sebelk at gmail.com (Sergio Belkin) Date: Mon, 16 Apr 2012 07:25:53 -0300 Subject: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine) In-Reply-To: <1334255804.35679.YahooMailClassic@web120603.mail.ne1.yahoo.com> References: <1334255804.35679.YahooMailClassic@web120603.mail.ne1.yahoo.com> Message-ID: 2012/4/12 Shantanu Gadgil : > Hi Sergio, > > Please see my comments inline ... I have few queries and idea ... hopefully they'll help ... :) > > --- On Thu, 4/12/12, Sergio Belkin wrote: > >> From: Sergio Belkin >> Subject: Re: [torqueusers] Unable to copy output and error files to the submission dir (scp works fine) >> To: "Torque Users Mailing List" >> Date: Thursday, April 12, 2012, 11:41 PM >> 2012/4/10 Shantanu Gadgil : >> > Hi, >> > >> > Lets assume the following ... >> > You are submitting from 'submit_node' while logged in >> as user 'sergio' >> > The job gets scheduled for 'client_node'. ( I cant make >> out the client node's hostname from the logs below) >> > >> > The reason is that the 'root at client_node' (pbs_mom is >> running as root) is not able to scp the file to >> 'segio at submit_node'. >> > >> > I would use the following steps to get over this: >> > >> > Login into the 'client_node' as 'root'. (Repeat >> following steps for each client_node in the cluster) >> > Try to ssh into the 'sergio at sumbit_node' (these could >> be more than one if you have allowed many machine to be >> submit nodes) >> > >> > Also, from root at client_node ssh into the 'submit_node' >> using the FQDN ... the FQDN is usually what the pbs_mom >> uses. >> > >> > Password less ssh should work in both cases!!! 
>> > >> > Regards, >> > Shantanu >> > >> > >> > --- On Mon, 4/9/12, Sergio Belkin >> wrote: >> > >> >> From: Sergio Belkin >> >> Subject: [torqueusers] Unable to copy output and >> error files to the submission dir (scp works fine) >> >> To: torqueusers at supercluster.org >> >> Date: Monday, April 9, 2012, 3:50 PM >> >> Hi, >> >> >> >> I'm using torque-mom-3.0.3 on Fedora 16. I'm >> completely >> >> newbie about >> >> of torque and I'm testing a pbs_server on a virtual >> machine >> >> an a >> >> pbs_client on the host. pbs_mom complains as >> follows on node >> >> (client >> >> machine): >> >> >> >> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp >> -rpB >> >> /var/lib/torque/spool/270.mpimaster.mycluster.OU >> >> sergio at mpimaster.mycluster:/home/sergio/STDIN.o270' >> >> failed with >> >> status=1, giving up after 4 attempts >> >> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy >> file >> >> /var/lib/torque/spool/270.mpimaster.mycluster.OU >> to >> >> sergio at mpimaster.mycluster:/home/sergio/STDIN.o270 >> >> pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp >> -rpB >> >> /var/lib/torque/spool/270.mpimaster.mycluster.ER >> >> sergio at mpimaster.mycluster:/home/sergio/STDIN.e270' >> >> failed with >> >> status=1, giving up after 4 attempts >> >> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy >> file >> >> /var/lib/torque/spool/270.mpimaster.mycluster.ER >> to >> >> sergio at mpimaster.mycluster:/home/sergio/STDIN.e270 >> >> pbs_mom: LOG_ERROR::req_cpyfile, >> >> >> >> Unable to copy file >> >> /var/lib/torque/spool/270.mpimaster.mycluster.OU >> >> to sergio at mpimaster.mycluster:/home/sergio/STDIN.o270 >> >> *** error from copy >> >> Permission denied >> >> (publickey,gssapi-keyex,gssapi-with-mic,password). 
>> >> lost connection >> >> *** end error output >> >> Output retained on that host in: >> >> >> /var/lib/torque/undelivered/270.mpimaster.mycluster.OU >> >> >> >> I've read documentation and google about this >> problem and it >> >> don't >> >> seem to be a problem of ssh/scp. So: >> >> >> >> *I've tested /usr/bin/scp -rpB somefile >> >> sergio at mpimaster.mycluster:/home/sergio???and >> >> works with no problem >> >> *I've tested putting scp into crontab and works >> fine too >> >> >> >> Of course mpimaster.mycluster is in >> >> /home/sergio./known_hosts matches >> >> on mpinode02 (client machine with pbs mom running) >> with >> >> /etc/ssh/ssh_host_rsa_key.pub on >> mpimaster.mycluster ... >> >> >> >> (I use keychain on both cases) >> >> >> >> So, I don't know what I am doing wrong. Please >> could you >> >> help me to >> >> solve this problem? >> >> >> >> Thanks in advance! >> >> -- >> >> -- >> >> Thanks Shantanu for your answer. >> >> It still failing: >> >> mpinode02.mycluster is a computing node >> mpimaster.mycluster is the server >> sergio is the non-root user that submits jobs >> >> I've tried: >> >> Creating /root/.ssh/config >> >> Host mpimaster.mycluster >> ? ? User sergio >> ? ? GSSAPIAuthentication no >> ? ? IdentityFile ~sergio/.ssh/id_rsa >> >> >> And appending to /root/.bashrc the following: >> >> /usr/bin/keychain --nogui ~sergio/.ssh/id_rsa >> source ~sergio/.keychain/sebelk.argentina-sh >> >> So I login on mpinode02 as test user (test is a >> non-root user), then >> I run "su root" and I could do ssh and? scp to >> mpimaster with no >> problem, but when I submit a job via torque, failing again >> as my first >> post. > > I presume the user 'sergio' has a shared home directory on mpimaster and mpinode2 ? No he hasn't. Sorry for the question if it is stupid: Is that an error? > > Which root user? mpimaster or mpinode2 ?!? mpinode02 (client) root user > > I am confused a little here ... so I'll ask again ... 
> * you sshed from root at mpinode2 to sergio at mpimaster ?
> OR
> * you sshed from root at mpinode2 to root at mpimaster?

I've sshed from root at mpinode2 to sergio at mpimaster.

>
> Did you try the fqdn while doing ssh from root at mpinode2?

Yes, I did.

>
> Can you please post the actual commandlines that you used ?

Yes, I can. Remember that test is a user on mpinode02. I jumped between users like this to demonstrate that neither root nor sergio launches a login shell:

[sergio at sebelk ~]$ su - test
Contraseña:
[test at sebelk ~]$ su
Contraseña:
KeyChain 2.6.8; http://www.gentoo.org/proj/en/keychain/
Copyright 2002-2004 Gentoo Foundation; Distributed under the GPL
 * Found existing ssh-agent (2534)
 * Found existing gpg-agent (2560)
 * Adding 1 ssh key(s)...
Enter passphrase for /root/.ssh/id_rsa:
Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
[root at sebelk test]# ssh sergio at mpimaster.mycluster
Last login: Mon Apr 16 06:49:39 2012 from mpinode02.mycluster
KeyChain 2.6.8; http://www.gentoo.org/proj/en/keychain/
Copyright 2002-2004 Gentoo Foundation; Distributed under the GPL
 * Found existing ssh-agent (1149)
 * Found existing gpg-agent (1175)
 * Known ssh key: /home/sergio/.ssh/id_rsa
[sergio at mpimaster ~]$

As you can see, the user is never prompted for a passphrase or a password on the ssh itself. Note that root does not launch a login shell and even so can ssh as sergio to mpimaster.mycluster. That's because I've appended the following to /root/.bashrc on mpinode02:

## START-Keychain ###
# Let re-use ssh-agent and/or gpg-agent between logins
#/usr/bin/keychain --nogui ~sergio/.ssh/id_rsa
/usr/bin/keychain --nogui /root/.ssh/id_rsa
#source ~sergio/.keychain/sebelk.argentina-sh
source /root/.keychain/sebelk.argentina-sh
## End-Keychain ###

>
> Maybe pbs_mom doesn't source the startup script (.bashrc) and so never knows about the keychain ?!?

Sure, I think so too.

>
> Is it possible to try the above without keychain?

keychain is what allows the passwordless login.

>
> i.e.
append pubkey of root at mpinode2 to the file ~sergio/.ssh/authorized_keys (and root at mpimaster) pubkey of of root at mpinode2 is already at ~sergio/.ssh/authorized_keys and root at mpimaster > > ... and see if password less ssh works that way? No, in this case it asks for passphrase > > Also, on a side note ... do you think the pbs_mom's "usecp" directive could help you get around this rather than allow "ssh" at all ?!? > Ref: http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/a.cmomconfig.php I've tried to not use NFS and NIS they are ancient and insecure services (perhaps I'm wrong), but well, maybe is a better option What do you think? > > Regards, > Shantanu > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- -- Sergio Belkin ?http://www.sergiobelkin.com Watch More TV http://sebelk.blogspot.com LPIC-2 Certified - http://www.lpi.org From Gareth.Williams at csiro.au Mon Apr 16 07:35:09 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Mon, 16 Apr 2012 23:35:09 +1000 Subject: [torqueusers] Does anyone use #shared? In-Reply-To: References: <007DECE986B47F4EABF823C1FBB19C620102DCC742F4@exvic-mbx04.nexus.csiro.au> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102DCC74304@exvic-mbx04.nexus.csiro.au> -snip- > Thanks for your reply. It doesn't sound like being able to 'overload' things is important to you, but I will point out that this would still be possible by claiming that there are more cpus than actually exist on the system. I think that there are some sites that do this (although I'm not completely certain who).? -snip- > David ? Hi David, 'claiming that there are more cpus than actually exist' will not easily work with integrated cpuset support. It might be nice to have a sort of detailed cpuset overloading but we're hard up enough getting what we have to be solid. 
Gareth From dbeer at adaptivecomputing.com Mon Apr 16 09:18:41 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 16 Apr 2012 09:18:41 -0600 Subject: [torqueusers] Does anyone use #shared? In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102DCC74304@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C620102DCC742F4@exvic-mbx04.nexus.csiro.au> <007DECE986B47F4EABF823C1FBB19C620102DCC74304@exvic-mbx04.nexus.csiro.au> Message-ID: On Mon, Apr 16, 2012 at 7:35 AM, wrote: > -snip- > > Thanks for your reply. It doesn't sound like being able to 'overload' > things is important to you, but I will point out that this would still be > possible by claiming that there are more cpus than actually exist on the > system. I think that there are some sites that do this (although I'm not > completely certain who). > -snip- > > > David > > Hi David, > > 'claiming that there are more cpus than actually exist' will not easily > work with integrated cpuset support. It might be nice to have a sort of > detailed cpuset overloading but we're hard up enough getting what we have > to be solid. > > In the past we have told people not to do this if they are going to use cpusets. I'm not really sure what the best way to handle that would be although we could think of something if this is something that people are hoping to figure out. David Gareth > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120416/1a682924/attachment.html

From jcahye at yahoo.com Wed Apr 11 01:59:40 2012
From: jcahye at yahoo.com (Jay R Santos)
Date: Wed, 11 Apr 2012 00:59:40 -0700 (PDT)
Subject: [torqueusers] how to install torque on a stand alone machine
Message-ID: <1334131180.98514.YahooMailNeo@web112806.mail.gq1.yahoo.com>

Hi all,

Can anyone please guide me on how to install torque on a single machine? I already followed the installation guide at http://www.clusterresources.com/torquedocs21/1.1installation.shtml but it seems I'm lost on the compute node part and so on. Sorry for being a newbie here, and I hope to get positive feedback.

Thanks,
Jay
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120411/96983586/attachment.html

From hocks at sdsc.edu Fri Apr 13 10:57:40 2012
From: hocks at sdsc.edu (Eva Hocks)
Date: Fri, 13 Apr 2012 09:57:40 -0700 (PDT)
Subject: [torqueusers] Does anyone use #shared?
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102DCC742F4@exvic-mbx04.nexus.csiro.au>
Message-ID:

We are using cpuset on the vSMP nodes in non-shared mode for compute jobs. Each job gets its own cpuset/cores assigned. Shared mode would come in handy, though, when using the cores from the IO nodes to access the local large filesystem from a user job. The lustre processes are dedicated to those cores for filesystem access through the IO node interfaces. We are still testing.

Thanks
Eva

On Fri, 13 Apr 2012 Gareth.Williams at csiro.au wrote:

> I have no objection.
>
> We are keen on using cpusets and allocating/dedicating cores. We did run a custom setup for many years with an 'overload' queue which put jobs in a special shared cpuset and faked the number of cpus to make the scheduler work compatibly. The facility was not valued by the user-base. We've more-or-less abandoned that idea now that we're using the cpuset integration in torque, which would not easily support such a model - and don't care much.
>
> In any case, we never used the #shared feature.
>
> Gareth
>
> From: David Beer [mailto:dbeer at adaptivecomputing.com]
> Sent: Tuesday, 10 April 2012 7:35 AM
> To: Torque Users Mailing List
> Subject: [torqueusers] Does anyone use #shared?
>
> All,
>
> Does anyone out there use shared execution slots? This feature allows any number of jobs to be assigned to the same execution slot, as long as the job requests only a shared processor. I don't know of any customer that uses these, and I'd like to remove the code to support this from the post 4.0 TORQUE (trunk in subversion). This would simplify a number of routines and get rid of quite a bit of spaghetti code, so it'd be great if nobody uses it. Does anyone have objections to removing this 'feature'?
>
> --
> David Beer | Software Engineer
> Adaptive Computing
>

From stevenx.a.duchene at intel.com Mon Apr 16 11:41:48 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Mon, 16 Apr 2012 17:41:48 +0000
Subject: [torqueusers] how to install torque on a stand alone machine
In-Reply-To: <1334131180.98514.YahooMailNeo@web112806.mail.gq1.yahoo.com>
References: <1334131180.98514.YahooMailNeo@web112806.mail.gq1.yahoo.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF80604658D@fmsmsx152.amr.corp.intel.com>

You would simply run the pbs_mom process, the pbs_server process and, assuming you are using the pbs_sched scheduler, that as well, all on the one system. You would also add the single host name to the /var/spool/torque/server_priv/nodes file and the same hostname to the /var/spool/torque/server_name file.
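
As a rough sketch of the two files described above (not an official TORQUE procedure), a small shell helper could write them for a single host. The function name write_single_node_config and the np count are illustrative assumptions; the np= syntax is the one used in nodes files elsewhere on this list, and /var/spool/torque may be a different directory on other installs.

```shell
#!/bin/sh
# Hypothetical helper: register one host as both server and compute node.
# $1 = TORQUE spool directory, $2 = host name, $3 = number of processors
write_single_node_config() {
    torque_home="$1"
    host="$2"
    np="${3:-1}"   # assumed default of one processor

    # The nodes file lives under server_priv; create it if missing.
    mkdir -p "$torque_home/server_priv" || return 1

    # server_name holds the host that pbs_server runs on.
    printf '%s\n' "$host" > "$torque_home/server_name"

    # nodes lists each compute node; here it is the same host.
    printf '%s np=%s\n' "$host" "$np" > "$torque_home/server_priv/nodes"
}

# Example (as root, against the real spool directory):
#   write_single_node_config /var/spool/torque "$(hostname)" 4
```

After that, pbs_server, pbs_mom and pbs_sched would all be started on the same machine. Trying the helper against a scratch directory first is a cheap way to check the resulting files before touching a live install.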
-- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Jay R Santos Sent: Wednesday, April 11, 2012 1:00 AM To: torqueusers at supercluster.org Subject: [torqueusers] how to install torque on a stand alone machine Hi all, can anyone please guide me on how to install torque on a single machine? I already followed installation guide at http://www.clusterresources.com/torquedocs21/1.1installation.shtml but it seems I'm lost on the compute node part and so on. Sorry for being a newbie here. and I hope to get positive feed backs. Thanks, Jay -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120416/ff61712b/attachment-0001.html From sjf4 at uw.edu Mon Apr 16 14:24:07 2012 From: sjf4 at uw.edu (Stephen Fralich) Date: Mon, 16 Apr 2012 20:24:07 +0000 Subject: [torqueusers] init script not stopping pbs_server process with 4.0-fixes In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF805FA5342@fmsmsx152.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF805FA5342@fmsmsx152.amr.corp.intel.com> Message-ID: I've also observed that -TERM does not actually cause pbs_server to shutdown in v4.0. It reports receiving the signal in the log, but does not actually ever shut down. I waited 10 minutes and it still had not shut down. I killed it with -9, but unfortunately when I started it back up it no longer knew about any of the queues and removed all the jobs. 
Stephen
________________________________
From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of DuChene, StevenX A [stevenx.a.duchene at intel.com]
Sent: Tuesday, April 10, 2012 3:00 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] init script not stopping pbs_server process with 4.0-fixes

I have noticed on our development cluster, where I am running the 4.0-fixes torque rpms, that to actually get the pbs_server process to stop running I have to do a kill -9 on the PID from the command line, since the /etc/init.d/pbs_server script does not work when issuing a stop.

If I run the pbs_server init script on my system running torque 2.5.7, the script does work and kills the process, but with the 4.0-fixes code I am running on a test cluster this does not work.

In fact the two scripts are slightly different. In the 2.5.7 version the killproc line in the init script reads:

killproc pbs_server

In the init script installed with the 4.0-fixes rpms the same line reads:

killproc pbs_server -TERM

From reading through the /etc/init.d/functions file and looking at other init scripts on the system, this -TERM should be sending the TERM signal (15, not 9) to the process, but as I said this does not seem to be working.
--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120416/3edc1444/attachment.html

From samuel at unimelb.edu.au Mon Apr 16 20:11:48 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 17 Apr 2012 12:11:48 +1000
Subject: [torqueusers] Does anyone use #shared?
In-Reply-To:
References:
Message-ID: <4F8CD164.8010901@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/04/12 07:34, David Beer wrote:

> Does anyone out there used shared execution slots?
This feature > allows any number of jobs to be assigned to the same execution > slot, as long as the job requests only a shared processor. I don't > know of any customer that uses these, and I'd like to remove the > code to support this from the post 4.0 TORQUE (trunk in > subversion). I've never heard of it before.. :-) At VPAC we did use the "ts" option to mark the login node as "timeshared" for jobs on the copyq (an idea shamelessly knocked off from APAC/NCI/ANU in Canberra - Hi Dave) but I'm not sure if it ever got much use. But no, I don't believe we use shared at all here. cheers! Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk+M0WQACgkQO2KABBYQAh+KCwCeMBbdyiXZ01Rb95PZaqWH+Mfg L2AAniWLZjosEcotzFcM0gHXe3VJDEAL =V7ZJ -----END PGP SIGNATURE----- From WJEdsall at dow.com Tue Apr 17 07:02:51 2012 From: WJEdsall at dow.com (Edsall, William (WJ)) Date: Tue, 17 Apr 2012 13:02:51 +0000 Subject: [torqueusers] strategies for bad nodes Message-ID: Hello list, I'm looking for ideas on how to prevent jobs from going to 'bad' nodes. There are a small handful of items which define a bad node for us such as ypbind not bound, maybe /scr is full, etc. We need to be able to customize this list. What might be built into torque to achieve this? It would be ideal if the node was not only passed by for a job but even offlined with a comment. Thanks, William -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120417/74b4d1e0/attachment.html

From akohlmey at cmm.chem.upenn.edu Tue Apr 17 07:25:16 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Tue, 17 Apr 2012 09:25:16 -0400
Subject: [torqueusers] strategies for bad nodes
In-Reply-To:
References:
Message-ID:

hello william,

On Tue, Apr 17, 2012 at 9:02 AM, Edsall, William (WJ) wrote:
> Hello list,
>
> I'm looking for ideas on how to prevent jobs from going to 'bad' nodes.
> There are a small handful of items which define a bad node for us such as
> ypbind not bound, maybe /scr is full, etc. We need to be able to customize
> this list.
> What might be built into torque to achieve this? It would be ideal if the
> node was not only passed by for a job but even offlined with a comment.

yes. you can do this via a node check script.

http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/11.2healthcheck.php

we use it to detect known problematic conditions or pre-failure warnings and have the node go offline.

cheers,
axel.

> Thanks,
>
> William
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From rmckay at adaptivecomputing.com Tue Apr 17 08:14:52 2012
From: rmckay at adaptivecomputing.com (Rick McKay)
Date: Tue, 17 Apr 2012 08:14:52 -0600
Subject: [torqueusers] strategies for bad nodes
In-Reply-To:
References:
Message-ID:

Hello William,

Michael Jennings at Lawrence Berkeley just did a great presentation about the Node Health Check subproject of Warewulf that you might want to look into, too. It's an excellent expansion of what's in the Adaptive TORQUE docs.
It's well-documented, implemented almost entirely in bash, and easy to extend for your specific needs.

http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check

Regards,
Rick

On Tue, Apr 17, 2012 at 7:02 AM, Edsall, William (WJ) wrote:

> Hello list,
>
> I'm looking for ideas on how to prevent jobs from going to 'bad' nodes.
> There are a small handful of items which define a bad node for us such as
> ypbind not bound, maybe /scr is full, etc. We need to be able to customize
> this list.
>
> What might be built into torque to achieve this? It would be ideal if the
> node was not only passed by for a job but even offlined with a comment.
>
> Thanks,
>
> William
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120417/a0a51ed7/attachment-0001.html

From dbeer at adaptivecomputing.com Tue Apr 17 09:34:35 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 17 Apr 2012 09:34:35 -0600
Subject: [torqueusers] Does anyone use #shared?
In-Reply-To: <4F8CD164.8010901@unimelb.edu.au>
References: <4F8CD164.8010901@unimelb.edu.au>
Message-ID:

On Mon, Apr 16, 2012 at 8:11 PM, Christopher Samuel wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 10/04/12 07:34, David Beer wrote:
>
> > Does anyone out there used shared execution slots? This feature
> > allows any number of jobs to be assigned to the same execution
> > slot, as long as the job requests only a shared processor. I don't
> > know of any customer that uses these, and I'd like to remove the
> > code to support this from the post 4.0 TORQUE (trunk in
> > subversion).
>
> I've never heard of it before..
:-) > > At VPAC we did use the "ts" option to mark the login node as > "timeshared" for jobs on the copyq (an idea shamelessly knocked off > from APAC/NCI/ANU in Canberra - Hi Dave) but I'm not sure if it ever > got much use. > > But no, I don't believe we use shared at all here. > Perhaps I'm using the wrong name for it. A #shared request in qsub would request a timeshared node. Again, I don't believe that this feature is compatible with Moab or Maui. Do you still use this feature Chris? How upset would you be if this feature didn't exist in the release after 4.0.*? David > > cheers! > Chris > - -- > Christopher Samuel - Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.unimelb.edu.au/ > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk+M0WQACgkQO2KABBYQAh+KCwCeMBbdyiXZ01Rb95PZaqWH+Mfg > L2AAniWLZjosEcotzFcM0gHXe3VJDEAL > =V7ZJ > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120417/f2facac5/attachment.html From sebelk at gmail.com Tue Apr 17 10:23:24 2012 From: sebelk at gmail.com (Sergio Belkin) Date: Tue, 17 Apr 2012 13:23:24 -0300 Subject: [torqueusers] torque is working with openmpi? Message-ID: Hi, I'm testing torque on Fedora 16. 
The problem is that jobs are not sent to

Data:

torque server: mpimaster.mycluster
torque client: mpinode02.mycluster

[sergio@mpimaster cluster]$ ompi_info | grep tm
                 MCA ras: tm (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA plm: tm (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA ess: tm (MCA v2.0, API v2.0, Component v1.5.4)

torque configuration:

[root@mpimaster sergio]# cat /etc/torque/pbs_environment
PATH=/bin:/usr/bin
LANG=C

cat /etc/torque/server_name
mpimaster.mycluster

[root@mpimaster sergio]# cat /etc/hosts
127.0.0.1       localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
192.168.122.1   mpinode02.mycluster mpinode02
192.168.122.2   mpimaster.mycluster mpimaster mpinode0

cat /var/lib/torque/server_priv/nodes
mpimaster np=1
mpinode02 np=2

[sergio@mpimaster ~]$ qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch acl_user_enable = True
set queue batch acl_users = sergio
set queue batch resources_default.nodes = 2
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = mpimaster.mycluster
set server acl_hosts += mpimaster
set server acl_hosts += localhost
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 402
set server authorized_users = sergio@mpimaster
set server authorized_users += sergio@mpinode02

Client configuration:

[sergio@mpimaster ~]$ cat /etc/hosts
127.0.0.1       localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
192.168.122.1   mpinode02.mycluster mpinode02
192.168.122.2   mpimaster.mycluster mpimaster mpinode01
You have new mail in /var/spool/mail/sergio
[sergio@mpimaster ~]$ cat /etc/torque/
mom/             pbs_environment  sched/           server_name
[sergio@mpimaster ~]$ cat /etc/torque/server_name
mpimaster.mycluster
[sergio@mpimaster ~]$ cat /etc/torque/pbs_environment
PATH=/bin:/usr/bin
LANG=C
[sergio@mpimaster ~]$ cat /etc/torque/mom/config
# Configuration for pbs_mom.
$pbsserver mpimaster.mycluster

Then I submit the job via mpirun:

[sergio@mpimaster cluster]$ mpirun hello
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
--------------------------------------------------------------------------
[[54064,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: mpimaster.mycluster

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------

If I use a hostfile, it works:

[sergio@mpimaster cluster]$ mpirun --hostfile myhostfile hello

KeyChain 2.6.8; http://www.gentoo.org/proj/en/keychain/
Copyright 2002-2004 Gentoo Foundation; Distributed under the GPL

 * Found existing ssh-agent (1607)
 * Found existing gpg-agent (1690)
 * Known ssh key: /home/sergio/.ssh/id_rsa

librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
--------------------------------------------------------------------------
[[54073,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: mpimaster.mycluster

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Hello World! from process 2 out of 3 on mpinode02.mycluster
Hello World! from process 1 out of 3 on mpinode02.mycluster
Hello World! from process 0 out of 3 on mpimaster.mycluster

Am I doing something wrong?

Thanks in advance!

--
--
Sergio Belkin  http://www.sergiobelkin.com
Watch More TV http://sebelk.blogspot.com
LPIC-2 Certified - http://www.lpi.org

From gus at ldeo.columbia.edu Tue Apr 17 11:38:52 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 17 Apr 2012 13:38:52 -0400
Subject: [torqueusers] torque is working with openmpi?
In-Reply-To:
References:
Message-ID: <4F8DAAAC.2030106@ldeo.columbia.edu>

Hi Sergio

A) Your OpenMPI seems to have been built with InfiniBand support.
However, as the error message says, you don't seem to have
InfiniBand interfaces [or the openib kernel modules are not
loaded].

To prevent OpenMPI from using InfiniBand,
add '-mca btl ^openib'
to your mpirun command line.

A cleaner solution is to build OpenMPI with support only
for the hardware that you have in your machines.
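
A minimal sketch of that command-line workaround, using only the '-mca btl ^openib' option mentioned above. The wrapper name mpirun_no_ib is made up for illustration, and ./hello with -np 2 are placeholders:

```shell
#!/bin/sh
# Hypothetical wrapper: run mpirun with the openib BTL excluded,
# so Open MPI does not try to use InfiniBand hardware.
mpirun_no_ib() {
    # The caret is quoted so that no shell treats it as a special character.
    mpirun -mca btl '^openib' "$@"
}

# Example call (program name and process count are placeholders):
#   mpirun_no_ib -np 2 ./hello
```

Everything after the wrapper name is passed through to mpirun unchanged, so the usual options (such as -np or --hostfile) still apply.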
**

B) Also, to use the OpenMPI-Torque integration you must
submit a job with *qsub*, not directly mpirun!
Torque will assign a list of nodes that will be
subsequently used by the mpirun *inside* the script that
you submitted via qsub.
This way you don't need to add a nodefile
to the mpirun command line.

For instance:

Write a script like this [say my_script]:
#PBS -l nodes=1:ppn=1
#PBS -q batch
#PBS -N hello
cd $PBS_O_WORKDIR
mpirun -np 2 ./hello

Then do:
qsub my_script

**

I hope this helps,
Gus Correa

On 04/17/2012 12:23 PM, Sergio Belkin wrote:
> Hi,
>
> I'm testing torque on Fedora 16. The problem is that jobs are not sent to
> Data:
>
> torque server: mpimaster.mycluster
> torque client: mpinode02.mycluster
>
> [sergio@mpimaster cluster]$ ompi_info | grep tm
>                  MCA ras: tm (MCA v2.0, API v2.0, Component v1.5.4)
>                  MCA plm: tm (MCA v2.0, API v2.0, Component v1.5.4)
>                  MCA ess: tm (MCA v2.0, API v2.0, Component v1.5.4)
>
> torque configuration:
>
> [root@mpimaster sergio]# cat /etc/torque/pbs_environment
> PATH=/bin:/usr/bin
> LANG=C
>
> cat /etc/torque/server_name
> mpimaster.mycluster
>
> [root@mpimaster sergio]# cat /etc/hosts
> 127.0.0.1       localhost.localdomain localhost
> ::1             localhost6.localdomain6 localhost6
> 192.168.122.1   mpinode02.mycluster mpinode02
> 192.168.122.2   mpimaster.mycluster mpimaster mpinode0
>
> cat /var/lib/torque/server_priv/nodes
> mpimaster np=1
> mpinode02 np=2
>
> [sergio@mpimaster ~]$ qmgr -c 'p s'
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch acl_user_enable = True
> set queue batch acl_users = sergio
> set queue batch resources_default.nodes = 2
> set queue batch resources_default.walltime = 01:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> # > set server scheduling = True > set server acl_hosts = mpimaster.mycluster > set server acl_hosts += mpimaster > set server acl_hosts += localhost > set server default_queue = batch > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server next_job_number = 402 > set server authorized_users = sergio at mpimaster > set server authorized_users += sergio at mpinode02 > > > Client configuration: > > [sergio at mpimaster ~]$ cat /etc/hosts > 127.0.0.1 localhost.localdomain localhost > ::1 localhost6.localdomain6 localhost6 > 192.168.122.1 mpinode02.mycluster mpinode02 > 192.168.122.2 mpimaster.mycluster mpimaster mpinode01 > Tiene correo nuevo en /var/spool/mail/sergio > [sergio at mpimaster ~]$ cat /etc/torque/ > mom/ pbs_environment sched/ server_name > [sergio at mpimaster ~]$ cat /etc/torque/server_name > mpimaster.mycluster > [sergio at mpimaster ~]$ cat /etc/torque/pbs_environment > PATH=/bin:/usr/bin > LANG=C > [sergio at mpimaster ~]$ cat /etc/torque/mom/config > # Configuration for pbs_mom. > $pbsserver mpimaster.mycluster > > > Then I submit job via mpirun > > [sergio at mpimaster cluster]$ mpirun hello > librdmacm: couldn't read ABI version. > librdmacm: assuming: 4 > CMA: unable to get RDMA device list > -------------------------------------------------------------------------- > [[54064,1],0]: A high-performance Open MPI point-to-point messaging module > was unable to find any relevant network interfaces: > > Module: OpenFabrics (openib) > Host: mpimaster.mycluster > > Another transport will be used instead, although this may result in > lower performance. 
> -------------------------------------------------------------------------- > > > > If I use hostfile works: > > [sergio at mpimaster cluster]$ mpirun --hostfile myhostfile hello > > KeyChain 2.6.8; http://www.gentoo.org/proj/en/keychain/ > Copyright 2002-2004 Gentoo Foundation; Distributed under the GPL > > * Found existing ssh-agent (1607) > * Found existing gpg-agent (1690) > * Known ssh key: /home/sergio/.ssh/id_rsa > > librdmacm: couldn't read ABI version. > librdmacm: assuming: 4 > CMA: unable to get RDMA device list > -------------------------------------------------------------------------- > [[54073,1],0]: A high-performance Open MPI point-to-point messaging module > was unable to find any relevant network interfaces: > > Module: OpenFabrics (openib) > Host: mpimaster.mycluster > > Another transport will be used instead, although this may result in > lower performance. > -------------------------------------------------------------------------- > Hello World! from process 2 out of 3 on mpinode02.mycluster > Hello World! from process 1 out of 3 on mpinode02.mycluster > Hello World! from process 0 out of 3 on mpimaster.mycluster > > Am I doing something bad? > > Thanks in advance! > From gus at ldeo.columbia.edu Tue Apr 17 11:43:41 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 17 Apr 2012 13:43:41 -0400 Subject: [torqueusers] torque is working with openmpi? In-Reply-To: <4F8DAAAC.2030106@ldeo.columbia.edu> References: <4F8DAAAC.2030106@ldeo.columbia.edu> Message-ID: <4F8DABCD.6030709@ldeo.columbia.edu> Oops: #PBS -l nodes=2 not > #PBS -l nodes=1:ppn=1 On 04/17/2012 01:38 PM, Gus Correa wrote: > Hi Sergio > > A) Your OpenMPI seems to have built with Infinband support. > However, as the error message say, you don't seem to have > Infinband interfaces [or the openib kernel modules are not > loaded]. > > To prevent OpenMPI to use Infiniband, > add '-mca btl ^openib' > to your mpirun command line. 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From sebelk at gmail.com Tue Apr 17 15:12:15 2012
From: sebelk at gmail.com (Sergio Belkin)
Date: Tue, 17 Apr 2012 18:12:15 -0300
Subject: [torqueusers] torque is working with openmpi?
In-Reply-To: <4F8DAAAC.2030106@ldeo.columbia.edu>
References: <4F8DAAAC.2030106@ldeo.columbia.edu>
Message-ID:

2012/4/17 Gus Correa :
> Hi Sergio
>
> A) Your OpenMPI seems to have been built with InfiniBand support.
> However, as the error message says, you don't seem to have
> InfiniBand interfaces [or the openib kernel modules are not
> loaded].
>
> To prevent OpenMPI from using InfiniBand,
> add '-mca btl ^openib'
> to your mpirun command line.
>
> A cleaner solution is to build OpenMPI with support only
> for the hardware that you have in your machines.

Thanks for the hint!

> **
>
> B) Also, to use the OpenMPI-Torque integration you must
> submit a job with *qsub*, not directly mpirun!
> Torque will assign a list of nodes that will be
> subsequently used by the mpirun *inside* the script that
> you submitted via qsub.
> This way you don't need to add a nodefile
> to the mpirun command line.
>
> For instance.
>
> Write a script like this [say my_script]:
> #PBS -l nodes=1:ppn=1
> #PBS -q batch
> #PBS -N hello
> cd $PBS_O_WORKDIR
> mpirun -np 2 ./hello
>
> Then do:
> qsub my_script

Thanks for your help. I've got the idea, and it worked!

> **
>
> I hope this helps,
> Gus Correa
>
> On 04/17/2012 12:23 PM, Sergio Belkin wrote:
>> Hi,
>>
>> I'm testing torque on Fedora 16. The problem is that jobs are not sent to
>> Data:
>>
>> torque server: mpimaster.mycluster
>> torque client: mpinode02.mycluster
>>
>> [sergio@mpimaster cluster]$ ompi_info | grep tm
>>                  MCA ras: tm (MCA v2.0, API v2.0, Component v1.5.4)
>>                  MCA plm: tm (MCA v2.0, API v2.0, Component v1.5.4)
>>                  MCA ess: tm (MCA v2.0, API v2.0, Component v1.5.4)
>>
>> torque configuration:
>>
>> [root@mpimaster sergio]# cat /etc/torque/pbs_environment
>> PATH=/bin:/usr/bin
>> LANG=C
>>
>> cat /etc/torque/server_name
>> mpimaster.mycluster
>>
>> [root@mpimaster sergio]# cat /etc/hosts
>> 127.0.0.1       localhost.localdomain localhost
>> ::1             localhost6.localdomain6 localhost6
>> 192.168.122.1   mpinode02.mycluster mpinode02
>> 192.168.122.2   mpimaster.mycluster mpimaster mpinode0
>>
>> cat /var/lib/torque/server_priv/nodes
>> mpimaster np=1
>> mpinode02 np=2
>>
>> [sergio@mpimaster ~]$ qmgr -c 'p s'
>> #
>> # Create queues and set their attributes.
>> #
>> #
>> # Create and define queue batch
>> #
>> create queue batch
>> set queue batch queue_type = Execution
>> set queue batch acl_user_enable = True
>> set queue batch acl_users = sergio
>> set queue batch resources_default.nodes = 2
>> set queue batch resources_default.walltime = 01:00:00
>> set queue batch enabled = True
>> set queue batch started = True
>> #
>> # Set server attributes.
>> #
>> set server scheduling = True
>> set server acl_hosts = mpimaster.mycluster
>> set server acl_hosts += mpimaster
>> set server acl_hosts += localhost
>> set server default_queue = batch
>> set server log_events = 511
>> set server mail_from = adm
>> set server scheduler_iteration = 600
>> set server node_check_rate = 150
>> set server tcp_timeout = 6
>> set server next_job_number = 402
>> set server authorized_users = sergio@mpimaster
>> set server authorized_users += sergio@mpinode02
>>
>> Client configuration:
>>
>> [sergio@mpimaster ~]$ cat /etc/hosts
>> 127.0.0.1       localhost.localdomain localhost
>> ::1             localhost6.localdomain6 localhost6
>> 192.168.122.1   mpinode02.mycluster mpinode02
>> 192.168.122.2   mpimaster.mycluster mpimaster mpinode01
>> You have new mail in /var/spool/mail/sergio
>> [sergio@mpimaster ~]$ cat /etc/torque/
>> mom/             pbs_environment  sched/           server_name
>> [sergio@mpimaster ~]$ cat /etc/torque/server_name
>> mpimaster.mycluster
>> [sergio@mpimaster ~]$ cat /etc/torque/pbs_environment
>> PATH=/bin:/usr/bin
>> LANG=C
>> [sergio@mpimaster ~]$ cat /etc/torque/mom/config
>> # Configuration for pbs_mom.
>> $pbsserver mpimaster.mycluster
>>
>> Then I submit job via mpirun
>>
>> [sergio@mpimaster cluster]$ mpirun hello
>> librdmacm: couldn't read ABI version.
>> librdmacm: assuming: 4
>> CMA: unable to get RDMA device list
>> --------------------------------------------------------------------------
>> [[54064,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: OpenFabrics (openib)
>>   Host: mpimaster.mycluster
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>>
>> If I use hostfile works:
>>
>> [sergio@mpimaster cluster]$ mpirun --hostfile myhostfile hello
>>
>> KeyChain 2.6.8; http://www.gentoo.org/proj/en/keychain/
>> Copyright 2002-2004 Gentoo Foundation; Distributed under the GPL
>>
>>  * Found existing ssh-agent (1607)
>>  * Found existing gpg-agent (1690)
>>  * Known ssh key: /home/sergio/.ssh/id_rsa
>>
>> librdmacm: couldn't read ABI version.
>> librdmacm: assuming: 4
>> CMA: unable to get RDMA device list
>> --------------------------------------------------------------------------
>> [[54073,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: OpenFabrics (openib)
>>   Host: mpimaster.mycluster
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> Hello World! from process 2 out of 3 on mpinode02.mycluster
>> Hello World!
from process 1 out of 3 on mpinode02.mycluster >> Hello World! from process 0 out of 3 on mpimaster.mycluster >> >> Am I doing something bad? >> >> Thanks in advance! >> > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- -- Sergio Belkin ?http://www.sergiobelkin.com Watch More TV http://sebelk.blogspot.com LPIC-2 Certified - http://www.lpi.org From agarwal1975 at gmail.com Wed Apr 18 08:22:12 2012 From: agarwal1975 at gmail.com (Ashish Agarwal) Date: Wed, 18 Apr 2012 10:22:12 -0400 Subject: [torqueusers] can user limit max jobs running Message-ID: I've read some posts about how an admin can limit the maximum number of jobs a user can run. But I'd like to know if a user can him/herself limit the number of jobs *running*, ideally jobs within the same job array. The use case is that I would like to submit 500 I/O intensive jobs. Each job only requires a single core, so potentially all jobs could start running on our cluster overloading the storage system. Thus, I'd like to say "amongst all 500 jobs in this job array, run at most 20 at a time". Is that possible? Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120418/fe5aa8c3/attachment.html From sm4082 at nyu.edu Wed Apr 18 08:52:41 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Wed, 18 Apr 2012 10:52:41 -0400 Subject: [torqueusers] can user limit max jobs running In-Reply-To: References: Message-ID: <1BD0A547-E150-4DEE-9D63-9F47156AFBAB@nyu.edu> Hi Ashish, I don't know whether user can do this by issuing one simple command. Easiest way would be to make second set of 20 jobs depend on first 20. So at any time there would be only 20 jobs running. You can write a simple script to achieve this. If you want I can show you how to do it. 
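The dependency idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the job IDs 101.headnode and 102.headnode and the helper name depend_option are made up, and the sketch prints the qsub command instead of submitting anything. The -W depend=afterany:... form is the usual Torque dependency syntax.

```shell
#!/bin/sh
# Hypothetical helper: join job IDs with ':' into a Torque dependency option.
# A job submitted with this option waits until all listed jobs have ended.
depend_option() {
    ids=""
    for id in "$@"; do
        ids="${ids}:${id}"
    done
    echo "-W depend=afterany${ids}"
}

# Dry run with made-up job IDs from a previous batch: print the command only.
echo "qsub $(depend_option 101.headnode 102.headnode) script.sh"
```

In a real script the IDs would be captured from the output of each earlier qsub call, one batch of 20 at a time.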
Anyway, check our qsub tutorial for dependency and other stuff like getting job numbers into variables so that you can use them in scripts.

Sreedhar.

From L.S.Lowe at bham.ac.uk Wed Apr 18 09:00:20 2012
From: L.S.Lowe at bham.ac.uk (Lawrence Lowe)
Date: Wed, 18 Apr 2012 16:00:20 +0100 (BST)
Subject: [torqueusers] can user limit max jobs running
In-Reply-To: <1BD0A547-E150-4DEE-9D63-9F47156AFBAB@nyu.edu>
References: <1BD0A547-E150-4DEE-9D63-9F47156AFBAB@nyu.edu>
Message-ID:

Hi Ashish, in man qsub, it gives an example:

qsub script.sh -t 0-299%5

This sets the slot limit to 5. Only 5 jobs from this array can run at the same time.

Hope that's what you're looking for.

Tel: 0121 414 4621 Fax: 0121 414 6709 Email: L.S.Lowe at bham.ac.uk

From ataufer at adaptivecomputing.com Wed Apr 18 09:26:58 2012
From: ataufer at adaptivecomputing.com (Al Taufer)
Date: Wed, 18 Apr 2012 09:26:58 -0600
Subject: [torqueusers] can user limit max jobs running
In-Reply-To:
References:
Message-ID:

It sounds like slot limits on a job array would do what you want. The man page for qsub says

An optional slot limit can be specified to limit the amount of jobs that can run concurrently in the job array. The default value is unlimited. The slot limit must be the last thing specified in the array_request and is delimited from the array by a percent sign (%).

qsub script.sh -t 0-299%5

This sets the slot limit to 5. Only 5 jobs from this array can run at the same time.
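Applied to the 500-job case from the original question, the array request could be built along these lines. This is a sketch, not a tested recipe: array_request is a hypothetical helper name, the 0-based numbering simply mirrors the man page example, and the command is printed rather than run.

```shell
#!/bin/sh
# Hypothetical helper: build an array_request "0-(N-1)%LIMIT" for N sub-jobs
# of which at most LIMIT may run at the same time.
array_request() {
    total=$1
    limit=$2
    echo "0-$((total - 1))%${limit}"
}

# Dry run: 500 sub-jobs, at most 20 running concurrently.
echo "qsub script.sh -t $(array_request 500 20)"
```

The slot limit stays the last element of the array_request, as the man page text requires.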
On Wed, Apr 18, 2012 at 8:22 AM, Ashish Agarwal wrote: > I've read some posts about how an admin can limit the maximum number of jobs > a user can run. But I'd like to know if a user can him/herself limit the > number of jobs *running*, ideally jobs within the same job array. > > The use case is that I would like to submit 500 I/O intensive jobs. Each job > only requires a single core, so potentially all jobs could start running on > our cluster overloading the storage system. Thus, I'd like to say "amongst > all 500 jobs in this job array, run at most 20 at a time". Is that possible? > > Thank you. > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From rsvancara at wsu.edu Wed Apr 18 09:44:33 2012 From: rsvancara at wsu.edu (Svancara, Randall) Date: Wed, 18 Apr 2012 15:44:33 +0000 Subject: [torqueusers] NUMA -- A first try Message-ID: <1F880D7A2494B346B5AB96481EAE704A15A536@EXMB-03.ad.wsu.edu> Hi, I have compiled torque 3.0.4 with NUMA support per this document. 
http://www.clusterresources.com/torquedocs21/1.7torqueonnuma.shtml I have created the server_priv/nodes and mom_priv/mom.layout file Here are the versions of software: [root at node11 bin]# pbs_mom -v version: 3.0.4 [root at mgt1 server_priv]# pbs_server -v version: 3.0.4 lstopo shows: [root at node11 bin]# ./lstopo Machine (24GB) NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB) L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0) L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1) L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2) L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3) L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4) L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5) NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB) L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6) L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7) L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8) L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9) L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10) L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11) Mom.layout: cpus=0-5 mem=0 cpus=6-11 mem=1 server_priv/nodes: node11 num_numa_nodes=2 compute I restart pbs_server on management node and pbs_mom on node11. 
pbsnodes -a shows node11-0 state = down np = 0 properties = compute ntype = cluster mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 node11-1 state = down np = 0 properties = compute ntype = cluster mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 mom_log on node11 has: 04/17/2012 19:45:01;0002; pbs_mom;Svr;Log;Log opened 04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 0 04/17/2012 19:45:01;0002; pbs_mom;Svr;setpbsserver;mgt1 04/17/2012 19:45:01;0002; pbs_mom;Svr;mom_server_add;server mgt1 added 04/17/2012 19:45:01;0002; pbs_mom;Svr;setremchkptdirlist;added RemChkptDir[0] '/fastscratch/tmp' 04/17/2012 19:45:01;0002; pbs_mom;Svr;settmpdir;/fastscratch/tmp 04/17/2012 19:45:01;0002; pbs_mom;Svr;setloglevel;7 04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/home /home 04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/home /home 04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/scratch /scratch 04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/scratch /scratch 04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$spool_as_final_name true 04/17/2012 19:45:01;0002; pbs_mom;Svr;spoolasfinalname;true 04/17/2012 19:45:01;0002; pbs_mom;n/a;initialize;independent 04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_open_poll;started 04/17/2012 19:45:01;0080; pbs_mom;Svr;mom_get_sample;proc_array load started 04/17/2012 19:45:01;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=202 04/17/2012 19:45:01;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs 04/17/2012 19:45:01;0001; pbs_mom;Svr;pbs_mom;init_abort_jobs: recover=2 04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Is up 04/17/2012 19:45:01;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1334684029 04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 7 04/17/2012 19:45:01;0002; 
pbs_mom;Svr;pbs_mom;checking for old pbs_mom logs in dir '/var/spool/torque/mom_logs' (older than 1 days) 04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: trying to open RPP conn to mgt1 port 15001 04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: added connection to mgt1 port 15001 04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1 04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello 04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello done. Sent count = 1 04/17/2012 19:45:03;0008; pbs_mom;Job;do_rpp;got an inter-server request 04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;stream 0 version 2 04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received My problem as illustrated from the pbsnodes command above is that node11 is down. And running strace on the pbs_mom process does not indicate any access to the mom.layout file? So did I really compile NUMA support? I can see references to NUMA in the Makefile for torque and the config.log definitely has the right parameters: $ ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp Can anyone provide further illumination on my already dark dreary day? Thanks, Randall Svancara High Performance Computing Systems Administrator Washington State University -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120418/36167409/attachment-0001.html

From agarwal1975 at gmail.com Wed Apr 18 10:20:59 2012
From: agarwal1975 at gmail.com (Ashish Agarwal)
Date: Wed, 18 Apr 2012 12:20:59 -0400
Subject: [torqueusers] can user limit max jobs running
In-Reply-To:
References:
Message-ID:

The slot limit is exactly what I'm looking for, but:

* It is not mentioned in our man pages. I read now the man page on clusterresources.com, which is quite different than our installed man page.

* It works on one of our clusters, but on a second one I get "qsub: Bad Job Array Request." Does this feature have to be enabled? Or is it only available from a certain version? The cluster it works on is qsub version: 2.5.10, and the one it does not work on is version: 2.4.12.

From ataufer at adaptivecomputing.com Wed Apr 18 10:34:49 2012
From: ataufer at adaptivecomputing.com (Al Taufer)
Date: Wed, 18 Apr 2012 10:34:49 -0600
Subject: [torqueusers] can user limit max jobs running
In-Reply-To:
References:
Message-ID:

It was not in the 2.4 version. It first showed up sometime in the 2.5 version and should be shown in the 2.5 man pages.

From agarwal1975 at gmail.com Wed Apr 18 10:37:50 2012
From: agarwal1975 at gmail.com (Ashish Agarwal)
Date: Wed, 18 Apr 2012 12:37:50 -0400
Subject: [torqueusers] can user limit max jobs running
In-Reply-To:
References:
Message-ID:

Ok. Thanks for the help!

From gus at ldeo.columbia.edu Wed Apr 18 10:40:11 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Wed, 18 Apr 2012 12:40:11 -0400
Subject: [torqueusers] can user limit max jobs running
In-Reply-To:
References:
Message-ID: <4F8EEE6B.2020503@ldeo.columbia.edu>

On 04/18/2012 12:20 PM, Ashish Agarwal wrote:
> The slot limit is exactly what I'm looking for, but:
>
> * It is not mentioned in our man pages. I read now the man page on
> clusterresources.com, which is quite
> different than our installed man page.
>
> * It works on one of our clusters, but on a second one I get "qsub: Bad
> Job Array Request." Does this feature have to be enabled? Or is it only
> available from a certain version? The cluster it works on is qsub
> version: 2.5.10, and the one it does not work on is version: 2.4.12.
>

As far as I can tell, by looking at 2.4.11 and 2.5.9, this feature is absent in 2.4 and isn't mentioned on its man pages, whereas the 2.5 man page has the "An optional slot limit ... " additional text under '-t array_request'.
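A submission script could guard against the older version with a check like the one below. It is a sketch only: supports_slot_limit is a hypothetical helper, and the 2.5 threshold is taken from this thread (the exact point release that introduced slot limits is not stated here). The version strings are passed in by hand rather than read from qsub.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a Torque version string such as "2.5.10"
# is 2.5 or later, the threshold reported in this thread for slot limits.
supports_slot_limit() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # second version component
    [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 5 ]; }
}

# The two versions mentioned in the thread.
for v in 2.4.12 2.5.10; do
    if supports_slot_limit "$v"; then
        echo "$v: slot limit should be accepted"
    else
        echo "$v: expect 'qsub: Bad Job Array Request.'"
    fi
done
```

On a cluster that fails the check, the job-dependency approach suggested earlier in the thread remains an option.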

From dbeer at adaptivecomputing.com Wed Apr 18 13:15:39 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 18 Apr 2012 13:15:39 -0600
Subject: [torqueusers] NUMA -- A first try
In-Reply-To: <1F880D7A2494B346B5AB96481EAE704A15A536@EXMB-03.ad.wsu.edu>
References: <1F880D7A2494B346B5AB96481EAE704A15A536@EXMB-03.ad.wsu.edu>
Message-ID:

Randall,

You did compile numa support. You can know this because you get node11-0 and node11-1 in your pbsnodes output. Is your mom.layout file in the correct location? It should be in mom_priv/mom.layout.
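For reference, the two mom.layout lines Randall posted can be generated with a few lines of shell. This is a sketch that assumes every NUMA node owns a contiguous, equally sized block of cores, as in the lstopo output earlier in the thread; mom_layout is a hypothetical helper, not part of Torque.

```shell
#!/bin/sh
# Hypothetical helper: print one "cpus=FIRST-LAST mem=N" line per NUMA node,
# assuming contiguous core numbering and the same core count on every node.
mom_layout() {
    nodes=$1
    cores=$2
    n=0
    while [ "$n" -lt "$nodes" ]; do
        first=$((n * cores))
        last=$((first + cores - 1))
        echo "cpus=${first}-${last} mem=${n}"
        n=$((n + 1))
    done
}

# Two NUMA nodes with six cores each, matching node11 in this thread.
mom_layout 2 6
```

The output would still have to be written to mom_priv/mom.layout on the compute node by hand.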

David

--
David Beer | Software Engineer
Adaptive Computing

From rsvancara at wsu.edu Wed Apr 18 13:26:55 2012
From: rsvancara at wsu.edu (Svancara, Randall)
Date: Wed, 18 Apr 2012 19:26:55 +0000
Subject: [torqueusers] NUMA -- A first try
In-Reply-To:
References: <1F880D7A2494B346B5AB96481EAE704A15A536@EXMB-03.ad.wsu.edu>
Message-ID: <1F880D7A2494B346B5AB96481EAE704A15A744@EXMB-03.ad.wsu.edu>

Hey, good to know that I did do something correct. I have validated the mom.layout file is in /var/spool/torque/mom_priv/mom.layout.

4 -rw-r--r-- 1 root root 185 Apr 17 19:44 config
4 -rwxr-xr-x 1 root root 708 Apr 5 2011 epilogue
4 -rwxrwxrwx 1 root root 708 Apr 5 2011 epilogue.sh
0 drwxr-x--x 2 root root 40 Apr 17 10:33 jobs
4 -rwxr--r-- 1 root root 31 Apr 17 19:23 mom.layout
4 -rwxr--r-- 1 root root 50 Apr 17 19:20 mom.layout_bak
4 -rw-r--r-- 1 root root 32 Apr 17 15:26 mom.layout_old
4 -rw-r--r-- 1 root root 7 Apr 17 19:45 mom.lock
4 -rwxr-xr-x 1 root root 527 Apr 26 2011 prologue
4 -rwxrwxrwx 1 root root 527 Apr 5 2011 prologue.sh
4 -rwxr-xr-x 1 root root 203 Apr 5 2011 setperms.sh

Thanks,

Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 12:16 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try

Randall,

You did compile numa support. You can know this because you get node11-0 and node11-1 in your pbsnodes output. Is your mom.layout file in the correct location? It should be in mom_priv/mom.layout.

David

On Wed, Apr 18, 2012 at 9:44 AM, Svancara, Randall wrote:

Hi,

I have compiled torque 3.0.4 with NUMA support per this document.

http://www.clusterresources.com/torquedocs21/1.7torqueonnuma.shtml

I have created the server_priv/nodes and mom_priv/mom.layout file

Here are the versions of software:

[root at node11 bin]# pbs_mom -v
version: 3.0.4

[root at mgt1 server_priv]# pbs_server -v
version: 3.0.4

lstopo shows:

[root at node11 bin]# ./lstopo
Machine (24GB)
NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)

Mom.layout:

cpus=0-5 mem=0
cpus=6-11 mem=1

server_priv/nodes:
node11 num_numa_nodes=2 compute

I restart pbs_server on management node and pbs_mom on node11.
pbsnodes -a shows

node11-0
state = down
np = 0
properties = compute
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0

node11-1
state = down
np = 0
properties = compute
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0

mom_log on node11 has:

04/17/2012 19:45:01;0002; pbs_mom;Svr;Log;Log opened
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 0
04/17/2012 19:45:01;0002; pbs_mom;Svr;setpbsserver;mgt1
04/17/2012 19:45:01;0002; pbs_mom;Svr;mom_server_add;server mgt1 added
04/17/2012 19:45:01;0002; pbs_mom;Svr;setremchkptdirlist;added RemChkptDir[0] '/fastscratch/tmp'
04/17/2012 19:45:01;0002; pbs_mom;Svr;settmpdir;/fastscratch/tmp
04/17/2012 19:45:01;0002; pbs_mom;Svr;setloglevel;7
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/home /home
04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/home /home
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/scratch /scratch
04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/scratch /scratch
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$spool_as_final_name true
04/17/2012 19:45:01;0002; pbs_mom;Svr;spoolasfinalname;true
04/17/2012 19:45:01;0002; pbs_mom;n/a;initialize;independent
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_open_poll;started
04/17/2012 19:45:01;0080; pbs_mom;Svr;mom_get_sample;proc_array load started
04/17/2012 19:45:01;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=202
04/17/2012 19:45:01;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
04/17/2012 19:45:01;0001; pbs_mom;Svr;pbs_mom;init_abort_jobs: recover=2
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Is up
04/17/2012 19:45:01;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1334684029
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 7
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;checking for old pbs_mom logs in dir '/var/spool/torque/mom_logs' (older than 1 days)
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: trying to open RPP conn to mgt1 port 15001
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: added connection to mgt1 port 15001
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1
04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello
04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello done. Sent count = 1
04/17/2012 19:45:03;0008; pbs_mom;Job;do_rpp;got an inter-server request
04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;stream 0 version 2
04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received

My problem as illustrated from the pbsnodes command above is that node11 is down. And running strace on the pbs_mom process does not indicate any access to the mom.layout file?

So did I really compile NUMA support? I can see references to NUMA in the Makefile for torque and the config.log definitely has the right parameters:

$ ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp

Can anyone provide further illumination on my already dark dreary day?

Thanks,

Randall Svancara
High Performance Computing Systems Administrator
Washington State University

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

-- 
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120418/6cdca52d/attachment-0001.html
From dbeer at adaptivecomputing.com Wed Apr 18 15:14:05 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 18 Apr 2012 15:14:05 -0600
Subject: [torqueusers] NUMA -- A first try
In-Reply-To: <1F880D7A2494B346B5AB96481EAE704A15A744@EXMB-03.ad.wsu.edu>
References: <1F880D7A2494B346B5AB96481EAE704A15A536@EXMB-03.ad.wsu.edu> <1F880D7A2494B346B5AB96481EAE704A15A744@EXMB-03.ad.wsu.edu>
Message-ID:

Randall,

After looking closer at your logs, it appears that the pbs_mom binary wasn't numa enabled. If it were, you'd have a message saying:

Setting up this mom to function as %d numa nodes - in your case that %d would be a 2.

or you'd have one of these error messages:

Malformed mom.layout file, line:\n%s\n
Unable to read the layout file in %s

David

-- 
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120418/eb2a3938/attachment-0001.html
From rsvancara at wsu.edu Wed Apr 18 15:46:51 2012
From: rsvancara at wsu.edu (Svancara, Randall)
Date: Wed, 18 Apr 2012 21:46:51 +0000
Subject: [torqueusers] NUMA -- A first try
In-Reply-To:
References: <1F880D7A2494B346B5AB96481EAE704A15A536@EXMB-03.ad.wsu.edu> <1F880D7A2494B346B5AB96481EAE704A15A744@EXMB-03.ad.wsu.edu>
Message-ID: <1F880D7A2494B346B5AB96481EAE704A15B97C@EXMB-03.ad.wsu.edu>

Ok, well this gives me starting place at least.

Are there additional libraries I need to supply?

I build the software the following way:

./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp
make rpm

I believe this is the relevant section out of the config.log.

configure:22026: $? = 0
configure:22029: test -s conftest.o
configure:22032: $? = 0
configure:22043: result: yes
configure:22257: checking whether to allow geometry requests
configure:22274: result: no
configure:22285: checking whether to support NUMA systems
configure:22288: result: yes
configure:22313: checking whether to enable libcpuset support
configure:22399: result: no
configure:22407: checking whether to enable memacct support
configure:22416: result: no
configure:22510: checking whether add memory alignment flags
configure:22517: result: no
configure:22604: checking whether to build BLCR support
configure:22606: result: yes
configure:22627: checking for cr_init in -lcr
configure:22657: gcc -o conftest -g -O2 -D_LARGEFILE64_SOURCE -DNUMA_SUPPORT -L/usr/lib -lcr conftest.c -lcr >&5

Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120418/575f81a7/attachment-0001.html
From dbeer at adaptivecomputing.com Wed Apr 18 15:50:37 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 18 Apr 2012 15:50:37 -0600
Subject: [torqueusers] NUMA -- A first try
In-Reply-To: <1F880D7A2494B346B5AB96481EAE704A15B97C@EXMB-03.ad.wsu.edu>
References: <1F880D7A2494B346B5AB96481EAE704A15A536@EXMB-03.ad.wsu.edu> <1F880D7A2494B346B5AB96481EAE704A15A744@EXMB-03.ad.wsu.edu> <1F880D7A2494B346B5AB96481EAE704A15B97C@EXMB-03.ad.wsu.edu>
Message-ID:

There are no additional libraries that you need to supply.

On Wed, Apr 18, 2012 at 3:46 PM, Svancara, Randall wrote:

> Ok, well this gives me starting place at least.
>
> Are there additional libraries I need to supply?
>
> I build the software the following way:
>
> ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp
> make rpm
>
> I believe this is the relevant section out of the config.log.
>
> configure:22026: $? = 0
> configure:22029: test -s conftest.o
> configure:22032: $? 
= 0
> configure:22043: result: yes
> configure:22257: checking whether to allow geometry requests
> configure:22274: result: no
> configure:22285: checking whether to support NUMA systems
> configure:22288: result: yes
> configure:22313: checking whether to enable libcpuset support
> configure:22399: result: no
> configure:22407: checking whether to enable memacct support
> configure:22416: result: no
> configure:22510: checking whether add memory alignment flags
> configure:22517: result: no
> configure:22604: checking whether to build BLCR support
> configure:22606: result: yes
> configure:22627: checking for cr_init in -lcr
> configure:22657: gcc -o conftest -g -O2 -D_LARGEFILE64_SOURCE -DNUMA_SUPPORT -L/usr/lib -lcr conftest.c -lcr >&5
>
> Randall Svancara
> High Performance Computing Systems Administrator
> Washington State University
> 509-335-3039
>
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
> Sent: Wednesday, April 18, 2012 2:14 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] NUMA -- A first try
>
> Randall,
>
> After looking closer at your logs, it appears that the pbs_mom binary wasn't numa enabled. If it were, you'd have a message saying:
>
> Setting up this mom to function as %d numa nodes - in your case that %d would be a 2.
>
> or you'd have one of these error messages:
>
> Malformed mom.layout file, line:\n%s\n
> Unable to read the layout file in %s
>
> David
>
> On Wed, Apr 18, 2012 at 1:26 PM, Svancara, Randall wrote:
>
> Hey, good to know that I did do something correct. I have validated the mom.layout file is in /var/spool/torque/mom_priv/mom.layout.
>
> 4 -rw-r--r-- 1 root root 185 Apr 17 19:44 config
> 4 -rwxr-xr-x 1 root root 708 Apr 5 2011 epilogue
> 4 -rwxrwxrwx 1 root root 708 Apr 5 2011 epilogue.sh
> 0 drwxr-x--x 2 root root 40 Apr 17 10:33 jobs
> 4 -rwxr--r-- 1 root root 31 Apr 17 19:23 mom.layout
> 4 -rwxr--r-- 1 root root 50 Apr 17 19:20 mom.layout_bak
> 4 -rw-r--r-- 1 root root 32 Apr 17 15:26 mom.layout_old
> 4 -rw-r--r-- 1 root root 7 Apr 17 19:45 mom.lock
> 4 -rwxr-xr-x 1 root root 527 Apr 26 2011 prologue
> 4 -rwxrwxrwx 1 root root 527 Apr 5 2011 prologue.sh
> 4 -rwxr-xr-x 1 root root 203 Apr 5 2011 setperms.sh
>
> Thanks,
>
> Randall Svancara
> High Performance Computing Systems Administrator
> Washington State University
> 509-335-3039
>
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
> Sent: Wednesday, April 18, 2012 12:16 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] NUMA -- A first try
>
> Randall,
>
> You did compile numa support. You can know this because you get node11-0 and node11-1 in your pbsnodes output. Is your mom.layout file in the correct location? It should be in mom_priv/mom.layout.
>
> David
>
> On Wed, Apr 18, 2012 at 9:44 AM, Svancara, Randall wrote:
>
> Hi,
>
> I have compiled torque 3.0.4 with NUMA support per this document.
>
> http://www.clusterresources.com/torquedocs21/1.7torqueonnuma.shtml
>
> I have created the server_priv/nodes and mom_priv/mom.layout file
>
> Here are the versions of software:
>
> [root at node11 bin]# pbs_mom -v
> version: 3.0.4
>
> [root at mgt1 server_priv]# pbs_server -v
> version: 3.0.4
>
> lstopo shows:
>
> [root at node11 bin]# ./lstopo
> Machine (24GB)
> NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
> L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
> L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
> L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
> L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
> L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
> L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
> NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
> L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
> L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
> L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
> L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
> L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
> L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
>
> Mom.layout:
>
> cpus=0-5 mem=0
> cpus=6-11 mem=1
>
> server_priv/nodes:
> node11 num_numa_nodes=2 compute
>
> I restart pbs_server on management node and pbs_mom on node11.
> pbsnodes -a shows
>
> node11-0
> state = down
> np = 0
> properties = compute
> ntype = cluster
> mom_service_port = 15002
> mom_manager_port = 15003
> gpus = 0
>
> node11-1
> state = down
> np = 0
> properties = compute
> ntype = cluster
> mom_service_port = 15002
> mom_manager_port = 15003
> gpus = 0
>
> mom_log on node11 has:
>
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;Log;Log opened
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 0
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;setpbsserver;mgt1
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;mom_server_add;server mgt1 added
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;setremchkptdirlist;added RemChkptDir[0] '/fastscratch/tmp'
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;settmpdir;/fastscratch/tmp
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;setloglevel;7
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/home /home
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/home /home
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/scratch /scratch
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/scratch /scratch
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$spool_as_final_name true
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;spoolasfinalname;true
> 04/17/2012 19:45:01;0002; pbs_mom;n/a;initialize;independent
> 04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_open_poll;started
> 04/17/2012 19:45:01;0080; pbs_mom;Svr;mom_get_sample;proc_array load started
> 04/17/2012 19:45:01;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=202
> 04/17/2012 19:45:01;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
> 04/17/2012 19:45:01;0001; pbs_mom;Svr;pbs_mom;init_abort_jobs: recover=2
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Is up
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1334684029
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 7
> 04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;checking for old pbs_mom logs in dir '/var/spool/torque/mom_logs' (older than 1 days)
> 04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: trying to open RPP conn to mgt1 port 15001
> 04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: added connection to mgt1 port 15001
> 04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1
> 04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello
> 04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello done. Sent count = 1
> 04/17/2012 19:45:03;0008; pbs_mom;Job;do_rpp;got an inter-server request
> 04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;stream 0 version 2
> 04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
>
> My problem as illustrated from the pbsnodes command above is that node11 is down. And running strace on the pbs_mom process does not indicate any access to the mom.layout file?
>
> So did I really compile NUMA support? 
I can see references to NUMA in the > Makefile for torque and the config.log definitely has the right parameters: > **** > > **** > > $ ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support > --disable-gui --enable-blcr --with-default-server=mgt1 > --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp**** > > **** > > Can anyone provide further illumination on my already dark dreary day?**** > > **** > > Thanks,**** > > **** > > Randall Svancara**** > > High Performance Computing Systems Administrator**** > > Washington State University**** > > **** > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers**** > > > > **** > > **** > > -- **** > > David Beer | Software Engineer**** > > Adaptive Computing**** > > **** > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers**** > > > > **** > > ** ** > > -- **** > > David Beer | Software Engineer**** > > Adaptive Computing**** > > ** ** > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120418/14766b04/attachment-0001.html

From rsvancara at wsu.edu Wed Apr 18 17:29:59 2012
From: rsvancara at wsu.edu (Svancara, Randall)
Date: Wed, 18 Apr 2012 23:29:59 +0000
Subject: [torqueusers] NUMA -- A first try
In-Reply-To:
References: <1F880D7A2494B346B5AB96481EAE704A15A536@EXMB-03.ad.wsu.edu> <1F880D7A2494B346B5AB96481EAE704A15A744@EXMB-03.ad.wsu.edu> <1F880D7A2494B346B5AB96481EAE704A15B97C@EXMB-03.ad.wsu.edu>
Message-ID: <1F880D7A2494B346B5AB96481EAE704A15BA3F@EXMB-03.ad.wsu.edu>

Just a heads up, the reason NUMA was not being built in the RPM is that the buildutils/torque.spec.in file does not include the %{ac_with_numa} parameter in the ./configure section. Otherwise a regular build would work fine.

187c184
< --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} %{ac_with_numa} \
---
> --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} \

I could not exactly figure it out, as I am still learning about spec files, but these lines may also need to change:

19c19
< #%bcond_with blcr
---
> %bcond_with blcr
25c25
< #%bcond_with numa
---
> %bcond_with numa
35,37d34
< %bcond_without bclr
< %bcond_without numa
<

I am not sure whether ./configure overrides these values or not.

Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 2:51 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try

There are no additional libraries that you need to supply.

On Wed, Apr 18, 2012 at 3:46 PM, Svancara, Randall wrote:

Ok, well this gives me a starting place at least.

Are there additional libraries I need to supply?
I build the software the following way:

./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp
make rpm

I believe this is the relevant section of the config.log.

configure:22026: $? = 0
configure:22029: test -s conftest.o
configure:22032: $? = 0
configure:22043: result: yes
configure:22257: checking whether to allow geometry requests
configure:22274: result: no
configure:22285: checking whether to support NUMA systems
configure:22288: result: yes
configure:22313: checking whether to enable libcpuset support
configure:22399: result: no
configure:22407: checking whether to enable memacct support
configure:22416: result: no
configure:22510: checking whether add memory alignment flags
configure:22517: result: no
configure:22604: checking whether to build BLCR support
configure:22606: result: yes
configure:22627: checking for cr_init in -lcr
configure:22657: gcc -o conftest -g -O2 -D_LARGEFILE64_SOURCE -DNUMA_SUPPORT -L/usr/lib -lcr conftest.c -lcr >&5

Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 2:14 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try

Randall,

After looking closer at your logs, it appears that the pbs_mom binary wasn't numa enabled. If it were, you'd have a message saying:

Setting up this mom to function as %d numa nodes

(in your case that %d would be a 2), or you'd have one of these error messages:

Malformed mom.layout file, line:\n%s\n
Unable to read the layout file in %s

David

On Wed, Apr 18, 2012 at 1:26 PM, Svancara, Randall wrote:

Hey, good to know that I did do something correct. I have validated that the mom.layout file is in /var/spool/torque/mom_priv/mom.layout.
4 -rw-r--r-- 1 root root 185 Apr 17 19:44 config
4 -rwxr-xr-x 1 root root 708 Apr 5 2011 epilogue
4 -rwxrwxrwx 1 root root 708 Apr 5 2011 epilogue.sh
0 drwxr-x--x 2 root root 40 Apr 17 10:33 jobs
4 -rwxr--r-- 1 root root 31 Apr 17 19:23 mom.layout
4 -rwxr--r-- 1 root root 50 Apr 17 19:20 mom.layout_bak
4 -rw-r--r-- 1 root root 32 Apr 17 15:26 mom.layout_old
4 -rw-r--r-- 1 root root 7 Apr 17 19:45 mom.lock
4 -rwxr-xr-x 1 root root 527 Apr 26 2011 prologue
4 -rwxrwxrwx 1 root root 527 Apr 5 2011 prologue.sh
4 -rwxr-xr-x 1 root root 203 Apr 5 2011 setperms.sh

Thanks,

Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 12:16 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try

Randall,

You did compile numa support. You can tell because you get node11-0 and node11-1 in your pbsnodes output. Is your mom.layout file in the correct location? It should be in mom_priv/mom.layout.

David

On Wed, Apr 18, 2012 at 9:44 AM, Svancara, Randall wrote:

Hi,

I have compiled torque 3.0.4 with NUMA support per this document.
http://www.clusterresources.com/torquedocs21/1.7torqueonnuma.shtml

I have created the server_priv/nodes and mom_priv/mom.layout files.

Here are the versions of software:

[root at node11 bin]# pbs_mom -v
version: 3.0.4

[root at mgt1 server_priv]# pbs_server -v
version: 3.0.4

lstopo shows:

[root at node11 bin]# ./lstopo
Machine (24GB)
  NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)

Mom.layout:

cpus=0-5 mem=0
cpus=6-11 mem=1

server_priv/nodes:

node11 num_numa_nodes=2 compute

I restarted pbs_server on the management node and pbs_mom on node11.
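For anyone reproducing this setup, one quick sanity check is that mom.layout describes as many NUMA nodes as num_numa_nodes in server_priv/nodes. What follows is only a rough Python sketch and not part of TORQUE: the function names are made up here, it handles just the cpus= and mem= fields shown in the message above, and accepting comma-separated lists is an assumption.

```python
def parse_range(spec):
    """Expand a spec such as '0-5', '3' or '0-2,6' into a sorted list of ints.

    Comma-separated lists are an assumption; the layout in this thread
    only uses simple 'lo-hi' ranges and single values.
    """
    values = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            values.update(range(int(lo), int(hi) + 1))
        else:
            values.add(int(part))
    return sorted(values)


def parse_layout(text):
    """Return one dict per non-empty mom.layout line (hypothetical helper)."""
    nodes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each token looks like key=value, e.g. cpus=0-5 or mem=0.
        fields = dict(token.split("=", 1) for token in line.split())
        nodes.append({
            "cpus": parse_range(fields["cpus"]),
            "mem": parse_range(fields["mem"]),
        })
    return nodes


# The layout posted in this thread.
layout = "cpus=0-5 mem=0\ncpus=6-11 mem=1\n"
numa_nodes = parse_layout(layout)

# Value of num_numa_nodes from the server_priv/nodes line in this thread.
num_numa_nodes = 2
layout_matches = (len(numa_nodes) == num_numa_nodes)
print("layout lines:", len(numa_nodes), "matches nodes file:", layout_matches)
```

With the files posted here the two counts agree, which fits the later finding in the thread that the layout itself was not the problem.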
pbsnodes -a shows

node11-0
     state = down
     np = 0
     properties = compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

node11-1
     state = down
     np = 0
     properties = compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

mom_log on node11 has:

04/17/2012 19:45:01;0002; pbs_mom;Svr;Log;Log opened
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 0
04/17/2012 19:45:01;0002; pbs_mom;Svr;setpbsserver;mgt1
04/17/2012 19:45:01;0002; pbs_mom;Svr;mom_server_add;server mgt1 added
04/17/2012 19:45:01;0002; pbs_mom;Svr;setremchkptdirlist;added RemChkptDir[0] '/fastscratch/tmp'
04/17/2012 19:45:01;0002; pbs_mom;Svr;settmpdir;/fastscratch/tmp
04/17/2012 19:45:01;0002; pbs_mom;Svr;setloglevel;7
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/home /home
04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/home /home
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/scratch /scratch
04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/scratch /scratch
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$spool_as_final_name true
04/17/2012 19:45:01;0002; pbs_mom;Svr;spoolasfinalname;true
04/17/2012 19:45:01;0002; pbs_mom;n/a;initialize;independent
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_open_poll;started
04/17/2012 19:45:01;0080; pbs_mom;Svr;mom_get_sample;proc_array load started
04/17/2012 19:45:01;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=202
04/17/2012 19:45:01;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
04/17/2012 19:45:01;0001; pbs_mom;Svr;pbs_mom;init_abort_jobs: recover=2
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Is up
04/17/2012 19:45:01;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1334684029
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 7
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;checking for old pbs_mom logs in dir '/var/spool/torque/mom_logs' (older than 1 days)
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: trying to open RPP conn to mgt1 port 15001
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: added connection to mgt1 port 15001
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1
04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello
04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello done. Sent count = 1
04/17/2012 19:45:03;0008; pbs_mom;Job;do_rpp;got an inter-server request
04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;stream 0 version 2
04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received

My problem, as illustrated by the pbsnodes output above, is that node11 is down, and running strace on the pbs_mom process does not show any access to the mom.layout file.

So did I really compile NUMA support? I can see references to NUMA in the Makefile for torque, and the config.log definitely has the right parameters:

$ ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp

Can anyone provide further illumination on my already dark dreary day?
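A rough way to check a mom_log for the NUMA startup messages quoted earlier in this thread is sketched below in Python. The helper name and its return values are invented for illustration. Only the message fragments come from the thread, and nothing is assumed about the log prefix that surrounds them.

```python
import re

# Message fragments quoted by David Beer in this thread. A NUMA-enabled
# pbs_mom is expected to log the first one, or one of the two errors.
NUMA_OK = re.compile(r"Setting up this mom to function as (\d+) numa nodes")
NUMA_ERRORS = (
    "Malformed mom.layout file",
    "Unable to read the layout file",
)


def numa_status(log_lines):
    """Classify a mom_log excerpt (hypothetical helper, not a TORQUE tool).

    Returns ("enabled", n), ("layout-error", line) or
    ("no-numa-messages", None).
    """
    for line in log_lines:
        match = NUMA_OK.search(line)
        if match:
            return ("enabled", int(match.group(1)))
        if any(fragment in line for fragment in NUMA_ERRORS):
            return ("layout-error", line.strip())
    return ("no-numa-messages", None)


# Two lines taken from the mom_log excerpt posted in this thread.
sample = [
    "04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 7",
    "04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Is up",
]
status = numa_status(sample)
print(status)
```

A result of "no-numa-messages" for the whole log would be consistent with the explanation given in the thread, namely that the running pbs_mom binary was built without NUMA support.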
Thanks,

Randall Svancara
High Performance Computing Systems Administrator
Washington State University

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

-- 
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120418/f41031d7/attachment-0001.html

From dbeer at adaptivecomputing.com Thu Apr 19 09:15:57 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Thu, 19 Apr 2012 09:15:57 -0600
Subject: [torqueusers] NUMA -- A first try
In-Reply-To: <1F880D7A2494B346B5AB96481EAE704A15BA3F@EXMB-03.ad.wsu.edu>
References: <1F880D7A2494B346B5AB96481EAE704A15A536@EXMB-03.ad.wsu.edu> <1F880D7A2494B346B5AB96481EAE704A15A744@EXMB-03.ad.wsu.edu> <1F880D7A2494B346B5AB96481EAE704A15B97C@EXMB-03.ad.wsu.edu> <1F880D7A2494B346B5AB96481EAE704A15BA3F@EXMB-03.ad.wsu.edu>
Message-ID:

Randall,

I apologize for the problem. We will log a bug internally and make sure this is corrected.

David

On Wed, Apr 18, 2012 at 5:29 PM, Svancara, Randall wrote:

> Just a heads up, the reason NUMA was not being built in the RPM is
> because the buildutils/torque.spec.in file does not include the
> %{ac_with_numa} parameter in the ./configure section. Otherwise a regular
> build would work fine.
-- 
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120419/9f1f4c80/attachment-0001.html

From leon at if.ufrgs.br Wed Apr 18 16:26:55 2012
From: leon at if.ufrgs.br (Leonardo Gregory Brunnet)
Date: Wed, 18 Apr 2012 19:26:55 -0300
Subject: [torqueusers] pbsnodes reports the same job running many times
Message-ID: <4F8F3FAF.5000104@if.ufrgs.br>

Dear All,

In a freshly installed torque/maui cluster, the server reports repeated execution of a job on a given node. (No job is running MPI!) The pbsnodes output for one such node is:

node131
     state = job-exclusive
     np = 4
     properties = quadcore
     ntype = cluster
     jobs = 0/78898.master.cluster.XX.XX.XX, 1/78898.master.cluster.XX.XX.XX, 2/78898.master.cluster.XX.XX.XX, 3/78898.master.XX.XX.XX
     status = rectime=1334786811,varattr=,jobs=78898.master.cluster.if.ufrgs.br,state=free,netload=2914588064,gres=,loadave=1.00,ncpus=4,physmem=3985876kb,availmem=4649240kb,totmem=5062188kb,idletime=535832,nusers=2,nsessions=2,sessions=2804 8224,uname=Linux node131 2.6.23-1-amd64 #1 SMP Fri Oct 12 23:45:48 UTC 2007 x86_64,opsys=linux
     gpus = 0

But if we log in to that node, we see what was expected: a single job. Since the torque server (or maui) "believes" all CPUs of that node are busy, no other jobs are sent there. Any clues?

Thanks for the help!

Leonardo

Below is the output of

# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue padrao
#
create queue padrao
set queue padrao queue_type = Execution
set queue padrao resources_default.nodes = 7
set queue padrao resources_default.walltime = 01:00:00
set queue padrao max_user_run = 5
set queue padrao enabled = True
set queue padrao started = True
#
# Create and define queue um_mes
#
create queue um_mes
set queue um_mes queue_type = Execution
set queue um_mes resources_max.nodes = 7
set queue um_mes resources_default.nodes = 7
set queue um_mes resources_default.walltime = 720:00:00
set queue um_mes max_user_run = 5
set queue um_mes enabled = True
set queue um_mes started = True
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Create and define queue um_dia
#
create queue um_dia
set queue um_dia queue_type = Execution
set queue um_dia resources_max.nodes = 7
set queue um_dia resources_default.nodes = 7
set queue um_dia resources_default.walltime = 24:00:00
set queue um_dia max_user_run = 7
set queue um_dia enabled = True
set queue um_dia started = True
#
# Create and define queue uma_semana
#
create queue uma_semana
set queue uma_semana queue_type = Execution
set queue uma_semana resources_max.nodes = 7
set queue uma_semana resources_default.nodes = 7
set queue uma_semana resources_default.walltime = 168:00:00
set queue uma_semana max_user_run = 5
set queue uma_semana enabled = True
set queue uma_semana started = True
#
# Create and define queue route
#
create queue route
set queue route queue_type = Route
set queue route route_destinations = padrao
set queue route route_destinations += padrao2
set queue route enabled = True
set queue route started = True
#
# Set server attributes.
# set server scheduling = True set server acl_hosts = master.cluster.XX.XX.XX set server acl_hosts += clusterapg set server managers = root at master.cluster.XX.XX.XX set server operators = root at master.cluster.XX.XX.XX set server default_queue = padrao set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 6 set server mom_job_sync = True set server keep_completed = 300 set server next_job_number = 79033 -- Leonardo Gregory Brunnet E-mail: leon at if.ufrgs.br Instituto de Fisica - UFRGS http://pcleon.if.ufrgs.br 91501-970 Porto Alegre, RS, BRASIL Phone: (51) 33 08 72 51 FAX +55 51 33 08 72 86 C.P. 15051 Linux User: 39314 From rsvancara at wsu.edu Thu Apr 19 09:23:16 2012 From: rsvancara at wsu.edu (Svancara, Randall) Date: Thu, 19 Apr 2012 15:23:16 +0000 Subject: [torqueusers] NUMA -- A first try In-Reply-To: References: <1F880D7A2494B346B5AB96481EAE704A15A536@EXMB-03.ad.wsu.edu> <1F880D7A2494B346B5AB96481EAE704A15A744@EXMB-03.ad.wsu.edu> <1F880D7A2494B346B5AB96481EAE704A15B97C@EXMB-03.ad.wsu.edu> <1F880D7A2494B346B5AB96481EAE704A15BA3F@EXMB-03.ad.wsu.edu> Message-ID: <1F880D7A2494B346B5AB96481EAE704A15BD9A@EXMB-03.ad.wsu.edu> No worries, I am just happy that I could figure out a problem and contribute to such a great application. If there is something I could do to help file bugs, let me know. Thanks, Randall Svancara High Performance Computing Systems Administrator Washington State University 509-335-3039 From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer Sent: Thursday, April 19, 2012 8:16 AM To: Torque Users Mailing List Subject: Re: [torqueusers] NUMA -- A first try Randall, I apologize for the problem. We will log a bug internally and make sure this is corrected. 
David

On Wed, Apr 18, 2012 at 5:29 PM, Svancara, Randall wrote:

Just a heads up, the reason NUMA was not being built in the RPM is that the buildutils/torque.spec.in file does not include the %{ac_with_numa} parameter in the ./configure section. Otherwise a regular build would work fine.

187c184
< --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} %{ac_with_numa} \
---
> --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} \

I could not figure it out exactly, as I am still learning about spec files, but these lines may also need to change:

19c19
< #%bcond_with blcr
---
> %bcond_with blcr
25c25
< #%bcond_with numa
---
> %bcond_with numa
35,37d34
< %bcond_without bclr
< %bcond_without numa
<

I am not sure if the ./configure overrides these values or not.

Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 2:51 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try

There are no additional libraries that you need to supply.

On Wed, Apr 18, 2012 at 3:46 PM, Svancara, Randall wrote:

Ok, well this gives me a starting place at least. Are there additional libraries I need to supply?

I build the software the following way:

./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp
make rpm

I believe this is the relevant section out of the config.log.

configure:22026: $? = 0
configure:22029: test -s conftest.o
configure:22032: $?
= 0 configure:22043: result: yes configure:22257: checking whether to allow geometry requests configure:22274: result: no configure:22285: checking whether to support NUMA systems configure:22288: result: yes configure:22313: checking whether to enable libcpuset support configure:22399: result: no configure:22407: checking whether to enable memacct support configure:22416: result: no configure:22510: checking whether add memory alignment flags configure:22517: result: no configure:22604: checking whether to build BLCR support configure:22606: result: yes configure:22627: checking for cr_init in -lcr configure:22657: gcc -o conftest -g -O2 -D_LARGEFILE64_SOURCE -DNUMA_SUPPORT -L/usr/lib -lcr conftest.c -lcr >&5 Randall Svancara High Performance Computing Systems Administrator Washington State University 509-335-3039 From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer Sent: Wednesday, April 18, 2012 2:14 PM To: Torque Users Mailing List Subject: Re: [torqueusers] NUMA -- A first try Randall, After looking closer at your logs, it appears that the pbs_mom binary wasn't numa enabled. If it were, you'd have a message saying: Setting up this mom to function as %d numa nodes - in your case that %d would be a 2. or you'd have one of these error messages: Malformed mom.layout file, line:\n%s\n Unable to read the layout file in %s David On Wed, Apr 18, 2012 at 1:26 PM, Svancara, Randall > wrote: Hey, good to know that I did do something correct. I have validated the mom.layout file is in /var/spool/torque/mom_priv/mom.layout. 
4 -rw-r--r-- 1 root root 185 Apr 17 19:44 config
4 -rwxr-xr-x 1 root root 708 Apr 5 2011 epilogue
4 -rwxrwxrwx 1 root root 708 Apr 5 2011 epilogue.sh
0 drwxr-x--x 2 root root 40 Apr 17 10:33 jobs
4 -rwxr--r-- 1 root root 31 Apr 17 19:23 mom.layout
4 -rwxr--r-- 1 root root 50 Apr 17 19:20 mom.layout_bak
4 -rw-r--r-- 1 root root 32 Apr 17 15:26 mom.layout_old
4 -rw-r--r-- 1 root root 7 Apr 17 19:45 mom.lock
4 -rwxr-xr-x 1 root root 527 Apr 26 2011 prologue
4 -rwxrwxrwx 1 root root 527 Apr 5 2011 prologue.sh
4 -rwxr-xr-x 1 root root 203 Apr 5 2011 setperms.sh

Thanks,

Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 12:16 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try

Randall,

You did compile numa support. You can know this because you get node11-0 and node11-1 in your pbsnodes output. Is your mom.layout file in the correct location? It should be in mom_priv/mom.layout.

David

On Wed, Apr 18, 2012 at 9:44 AM, Svancara, Randall wrote:

Hi,

I have compiled torque 3.0.4 with NUMA support per this document.
http://www.clusterresources.com/torquedocs21/1.7torqueonnuma.shtml

I have created the server_priv/nodes and mom_priv/mom.layout file.

Here are the versions of software:

[root at node11 bin]# pbs_mom -v
version: 3.0.4

[root at mgt1 server_priv]# pbs_server -v
version: 3.0.4

lstopo shows:

[root at node11 bin]# ./lstopo
Machine (24GB)
  NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)

Mom.layout:

cpus=0-5 mem=0
cpus=6-11 mem=1

server_priv/nodes:

node11 num_numa_nodes=2 compute

I restart pbs_server on management node and pbs_mom on node11.
pbsnodes -a shows node11-0 state = down np = 0 properties = compute ntype = cluster mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 node11-1 state = down np = 0 properties = compute ntype = cluster mom_service_port = 15002 mom_manager_port = 15003 gpus = 0 mom_log on node11 has: 04/17/2012 19:45:01;0002; pbs_mom;Svr;Log;Log opened 04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 0 04/17/2012 19:45:01;0002; pbs_mom;Svr;setpbsserver;mgt1 04/17/2012 19:45:01;0002; pbs_mom;Svr;mom_server_add;server mgt1 added 04/17/2012 19:45:01;0002; pbs_mom;Svr;setremchkptdirlist;added RemChkptDir[0] '/fastscratch/tmp' 04/17/2012 19:45:01;0002; pbs_mom;Svr;settmpdir;/fastscratch/tmp 04/17/2012 19:45:01;0002; pbs_mom;Svr;setloglevel;7 04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/home /home 04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/home /home 04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/scratch /scratch 04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/scratch /scratch 04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$spool_as_final_name true 04/17/2012 19:45:01;0002; pbs_mom;Svr;spoolasfinalname;true 04/17/2012 19:45:01;0002; pbs_mom;n/a;initialize;independent 04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_open_poll;started 04/17/2012 19:45:01;0080; pbs_mom;Svr;mom_get_sample;proc_array load started 04/17/2012 19:45:01;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=202 04/17/2012 19:45:01;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs 04/17/2012 19:45:01;0001; pbs_mom;Svr;pbs_mom;init_abort_jobs: recover=2 04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Is up 04/17/2012 19:45:01;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1334684029 04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 7 04/17/2012 19:45:01;0002; 
pbs_mom;Svr;pbs_mom;checking for old pbs_mom logs in dir '/var/spool/torque/mom_logs' (older than 1 days) 04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: trying to open RPP conn to mgt1 port 15001 04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: added connection to mgt1 port 15001 04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1 04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello 04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello done. Sent count = 1 04/17/2012 19:45:03;0008; pbs_mom;Job;do_rpp;got an inter-server request 04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;stream 0 version 2 04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received My problem as illustrated from the pbsnodes command above is that node11 is down. And running strace on the pbs_mom process does not indicate any access to the mom.layout file? So did I really compile NUMA support? I can see references to NUMA in the Makefile for torque and the config.log definitely has the right parameters: $ ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp Can anyone provide further illumination on my already dark dreary day? 
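The "did I really compile NUMA support?" question can also be checked from the mom log alone, by looking for the startup messages David quotes above. What follows is only a minimal sketch in Python, assuming those messages appear in mom_logs with the wording given in this thread; the function name numa_status and the two sample lines are made up for illustration and are not part of Torque.

```python
import re

# Message fragments quoted earlier in this thread; assumed to appear
# verbatim in the pbs_mom log of a NUMA-enabled build.
NUMA_OK = re.compile(r"Setting up this mom to function as (\d+) numa nodes")
NUMA_ERRORS = ("Malformed mom.layout file", "Unable to read the layout file")


def numa_status(log_lines):
    """Classify a pbs_mom log as 'enabled:<n>', 'layout-error' or 'no-numa-messages'."""
    for line in log_lines:
        match = NUMA_OK.search(line)
        if match:
            return "enabled:" + match.group(1)
        if any(fragment in line for fragment in NUMA_ERRORS):
            return "layout-error"
    return "no-numa-messages"


# Hypothetical sample, modelled on the log excerpt above.
sample = [
    "04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Is up",
    "04/17/2012 19:45:01;0002; pbs_mom;n/a;initialize;independent",
]
print(numa_status(sample))
```

A result of no-numa-messages is what one would expect from a pbs_mom binary that was not built with NUMA support, which matches the diagnosis given in the replies.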
Thanks, Randall Svancara High Performance Computing Systems Administrator Washington State University _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120419/bb46ec53/attachment-0001.html From dbeer at adaptivecomputing.com Thu Apr 19 09:32:05 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Thu, 19 Apr 2012 09:32:05 -0600 Subject: [torqueusers] pbsnodes reports the same job running many times In-Reply-To: <4F8F3FAF.5000104@if.ufrgs.br> References: <4F8F3FAF.5000104@if.ufrgs.br> Message-ID: What is the qstat -f output for this job? pbs_server reports dedicated resources in pbsnodes output, so if the job requests 4 execution slots and runs one process, pbs_server will report 4 execution slots as occupied. Conversely, if a job asks for one execution slot and uses 4, pbs_server will not scale up what it reports. David On Wed, Apr 18, 2012 at 4:26 PM, Leonardo Gregory Brunnet wrote: > Dear All, > > In a fresh installed torque/maui cluster the server reports > repeated execution of a job in a given node.
(There is no job running > mpi)!. > > The output for pbsnodes for one given node gives: > > node131 > state = job-exclusive > np = 4 > properties = quadcore > ntype = cluster > jobs = 0/78898.master.cluster.XX.XX.XX, > 1/78898.master.cluster.XX.XX.XX, 2/78898.master.cluster.XX.XX.XX, > 3/78898.master.XX.XX.XX > status = > rectime=1334786811,varattr=,jobs=78898.master.cluster.if.ufrgs.br > ,state=free,netload=2914588064,gres=,loadave=1.00,ncpus=4,physmem=3985876kb,availmem=4649240kb,totmem=5062188kb,idletime=535832,nusers=2,nsessions=2,sessions=2804 > 8224,uname=Linux node131 2.6.23-1-amd64 #1 SMP Fri Oct 12 23:45:48 UTC > 2007 x86_64,opsys=linux > gpus = 0 > > But, if we log in that node we will see what was expected, a single job. > Since the torque server (or maui) "believes" all cpu's of that node are > working, > no other jobs are sent. Any clues ? > > Thanks for the help! > > Leonardo > > Below, you find the output for > # qmgr -c "p s" > > # > # Create queues and set their attributes. 
> # > # > # Create and define queue padrao > # > create queue padrao > set queue padrao queue_type = Execution > set queue padrao resources_default.nodes = 7 > set queue padrao resources_default.walltime = 01:00:00 > set queue padrao max_user_run = 5 > set queue padrao enabled = True > set queue padrao started = True > # > # Create and define queue um_mes > # > create queue um_mes > set queue um_mes queue_type = Execution > set queue um_mes resources_max.nodes = 7 > set queue um_mes resources_default.nodes = 7 > set queue um_mes resources_default.walltime = 720:00:00 > set queue um_mes max_user_run = 5 > set queue um_mes enabled = True > set queue um_mes started = True > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 01:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Create and define queue um_dia > # > create queue um_dia > set queue um_dia queue_type = Execution > set queue um_dia resources_max.nodes = 7 > set queue um_dia resources_default.nodes = 7 > set queue um_dia resources_default.walltime = 24:00:00 > set queue um_dia max_user_run = 7 > set queue um_dia enabled = True > set queue um_dia started = True > # > # Create and define queue uma_semana > # > create queue uma_semana > set queue uma_semana queue_type = Execution > set queue uma_semana resources_max.nodes = 7 > set queue uma_semana resources_default.nodes = 7 > set queue uma_semana resources_default.walltime = 168:00:00 > set queue uma_semana max_user_run = 5 > set queue uma_semana enabled = True > set queue uma_semana started = True > # > # Create and define queue route > # > create queue route > set queue route queue_type = Route > set queue route route_destinations = padrao > set queue route route_destinations += padrao2 > set queue route enabled = True > set queue route started = True > # > # Set server attributes. 
> # > set server scheduling = True > set server acl_hosts = master.cluster.XX.XX.XX > set server acl_hosts += clusterapg > set server managers = root at master.cluster.XX.XX.XX > set server operators = root at master.cluster.XX.XX.XX > set server default_queue = padrao > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server mom_job_sync = True > set server keep_completed = 300 > set server next_job_number = 79033 > > -- > Leonardo Gregory Brunnet E-mail: leon at if.ufrgs.br > Instituto de Fisica - UFRGS http://pcleon.if.ufrgs.br > 91501-970 Porto Alegre, RS, BRASIL Phone: (51) 33 08 72 51 > FAX +55 51 33 08 72 86 C.P. 15051 > Linux User: 39314 > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120419/2ff0f0d8/attachment.html From gus at ldeo.columbia.edu Thu Apr 19 09:44:44 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 19 Apr 2012 11:44:44 -0400 Subject: [torqueusers] pbsnodes reports the same job running many times In-Reply-To: <4F8F3FAF.5000104@if.ufrgs.br> References: <4F8F3FAF.5000104@if.ufrgs.br> Message-ID: <4F9032EC.20909@ldeo.columbia.edu> Hi Leonardo Not sure if I understood the problem right. I guess the job is legitimate and running, but it surprises you that it is using four processors, right? Did the user request four processors, perhaps, even though he/she is running a serial job? #PBS -l nodes=1:ppn=4 This may be reasonable, say, if his/her job needs a lot of RAM, but the job is serial [or if it is Matlab ... the king of memory-greediness ...] 
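To check how many execution slots a job holds on a node, the jobs attribute printed by pbsnodes can be tallied. This is a minimal sketch in Python, assuming the attribute is a comma-separated list of slot/jobid entries as in the quoted output; the helper name slots_per_job and the shortened job id are illustrative only.

```python
from collections import Counter


def slots_per_job(jobs_attr):
    """Count execution slots per job id in a pbsnodes 'jobs = ...' value."""
    counts = Counter()
    for entry in jobs_attr.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate an empty attribute or a trailing comma
        _slot, job_id = entry.split("/", 1)
        counts[job_id] += 1
    return dict(counts)


# Hypothetical value shaped like the pbsnodes output quoted in this thread.
jobs_attr = "0/78898.master, 1/78898.master, 2/78898.master, 3/78898.master"
print(slots_per_job(jobs_attr))
```

If a serial job shows a count equal to np, that suggests the request (or something applied on top of it) asked for more than one slot on that node.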
Also, beware of JOBNODEMATCHPOLICY in Maui [maui.cfg]: http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php If set to EXACTNODE full nodes will be allocated. I hope this helps, Gus Correa On 04/18/2012 06:26 PM, Leonardo Gregory Brunnet wrote: > Dear All, > > In a fresh installed torque/maui cluster the server reports > repeated execution of a job in a given node. (There is no job running > mpi)!. > > The output for pbsnodes for one given node gives: > > node131 > state = job-exclusive > np = 4 > properties = quadcore > ntype = cluster > jobs = 0/78898.master.cluster.XX.XX.XX, > 1/78898.master.cluster.XX.XX.XX, 2/78898.master.cluster.XX.XX.XX, > 3/78898.master.XX.XX.XX > status = > rectime=1334786811,varattr=,jobs=78898.master.cluster.if.ufrgs.br,state=free,netload=2914588064,gres=,loadave=1.00,ncpus=4,physmem=3985876kb,availmem=4649240kb,totmem=5062188kb,idletime=535832,nusers=2,nsessions=2,sessions=2804 > 8224,uname=Linux node131 2.6.23-1-amd64 #1 SMP Fri Oct 12 23:45:48 UTC > 2007 x86_64,opsys=linux > gpus = 0 > > But, if we log in that node we will see what was expected, a single job. > Since the torque server (or maui) "believes" all cpu's of that node are > working, > no other jobs are sent. Any clues ? > > Thanks for the help! > > Leonardo > > Below, you find the output for > # qmgr -c "p s" > > # > # Create queues and set their attributes. 
> # > # > # Create and define queue padrao > # > create queue padrao > set queue padrao queue_type = Execution > set queue padrao resources_default.nodes = 7 > set queue padrao resources_default.walltime = 01:00:00 > set queue padrao max_user_run = 5 > set queue padrao enabled = True > set queue padrao started = True > # > # Create and define queue um_mes > # > create queue um_mes > set queue um_mes queue_type = Execution > set queue um_mes resources_max.nodes = 7 > set queue um_mes resources_default.nodes = 7 > set queue um_mes resources_default.walltime = 720:00:00 > set queue um_mes max_user_run = 5 > set queue um_mes enabled = True > set queue um_mes started = True > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 01:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Create and define queue um_dia > # > create queue um_dia > set queue um_dia queue_type = Execution > set queue um_dia resources_max.nodes = 7 > set queue um_dia resources_default.nodes = 7 > set queue um_dia resources_default.walltime = 24:00:00 > set queue um_dia max_user_run = 7 > set queue um_dia enabled = True > set queue um_dia started = True > # > # Create and define queue uma_semana > # > create queue uma_semana > set queue uma_semana queue_type = Execution > set queue uma_semana resources_max.nodes = 7 > set queue uma_semana resources_default.nodes = 7 > set queue uma_semana resources_default.walltime = 168:00:00 > set queue uma_semana max_user_run = 5 > set queue uma_semana enabled = True > set queue uma_semana started = True > # > # Create and define queue route > # > create queue route > set queue route queue_type = Route > set queue route route_destinations = padrao > set queue route route_destinations += padrao2 > set queue route enabled = True > set queue route started = True > # > # Set server attributes. 
> #
> set server scheduling = True
> set server acl_hosts = master.cluster.XX.XX.XX
> set server acl_hosts += clusterapg
> set server managers = root at master.cluster.XX.XX.XX
> set server operators = root at master.cluster.XX.XX.XX
> set server default_queue = padrao
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server keep_completed = 300
> set server next_job_number = 79033
>

From bdandrus at nps.edu Thu Apr 19 09:53:00 2012
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Thu, 19 Apr 2012 15:53:00 +0000
Subject: [torqueusers] server_managers cannot submit jobs?
Message-ID:

All,

I just upgraded from torque 2.5.10 to 2.5.11. Now when 'I' try to submit even a basic interactive job I get:

qsub: Job rejected by all possible destinations (check syntax, queue resources, ...)

I tracked it down to the fact that I have my account listed as a server manager:

set server managers = bdandrus@*
set server managers += root@*

If I take my account out, I am able to submit jobs again. This wasn't an issue in the past, but it may have been a few versions past. I don't want root to be able to submit jobs, so that part was good, but _I_ want to be able to!

Any insight would be helpful.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120419/c93c4cdf/attachment.html

From leon at if.ufrgs.br Thu Apr 19 14:07:34 2012
From: leon at if.ufrgs.br (Leonardo Gregory Brunnet)
Date: Thu, 19 Apr 2012 17:07:34 -0300
Subject: [torqueusers] pbsnodes reports the same job running many times
In-Reply-To:
References: <4F8F3FAF.5000104@if.ufrgs.br>
Message-ID: <4F907086.8010602@if.ufrgs.br>

David,

Thanks for the reply.
I have checked the qsub script. It seems that a single CPU is requested (#PBS -l ncpus=1), but maybe the correct argument should be #PBS -l nodes=1

qsub script
**************************************
#PBS -j oe
#PBS -l ncpus=1
#PBS -q um_mes
#PBS -N dif_t170_u106_6
work_dir=$PBS_O_WORKDIR
cd $work_dir
./script_6
**************************************

But the output of qstat -f (below) seems to indicate that 4 CPUs have been requested...

Leonardo

*************************************
Job Id: 78898.master
    Job_Name = dif_u91_t170_1
    Job_Owner = andressa at master.cluster
    resources_used.cput = 5124169:16:44
    resources_used.mem = 6204kb
    resources_used.vmem = 52424kb
    resources_used.walltime = 76:15:25
    job_state = R
    queue = uma_semana
    server = master.cluster
    Checkpoint = u
    ctime = Mon Apr 16 12:35:53 2012
    Error_Path = master.cluster:/home103/andressa/ramp_system/dif_u91_t170_1.e78898
    exec_host = node131/3+node131/2+node131/1+node131/0+node123/2+node123/1+node123/0
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Apr 16 12:36:40 2012
    Output_Path = master.cluster:/home103/andressa/ramp_system/dif_u91_t170_1.o78898
    Priority = 0
    qtime = Mon Apr 16 12:35:53 2012
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.neednodes = 7
    Resource_List.nodect = 7
    Resource_List.nodes = 7
    Resource_List.walltime = 168:00:00
    session_id = 8224
    substate = 42
    Variable_List = PBS_O_QUEUE=uma_semana,
        PBS_O_HOST=master.cluster,PBS_O_HOME=/home103/andressa,
        PBS_O_LANG=pt_BR.UTF-8,PBS_O_LOGNAME=andressa,
        PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/opt/intel/composer_xe_2011_sp1.6.233/bin/intel64/,
        PBS_O_MAIL=/var/mail/andressa,PBS_O_SHELL=/bin/bash,
        PBS_SERVER=master.cluster,
        PBS_O_WORKDIR=/home103/andressa/ramp_system
    euser = andressa
    egroup = 1030
    hashname = 78898.master.cluster
    queue_rank = 24
    queue_type = E
    etime = Mon Apr 16 12:35:53 2012
    submit_args = ./script_minuano_1
    start_time = Mon Apr 16 12:35:54 2012
    Walltime.Remaining = 330258
start_count = 1 fault_tolerant = False submit_host = master.cluster init_work_dir = /home103/andressa/ramp_system **************************************** On 19-04-2012 12:32, David Beer wrote: > What is the qstat -f output for this job? pbs_server reports dedicated > resources in pbsnodes output, so if the job requests 4 execution slots > and runs one process, pbs_server will report 4 execution slots as > occupied. Conversely, if a job asks for one execution slot and uses 4, > pbs_server will not scale up what it reports. > > David > > On Wed, Apr 18, 2012 at 4:26 PM, Leonardo Gregory Brunnet > > wrote: > > Dear All, > > In a fresh installed torque/maui cluster the server reports > repeated execution of a job in a given node. (There is no job running > mpi)!. > > The output for pbsnodes for one given node gives: > > node131 > state = job-exclusive > np = 4 > properties = quadcore > ntype = cluster > jobs = 0/78898.master.cluster.XX.XX.XX, > 1/78898.master.cluster.XX.XX.XX, 2/78898.master.cluster.XX.XX.XX, > 3/78898.master.XX.XX.XX > status = > rectime=1334786811,varattr=,jobs=78898.master.cluster.if.ufrgs.br > ,state=free,netload=2914588064,gres=,loadave=1.00,ncpus=4,physmem=3985876kb,availmem=4649240kb,totmem=5062188kb,idletime=535832,nusers=2,nsessions=2,sessions=2804 > 8224,uname=Linux node131 2.6.23-1-amd64 #1 SMP Fri Oct 12 23:45:48 UTC > 2007 x86_64,opsys=linux > gpus = 0 > > But, if we log in that node we will see what was expected, a > single job. > Since the torque server (or maui) "believes" all cpu's of that > node are > working, > no other jobs are sent. Any clues ? > > Thanks for the help! > > Leonardo > > Below, you find the output for > # qmgr -c "p s" > > # > # Create queues and set their attributes. 
> # > # > # Create and define queue padrao > # > create queue padrao > set queue padrao queue_type = Execution > set queue padrao resources_default.nodes = 7 > set queue padrao resources_default.walltime = 01:00:00 > set queue padrao max_user_run = 5 > set queue padrao enabled = True > set queue padrao started = True > # > # Create and define queue um_mes > # > create queue um_mes > set queue um_mes queue_type = Execution > set queue um_mes resources_max.nodes = 7 > set queue um_mes resources_default.nodes = 7 > set queue um_mes resources_default.walltime = 720:00:00 > set queue um_mes max_user_run = 5 > set queue um_mes enabled = True > set queue um_mes started = True > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 01:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Create and define queue um_dia > # > create queue um_dia > set queue um_dia queue_type = Execution > set queue um_dia resources_max.nodes = 7 > set queue um_dia resources_default.nodes = 7 > set queue um_dia resources_default.walltime = 24:00:00 > set queue um_dia max_user_run = 7 > set queue um_dia enabled = True > set queue um_dia started = True > # > # Create and define queue uma_semana > # > create queue uma_semana > set queue uma_semana queue_type = Execution > set queue uma_semana resources_max.nodes = 7 > set queue uma_semana resources_default.nodes = 7 > set queue uma_semana resources_default.walltime = 168:00:00 > set queue uma_semana max_user_run = 5 > set queue uma_semana enabled = True > set queue uma_semana started = True > # > # Create and define queue route > # > create queue route > set queue route queue_type = Route > set queue route route_destinations = padrao > set queue route route_destinations += padrao2 > set queue route enabled = True > set queue route started = True > # > # Set server attributes. 
> # > set server scheduling = True > set server acl_hosts = master.cluster.XX.XX.XX > set server acl_hosts += clusterapg > set server managers = root at master.cluster.XX.XX.XX > set server operators = root at master.cluster.XX.XX.XX > set server default_queue = padrao > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server mom_job_sync = True > set server keep_completed = 300 > set server next_job_number = 79033 > > -- > Leonardo Gregory Brunnet E-mail: leon at if.ufrgs.br > > Instituto de Fisica - UFRGS http://pcleon.if.ufrgs.br > 91501-970 Porto Alegre, RS, BRASIL Phone: (51) 33 08 72 51 > > FAX +55 51 33 08 72 86 > C.P. 15051 > Linux User: 39314 > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > David Beer | Software Engineer > Adaptive Computing > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Leonardo Gregory Brunnet E-mail: leon at if.ufrgs.br Instituto de Fisica - UFRGS http://pcleon.if.ufrgs.br 91501-970 Porto Alegre, RS, BRASIL Phone: (51) 33 08 72 51 FAX +55 51 33 08 72 86 C.P. 15051 Linux User: 39314 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120419/adaf9e26/attachment-0001.html

From leon at if.ufrgs.br Thu Apr 19 14:18:37 2012
From: leon at if.ufrgs.br (Leonardo Gregory Brunnet)
Date: Thu, 19 Apr 2012 17:18:37 -0300
Subject: [torqueusers] pbsnodes reports the same job running many times
In-Reply-To: <4F9032EC.20909@ldeo.columbia.edu>
References: <4F8F3FAF.5000104@if.ufrgs.br> <4F9032EC.20909@ldeo.columbia.edu>
Message-ID: <4F90731D.809@if.ufrgs.br>

Hi Gus,

Thanks for the answer.

Yes, I am surprised that it is using four processors. As I replied to David, the argument used in the qsub script was

...
#PBS -l ncpus=1
...

and I suppose this is correct. But in fact I don't know the difference between the line above and

#PBS -l nodes=1

I have also checked that maui.cfg has no setting for JOBNODEMATCHPOLICY, but I don't know what the default is. If EXACTNODE is the default, I should explicitly add a line to maui.cfg, correct?

Leonardo

On 19-04-2012 12:44, Gus Correa wrote:
> Hi Leonardo
>
> Not sure if I understood the problem right.
> I guess the job is legitimate and running,
> but it surprises you that it is using four processors,
> right?
>
> Did the user request four processors, perhaps,
> even though he/she is running a serial job?
> #PBS -l nodes=1:ppn=4
> This may be reasonable, say, if his/her job needs a lot
> of RAM, but the job is serial
> [or if it is Matlab ... the king of memory-greediness ...]
>
> Also, beware of JOBNODEMATCHPOLICY in Maui [maui.cfg]:
> http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php
> If set to EXACTNODE full nodes will be allocated.
>
> I hope this helps,
> Gus Correa
>
> On 04/18/2012 06:26 PM, Leonardo Gregory Brunnet wrote:
>
>> Dear All,
>>
>> In a fresh installed torque/maui cluster the server reports
>> repeated execution of a job in a given node. (There is no job running
>> mpi)!.
>> >> The output for pbsnodes for one given node gives: >> >> node131 >> state = job-exclusive >> np = 4 >> properties = quadcore >> ntype = cluster >> jobs = 0/78898.master.cluster.XX.XX.XX, >> 1/78898.master.cluster.XX.XX.XX, 2/78898.master.cluster.XX.XX.XX, >> 3/78898.master.XX.XX.XX >> status = >> rectime=1334786811,varattr=,jobs=78898.master.cluster.if.ufrgs.br,state=free,netload=2914588064,gres=,loadave=1.00,ncpus=4,physmem=3985876kb,availmem=4649240kb,totmem=5062188kb,idletime=535832,nusers=2,nsessions=2,sessions=2804 >> 8224,uname=Linux node131 2.6.23-1-amd64 #1 SMP Fri Oct 12 23:45:48 UTC >> 2007 x86_64,opsys=linux >> gpus = 0 >> >> But, if we log in that node we will see what was expected, a single job. >> Since the torque server (or maui) "believes" all cpu's of that node are >> working, >> no other jobs are sent. Any clues ? >> >> Thanks for the help! >> >> Leonardo >> >> Below, you find the output for >> # qmgr -c "p s" >> >> # >> # Create queues and set their attributes. >> # >> # >> # Create and define queue padrao >> # >> create queue padrao >> set queue padrao queue_type = Execution >> set queue padrao resources_default.nodes = 7 >> set queue padrao resources_default.walltime = 01:00:00 >> set queue padrao max_user_run = 5 >> set queue padrao enabled = True >> set queue padrao started = True >> # >> # Create and define queue um_mes >> # >> create queue um_mes >> set queue um_mes queue_type = Execution >> set queue um_mes resources_max.nodes = 7 >> set queue um_mes resources_default.nodes = 7 >> set queue um_mes resources_default.walltime = 720:00:00 >> set queue um_mes max_user_run = 5 >> set queue um_mes enabled = True >> set queue um_mes started = True >> # >> # Create and define queue batch >> # >> create queue batch >> set queue batch queue_type = Execution >> set queue batch resources_default.nodes = 1 >> set queue batch resources_default.walltime = 01:00:00 >> set queue batch enabled = True >> set queue batch started = True >> # >> # Create 
and define queue um_dia >> # >> create queue um_dia >> set queue um_dia queue_type = Execution >> set queue um_dia resources_max.nodes = 7 >> set queue um_dia resources_default.nodes = 7 >> set queue um_dia resources_default.walltime = 24:00:00 >> set queue um_dia max_user_run = 7 >> set queue um_dia enabled = True >> set queue um_dia started = True >> # >> # Create and define queue uma_semana >> # >> create queue uma_semana >> set queue uma_semana queue_type = Execution >> set queue uma_semana resources_max.nodes = 7 >> set queue uma_semana resources_default.nodes = 7 >> set queue uma_semana resources_default.walltime = 168:00:00 >> set queue uma_semana max_user_run = 5 >> set queue uma_semana enabled = True >> set queue uma_semana started = True >> # >> # Create and define queue route >> # >> create queue route >> set queue route queue_type = Route >> set queue route route_destinations = padrao >> set queue route route_destinations += padrao2 >> set queue route enabled = True >> set queue route started = True >> # >> # Set server attributes. >> # >> set server scheduling = True >> set server acl_hosts = master.cluster.XX.XX.XX >> set server acl_hosts += clusterapg >> set server managers = root at master.cluster.XX.XX.XX >> set server operators = root at master.cluster.XX.XX.XX >> set server default_queue = padrao >> set server log_events = 511 >> set server mail_from = adm >> set server scheduler_iteration = 600 >> set server node_check_rate = 150 >> set server tcp_timeout = 6 >> set server mom_job_sync = True >> set server keep_completed = 300 >> set server next_job_number = 79033 >> >> > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- Leonardo Gregory Brunnet E-mail: leon at if.ufrgs.br Instituto de Fisica - UFRGS http://pcleon.if.ufrgs.br 91501-970 Porto Alegre, RS, BRASIL Phone: (51) 33 08 72 51 FAX +55 51 33 08 72 86 C.P. 
15051
Linux User: 39314

From mej at lbl.gov Thu Apr 19 15:01:21 2012
From: mej at lbl.gov (Michael Jennings)
Date: Thu, 19 Apr 2012 14:01:21 -0700
Subject: [torqueusers] NUMA -- A first try
In-Reply-To: <1F880D7A2494B346B5AB96481EAE704A15BA3F@EXMB-03.ad.wsu.edu>
References: <1F880D7A2494B346B5AB96481EAE704A15A536@EXMB-03.ad.wsu.edu>
 <1F880D7A2494B346B5AB96481EAE704A15A744@EXMB-03.ad.wsu.edu>
 <1F880D7A2494B346B5AB96481EAE704A15B97C@EXMB-03.ad.wsu.edu>
 <1F880D7A2494B346B5AB96481EAE704A15BA3F@EXMB-03.ad.wsu.edu>
Message-ID: <20120419210119.GI9750@lbl.gov>

On Wednesday, 18 April 2012, at 23:29:59 (+0000),
Svancara, Randall wrote:

> Just a heads up, the reason NUMA was not being built in the RPM is
> because the buildutils/torque.spec.in file does not include the
> %{ac_with_numa} parameter in the ./configure section. Otherwise a
> regular build would work fine.
>
> 187c184
> < --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} %{ac_with_numa} \
> ---
> > --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} \

That's one reason. The other reason is that "--with numa" isn't
supported by "make rpm" yet. Only a small subset of the available
--with and --without rpmbuild options is supported by "make rpm" as
this is not the standard, "accepted" method of building RPM packages.

The "make rpm" system is there solely for people who would, for
whatever reason, rather specify all their customizations at
./configure time instead of rpmbuild time. Its use is not recommended
in the general case. Instead, use rpmbuild directly, e.g.:

    ./configure
    make dist
    rpmbuild -ta --with numa --with blcr torque-4.0.1.tar.gz

If you really do want to use "make rpm" anyway, support will need to
be added to configure.ac for it. Look for occurrences of RPM_AC_OPTS
in configure.ac to see how the existing support was done.
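For anyone scripting the rpmbuild route above, here is a minimal dry-run sketch in shell. It only prints the command line and does not run rpmbuild. The function name build_rpm_cmd is made up for illustration and is not part of TORQUE or rpmbuild, and the sketch assumes ./configure and "make dist" have already produced the tarball.

```shell
#!/bin/sh
# Dry-run helper: prints the rpmbuild command for a dist tarball plus a
# list of optional features, adding one "--with <feature>" per argument.
# build_rpm_cmd is a hypothetical name, not part of TORQUE or rpmbuild.
build_rpm_cmd() {
    tarball=$1
    shift
    cmd="rpmbuild -ta"
    for feature in "$@"; do
        cmd="$cmd --with $feature"
    done
    printf '%s %s\n' "$cmd" "$tarball"
}

# Assumes ./configure and "make dist" have already produced this tarball.
build_rpm_cmd torque-4.0.1.tar.gz numa blcr
```

With the arguments shown it prints the same rpmbuild line as in the message above. As far as I know, each feature still has to be declared in the spec file (a %bcond_with entry) for --with to have any effect.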
> I couldn't exactly figure it out as I am still learning about spec
> files, but these lines may also need to change:
>
> 19c19
> < #%bcond_with blcr
> ---
> > %bcond_with blcr
> 25c25
> < #%bcond_with numa
> ---
> > %bcond_with numa
> 35,37d34
> < %bcond_without bclr
> < %bcond_without numa
> <

No, you don't want to do that. Those should not be on by default.

And you don't want to comment out macros using #. While some macros
can be disabled this way, many cannot (the most common troublemaker
being %patch). Either double the percent sign or remove it before
adding the pound sign: #%%macro or #macro

HTH,
Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615

From gus at ldeo.columbia.edu Thu Apr 19 15:08:24 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Thu, 19 Apr 2012 17:08:24 -0400
Subject: [torqueusers] pbsnodes reports the same job running many times
In-Reply-To: <4F90731D.809@if.ufrgs.br>
References: <4F8F3FAF.5000104@if.ufrgs.br>
 <4F9032EC.20909@ldeo.columbia.edu>
 <4F90731D.809@if.ufrgs.br>
Message-ID: <4F907EC8.6030806@ldeo.columbia.edu>

Hi Leonardo

On 04/19/2012 04:18 PM, Leonardo Gregory Brunnet wrote:
> Hi Gus,
>
> Thanks for the answer.
>
> Yes, I am surprised that it is using four processors.
> As previously replied to David the argument used in the qsub script was
> ...
> #PBS -l ncpus=1
> ...

Somebody may correct me, but I think ncpus is a Moab thing,
which may or may not work right with Torque+Maui.
If you search this mailing list you will find other postings
about ncpus.

Here we don't use ncpus.
We stick to the 'nodes=X:ppn=Y' syntax.
It works for us.

> and I suppose this is correct.
> But in fact I don't know the difference
> between this one
> above and
> #PBS -l nodes=1
>
> I have also checked that in maui.cfg there is no specification for
>
> JOBNODEMATCHPOLICY
>
> but, in fact I don't know what is the default. If EXACTNODE is the default
> I should explicitly add a line to maui.cfg, correct?
>

Check JOBNODEMATCHPOLICY in the Maui Admin guide, although it
doesn't state the default.

http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php

You can add the line with your option for JOBNODEMATCHPOLICY
to maui.cfg and restart maui.
We use EXACTNODE here.

Gus Correa

From leon at if.ufrgs.br Thu Apr 19 16:25:55 2012
From: leon at if.ufrgs.br (Leonardo Gregory Brunnet)
Date: Thu, 19 Apr 2012 19:25:55 -0300
Subject: [torqueusers] pbsnodes reports the same job running many times
In-Reply-To: <4F907EC8.6030806@ldeo.columbia.edu>
References: <4F8F3FAF.5000104@if.ufrgs.br>
 <4F9032EC.20909@ldeo.columbia.edu>
 <4F90731D.809@if.ufrgs.br>
 <4F907EC8.6030806@ldeo.columbia.edu>
Message-ID: <4F9090F3.1020707@if.ufrgs.br>

Hi Gus,

Problem solved using simply "nodes=X".
Thanks for all suggestions!

Leonardo
P.S. We never had Moab here... ncpus appeared probably
from some foreign script ;) .

On 19-04-2012 18:08, Gus Correa wrote:
> Hi Leonardo
>
> On 04/19/2012 04:18 PM, Leonardo Gregory Brunnet wrote:
>
>> Hi Gus,
>>
>> Thanks for the answer.
>>
>> Yes, I am surprised that it is using four processors.
>> As previously replied to David the argument used in the qsub script was
>> ...
>> #PBS -l ncpus=1
>> ...
>>
> Somebody may correct me, but I think ncpus is a Moab thing,
> which may or may not work right with Torque+Maui.
> If you search this mailing list you will find other postings
> about ncpus.
>
> Here we don't use ncpus.
> We stick to the 'nodes=X:ppn=Y' syntax.
> It works for us.
>
>> and I suppose this is correct. But in fact I don't know the difference
>> between this one
>> above and
>> #PBS -l nodes=1
>>
>> I have also checked that in maui.cfg there is no specification for
>>
>> JOBNODEMATCHPOLICY
>>
>> but, in fact I don't know what is the default. If EXACTNODE is the default
>> I should explicitly add a line to maui.cfg, correct?
>>
> Check JOBNODEMATCHPOLICY in the Maui Admin guide, although it
> doesn't state the default.
>
> http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php
>
> You can add the line with your option for JOBNODEMATCHPOLICY
> to maui.cfg and restart maui.
> We use EXACTNODE here.
>
> Gus Correa

--
Leonardo Gregory Brunnet E-mail: leon at if.ufrgs.br
Instituto de Fisica - UFRGS http://pcleon.if.ufrgs.br
91501-970 Porto Alegre, RS, BRASIL Phone: (51) 33 08 72 51
FAX +55 51 33 08 72 86
C.P. 15051
Linux User: 39314

From gus at ldeo.columbia.edu Thu Apr 19 16:47:35 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Thu, 19 Apr 2012 18:47:35 -0400
Subject: [torqueusers] pbsnodes reports the same job running many times
In-Reply-To: <4F9090F3.1020707@if.ufrgs.br>
References: <4F8F3FAF.5000104@if.ufrgs.br>
 <4F9032EC.20909@ldeo.columbia.edu>
 <4F90731D.809@if.ufrgs.br>
 <4F907EC8.6030806@ldeo.columbia.edu>
 <4F9090F3.1020707@if.ufrgs.br>
Message-ID: <4F909607.9070400@ldeo.columbia.edu>

On 04/19/2012 06:25 PM, Leonardo Gregory Brunnet wrote:
> Hi Gus,
>
> Problem solved using simply "nodes=X".
> Thanks for all suggestions!
>
> Leonardo
> P.S. We never had Moab here... ncpus appeared probably
> from some foreign script ;) .
>

We also use Torque+Maui here.
I don't remember exactly, but ncpus may work under the barebones
Torque/PBS scheduler pbs_sched, besides Moab.
'ncpus' seems to be a bit troublesome with Maui, though.
The easy solution is to ask the users to stick to the 'nodes=X' syntax.
In a more elaborate solution you can write a qsub wrapper to replace
'ncpus' with the 'nodes' and 'ppn' syntax.

Gus Correa

From sean.reilly at ersa.edu.au Thu Apr 19 23:54:34 2012
From: sean.reilly at ersa.edu.au (Sean Reilly)
Date: Fri, 20 Apr 2012 05:54:34 +0000 (UTC)
Subject: [torqueusers] [Patch] GPUs by the way of GRES
References: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net>
 <20309.169.908792.976630@gargle.gargle.HOWL>
Message-ID:

Hi Folks

We have just begun testing with this Patch - Great Work by Jonathan Michalon !

So far it is doing everything we need.
Roland - as you still need to lock the assigned GPUs down to a particular
user and job ID on the backend nodes (we do this with cuda_wrapper), there
is no real need for Maui to specify the particular GPU, e.g. gpu/2.

We use both the Torque
#PBS -l gpus=1
and the Maui
#PBS -W x=GRES:gpu@1

The Maui side plus the patch takes care of the number of GPUs being available.
Torque gives you the environment variable PBS_RESOURCE_GRES=gpus=1

The prologue script is responsible for assigning an available GPU to this
user and job ID.
When the job finishes or is killed, the epilogue releases the GPU back into
the pool.
These two scripts should be aware of the GPUs available and in use at any
time, as Maui has ensured they should be available.
*If not, the prologue and epilogue can send the admin an error email so it
can be checked.*

It's early days for us, but so far so good.

But yes, it would be nice if Maui could tell the backend nodes about the
number of GPUs assigned (and possibly the device number), eliminating the
need for the extra
#PBS -l gpus=1
setting. But not a show stopper.

Regards
Sean

From sean.reilly at ersa.edu.au Fri Apr 20 00:16:00 2012
From: sean.reilly at ersa.edu.au (Sean Reilly)
Date: Fri, 20 Apr 2012 15:46:00 +0930
Subject: [torqueusers] [Patch] GPUs by the way of GRES
In-Reply-To: <20309.169.908792.976630@gargle.gargle.HOWL>
References: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net>
 <20309.169.908792.976630@gargle.gargle.HOWL>
Message-ID: <4F90FF20.4000308@ersa.edu.au>

Hi Folks

Here at eRSA we have just begun testing with this Patch - Great Work by
Jonathan Michalon !

So far it is doing everything we need.
Roland - as you still need to lock the assigned GPUs down to a particular
user and job ID on the backend nodes (we do this with cuda_wrapper), there
is no real need for Maui to specify the particular GPU, e.g. gpu/2 or gpu_2
(apart from it just being a bit cleaner).

We use both the Torque and Maui directives:
torque
#PBS -l gpus=1
maui
#PBS -W x=GRES:gpu@1

The Maui side directive plus the patch takes care of the number of GPUs
actually being available.
Torque gives you the environment variable PBS_RESOURCE_GRES=gpus=1

The prologue script is responsible for assigning an available GPU to this
user and job ID, via wrapper_init.
When the job finishes or is killed, the epilogue releases the GPU back into
the pool, via wrapper_terminate.
These two scripts should be aware of the GPUs available and in use at any
time, as Maui has ensured they should be available.
*If not, the prologue and epilogue can send admins an error email so it
can be checked.*

It's still early days for us, but so far so good.

But yes, it would be nice if Maui could tell the backend nodes about the
number of GPUs assigned (and possibly the device number), eliminating the
need for the extra
#PBS -l gpus=1
setting. But not a show stopper.

Regards
Sean

On 06/03/12 04:36, rf at q-leap.de wrote:
>>>>>> "Jonathan" == Jonathan Michalon writes:
> Hi Jonathan,
>
> while your patch adds some functionality to count allocated GPUs as
> a GRES, it lacks the important functionality to tell the job which GPUs
> are available for it. If latest torque 2.5.x is built with GPU support,
> you have the option to specify a nodes spec like "-l nodes=1:gpus=1" and
> within the running job you can check $GPUFILE what GPUs you're
> allocated. Now the problem is that a job with a "-l nodes=1:gpus=1"
> specification won't be started with maui even if it has your patch. On
> the other hand, using your "-W x=GRES:gpu@1" spec (without a "-l
> nodes=1:gpus=1" spec) makes the job run, but
> it doesn't have an idea which GPU to use.
> Is there an easy way to extend your patch, so that maui will make a job
> run with the "-l nodes=1:gpus=1" spec?
>
> Cheers,
>
> Roland
>
> Jonathan> Hi Maui folks, GPUs in Maui are a long standing
> Jonathan> problem. Last year a patch was sent by Mariusz Mamoński
> Jonathan> [1], which works based on GRES parameters. I've just made
> Jonathan> GPUs kind of working, by enhancing that patch. Please find
> Jonathan> attached the resulting patch, which works well for Maui
> Jonathan> 3.3.1. It defines a special GRES named "gpu" which works
> Jonathan> as expected on my test cases.
>
> Jonathan> Note that GRES behaviour seems quite confused as sometimes
> Jonathan> they are mentioned as consumable. This patch annihilates
> Jonathan> this behaviour, for the needs of GPUs.
>
> Jonathan> To use the patch: get the sources of maui-3.3.1 and patch
> Jonathan> them: patch -p1 < ../Patch-for-gpu-GRES.patch then compile
> Jonathan> as usual.
>
> Jonathan> You have to configure the GPUs in maui.cfg:
> Jonathan> NODECFG[nodename] GRES=gpu:2
>
> Jonathan> Then when queuing jobs you can request GPUs with (Torque
> Jonathan> syntax): qsub -W x=GRES:gpu@1
>
> Jonathan> I hope this helps, please test this and enhance to your
> Jonathan> needs!
>
> Jonathan> [1]
> Jonathan> http://www.supercluster.org/pipermail/mauiusers/2011-April/004622.html
>
> Jonathan> Regards,
>
> Jonathan> PS. This is the second attempt to send the mail?
>
> Jonathan> -- Jonathan Michalon IT student in Strasbourg
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

-- 
*Sean Reilly*
Systems Administrator & Applications Support Officer
eResearchSA
Phone : +61 8 8313 8352
Mobile: +61 450 840 246
From rsvancara at wsu.edu Fri Apr 20 08:19:59 2012
From: rsvancara at wsu.edu (Svancara, Randall)
Date: Fri, 20 Apr 2012 14:19:59 +0000
Subject: [torqueusers] pbsnodes reports the same job running many times
In-Reply-To: <4F909607.9070400@ldeo.columbia.edu>
References: <4F8F3FAF.5000104@if.ufrgs.br> <4F9032EC.20909@ldeo.columbia.edu>
 <4F90731D.809@if.ufrgs.br> <4F907EC8.6030806@ldeo.columbia.edu>
 <4F9090F3.1020707@if.ufrgs.br>,<4F909607.9070400@ldeo.columbia.edu>
Message-ID:

Sent from my Verizon Wireless 4GLTE smartphone

----- Reply message -----
From: "Gus Correa"
To: "Torque Users Mailing List"
Subject: [torqueusers] pbsnodes reports the same job running many times
Date: Thu, Apr 19, 2012 3:47 pm

On 04/19/2012 06:25 PM, Leonardo Gregory Brunnet wrote:
> Hi Gus,
>
> Problem solved using simply "nodes=X".
> Thanks for all suggestions!
>
> Leonardo
> P.S. We never had Moab here... ncpus appeared probably
> from some foreign script ;) .
>

We also use Torque+Maui here.
I don't remember exactly, but ncpus may work under
the barebones Torque/PBS scheduler pbs_sched, besides Moab.
'ncpus' seems to be a bit troublesome with Maui, though.
The easy solution is to ask the users to
stick to the 'nodes=X' syntax.
In a more elaborate solution you can write a qsub wrapper
to replace 'ncpus' with the 'nodes' and 'ppn' syntax.

Gus Correa

> On 19-04-2012 18:08, Gus Correa wrote:
>> Hi Leonardo
>>
>> On 04/19/2012 04:18 PM, Leonardo Gregory Brunnet wrote:
>>
>>> Hi Gus,
>>>
>>> Thanks for the answer.
>>>
>>> Yes, I am surprised that it is using four processors.
>>> As previously replied to David the argument used in the qsub script was >>> ... >>> #PBS -l ncpus=1 >>> ... >>> >> Somebody may correct me, but I think ncpus is a Moab thing, >> which may or may not work right with Torque+Maui. >> If you search this mailing list you will find other postings >> about ncpus. >> >> Here we don't use ncpus. >> We stick to the 'nodes=X:ppn=Y' syntax. >> It works for us. >> >> >>> and I suppose this is correct. But in fact I don't know the difference >>> between this one >>> above and >>> #PBS -l nodes=1 >>> >>> I have also checked that in maui.cfg there is no specification for >>> >>> JOBNODEMATCHPOLICY >>> >>> but, in fact I don't know what is the default. If EXACTNODE is the default >>> I should explicitely add a line to maui.cfg, correct? >>> >>> >> Check JOBNODEMATCHPOLICY in the Maui Admin guide, although it >> doesn't tell the default. >> >> http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php >> >> You can add the line with your option for JOBNODEMATCHPOLICY >> to maui.cfg and restart maui. >> We use EXACTNODE here. >> >> Gus Correa >> >> >>> Leonardo >>> >>> On 19-04-2012 12:44, Gus Correa wrote: >>> >>>> Hi Leonardo >>>> >>>> Not sure if I understood the problem right. >>>> I guess the job is legitimate and running, >>>> but it surprises you that it is using four processors, >>>> right? >>>> >>>> Did the user request four processors, perhaps, >>>> even though he/she is running a serial job? >>>> #PBS -l nodes=1:ppn=4 >>>> This may be reasonable, say, if his/her job needs a lot >>>> of RAM, but the job is serial >>>> [or if it is Matlab ... the king of memory-greediness ...] >>>> >>>> Also, beware of JOBNODEMATCHPOLICY in Maui [maui.cfg]: >>>> http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php >>>> If set to EXACTNODE full nodes will be allocated. 
>>>> >>>> I hope this helps, >>>> Gus Correa >>>> >>>> On 04/18/2012 06:26 PM, Leonardo Gregory Brunnet wrote: >>>> >>>> >>>>> Dear All, >>>>> >>>>> In a fresh installed torque/maui cluster the server reports >>>>> repeated execution of a job in a given node. (There is no job running >>>>> mpi)!. >>>>> >>>>> The output for pbsnodes for one given node gives: >>>>> >>>>> node131 >>>>> state = job-exclusive >>>>> np = 4 >>>>> properties = quadcore >>>>> ntype = cluster >>>>> jobs = 0/78898.master.cluster.XX.XX.XX, >>>>> 1/78898.master.cluster.XX.XX.XX, 2/78898.master.cluster.XX.XX.XX, >>>>> 3/78898.master.XX.XX.XX >>>>> status = >>>>> rectime=1334786811,varattr=,jobs=78898.master.cluster.if.ufrgs.br,state=free,netload=2914588064,gres=,loadave=1.00,ncpus=4,physmem=3985876kb,availmem=4649240kb,totmem=5062188kb,idletime=535832,nusers=2,nsessions=2,sessions=2804 >>>>> 8224,uname=Linux node131 2.6.23-1-amd64 #1 SMP Fri Oct 12 23:45:48 UTC >>>>> 2007 x86_64,opsys=linux >>>>> gpus = 0 >>>>> >>>>> But, if we log in that node we will see what was expected, a single job. >>>>> Since the torque server (or maui) "believes" all cpu's of that node are >>>>> working, >>>>> no other jobs are sent. Any clues ? >>>>> >>>>> Thanks for the help! >>>>> >>>>> Leonardo >>>>> >>>>> Below, you find the output for >>>>> # qmgr -c "p s" >>>>> >>>>> # >>>>> # Create queues and set their attributes. 
>>>>> # >>>>> # >>>>> # Create and define queue padrao >>>>> # >>>>> create queue padrao >>>>> set queue padrao queue_type = Execution >>>>> set queue padrao resources_default.nodes = 7 >>>>> set queue padrao resources_default.walltime = 01:00:00 >>>>> set queue padrao max_user_run = 5 >>>>> set queue padrao enabled = True >>>>> set queue padrao started = True >>>>> # >>>>> # Create and define queue um_mes >>>>> # >>>>> create queue um_mes >>>>> set queue um_mes queue_type = Execution >>>>> set queue um_mes resources_max.nodes = 7 >>>>> set queue um_mes resources_default.nodes = 7 >>>>> set queue um_mes resources_default.walltime = 720:00:00 >>>>> set queue um_mes max_user_run = 5 >>>>> set queue um_mes enabled = True >>>>> set queue um_mes started = True >>>>> # >>>>> # Create and define queue batch >>>>> # >>>>> create queue batch >>>>> set queue batch queue_type = Execution >>>>> set queue batch resources_default.nodes = 1 >>>>> set queue batch resources_default.walltime = 01:00:00 >>>>> set queue batch enabled = True >>>>> set queue batch started = True >>>>> # >>>>> # Create and define queue um_dia >>>>> # >>>>> create queue um_dia >>>>> set queue um_dia queue_type = Execution >>>>> set queue um_dia resources_max.nodes = 7 >>>>> set queue um_dia resources_default.nodes = 7 >>>>> set queue um_dia resources_default.walltime = 24:00:00 >>>>> set queue um_dia max_user_run = 7 >>>>> set queue um_dia enabled = True >>>>> set queue um_dia started = True >>>>> # >>>>> # Create and define queue uma_semana >>>>> # >>>>> create queue uma_semana >>>>> set queue uma_semana queue_type = Execution >>>>> set queue uma_semana resources_max.nodes = 7 >>>>> set queue uma_semana resources_default.nodes = 7 >>>>> set queue uma_semana resources_default.walltime = 168:00:00 >>>>> set queue uma_semana max_user_run = 5 >>>>> set queue uma_semana enabled = True >>>>> set queue uma_semana started = True >>>>> # >>>>> # Create and define queue route >>>>> # >>>>> create queue route >>>>> 
set queue route queue_type = Route
>>>>> set queue route route_destinations = padrao
>>>>> set queue route route_destinations += padrao2
>>>>> set queue route enabled = True
>>>>> set queue route started = True
>>>>> #
>>>>> # Set server attributes.
>>>>> #
>>>>> set server scheduling = True
>>>>> set server acl_hosts = master.cluster.XX.XX.XX
>>>>> set server acl_hosts += clusterapg
>>>>> set server managers = root@master.cluster.XX.XX.XX
>>>>> set server operators = root@master.cluster.XX.XX.XX
>>>>> set server default_queue = padrao
>>>>> set server log_events = 511
>>>>> set server mail_from = adm
>>>>> set server scheduler_iteration = 600
>>>>> set server node_check_rate = 150
>>>>> set server tcp_timeout = 6
>>>>> set server mom_job_sync = True
>>>>> set server keep_completed = 300
>>>>> set server next_job_number = 79033
>>>>>
>
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
From Michael_Stevens at affymetrix.com Thu Apr 19 10:43:07 2012
From: Michael_Stevens at affymetrix.com (Stevens, Michael)
Date: Thu, 19 Apr 2012 09:43:07 -0700
Subject: [torqueusers] pbs_sched cores - repost
Message-ID:

I had posted this question a few weeks ago, and received no response.
Would it be more appropriate to post this to -dev?

I am running a 115 node cluster using torque 2.5.7 under CentOS 6.2.
This cluster is in turn running on a Vmware ESX 4.0 cluster; the idea here being that we can use the physical resources of the torque cluster when no jobs are running. I am seeing crashes of pbs_sched when the cluster gets busy. Following is some data I've been able to assemble thus far: /var/log/messages Apr 19 08:53:31 node103 dhclient[1540]: DHCPREQUEST on eth0 to 10.80.101.10 port 67 (xid=0x6c91cbd5) Apr 19 08:53:31 cluster1 dhcpd: DHCPREQUEST for 10.80.101.123 from 00:50:56:b4:7b:a3 via eth0 Apr 19 08:53:31 cluster1 dhcpd: DHCPACK on 10.80.101.123 to 00:50:56:b4:7b:a3 via eth0 Apr 19 08:53:31 node103 dhclient[1540]: DHCPACK from 10.80.101.10 (xid=0x6c91cbd5) Apr 19 08:53:33 node103 ypbind: NIS domain: affymetrix.com, NIS server: nis2 Apr 19 08:53:33 node103 dhclient[1540]: bound to 10.80.101.123 -- renewal in 16219 seconds. Apr 19 08:54:09 node58 pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 707 in client_to_svr - errno:115 Operation now in progress Apr 19 08:54:15 cluster1 abrt[3555]: saved core dump of pid 2911 (/usr/sbin/pbs_sched) to /var/spool/abrt/ccpp-2012-04-19-08:54:14-2911.new/coredump (3100672 bytes) Apr 19 08:54:15 cluster1 abrtd: Directory 'ccpp-2012-04-19-08:54:14-2911' creation detected Apr 19 08:54:21 cluster1 abrtd: Package 'torque-scheduler' isn't signed with proper key Apr 19 08:54:21 cluster1 abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2012-04-19-08:54:14-2911 (res:2), deleting Apr 19 08:55:46 node40 pbs_mom: LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 726 in client_to_svr - errno:115 Operation now in progress Apr 19 08:55:47 node2 pbs_mom: LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 746 in client_to_svr - errno:115 Operation now in progress scheduler log 04/19/2012 08:52:55;0040; pbs_sched;Job;302539.cluster1.cluster.affymetrix.com;Job Run 04/19/2012 08:52:55;0040; 
pbs_sched;Job;302540.cluster1.cluster.affymetrix.com;Job Run 04/19/2012 08:52:55;0040; pbs_sched;Job;302541.cluster1.cluster.affymetrix.com;Job Run 04/19/2012 08:52:55;0040; pbs_sched;Job;302542.cluster1.cluster.affymetrix.com;Job Run 04/19/2012 08:52:55;0040; pbs_sched;Job;302543.cluster1.cluster.affymetrix.com;Job Run 04/19/2012 08:52:55;0040; pbs_sched;Job;302544.cluster1.cluster.affymetrix.com;Job Run 04/19/2012 08:52:55;0080; pbs_sched;Svr;main;brk point 38178816 04/19/2012 08:52:58;0040; pbs_sched;Job;302545.cluster1.cluster.affymetrix.com;Job Run 04/19/2012 08:53:08;0040; pbs_sched;Job;302546.cluster1.cluster.affymetrix.com;Job Run 04/19/2012 08:53:10;0040; pbs_sched;Job;302547.cluster1.cluster.affymetrix.com;Job Run 04/19/2012 08:53:14;0040; pbs_sched;Job;302548.cluster1.cluster.affymetrix.com;Job Run 04/19/2012 08:53:18;0040; pbs_sched;Job;302549.cluster1.cluster.affymetrix.com;Job Run 04/19/2012 08:53:24;0040; pbs_sched;Job;302550.cluster1.cluster.affymetrix.com;Job Run 04/19/2012 09:14:01;0002; pbs_sched;Svr;Log;Log opened 04/19/2012 09:14:01;0002; pbs_sched;Svr;TokenAct;Account file /var/lib/torque/sched_priv/accounting/20120419 opened 04/19/2012 09:14:01;0002; pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 4707 04/19/2012 09:14:54;0040; pbs_sched;Job;302552.cluster1.cluster.affymetrix.com;Job Run gdb of the crash file [root at cluster1 sched_priv]# gdb -e /usr/sbin/pbs_sched -c core.2911 GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6) Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: . 
[New Thread 2911]
Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/23/1bd9599ad974226f19adfdc4dae3691396c81d
Reading symbols from /usr/lib64/libtorque.so.2.0.0...Reading symbols from /usr/lib/debug/usr/lib64/libtorque.so.2.0.0.debug...done.
done.
Loaded symbols for /usr/lib64/libtorque.so.2.0.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_dns.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Core was generated by `/usr/sbin/pbs_sched -d /var/lib/torque -a 600'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003ff4a13c44 in pbs_rescquery (c=0, resclist=<value optimized out>, num_resc=<value optimized out>,
    available=0x7fffba82910c, allocated=0x7fffba829108, reserved=0x7fffba829104, down=0x7fffba829100)
    at ../Libifl/pbsD_resc.c:215
215         *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64
(gdb) bt
#0  0x0000003ff4a13c44 in pbs_rescquery (c=0, resclist=<value optimized out>, num_resc=<value optimized out>,
    available=0x7fffba82910c, allocated=0x7fffba829108, reserved=0x7fffba829104, down=0x7fffba829100)
    at ../Libifl/pbsD_resc.c:215
#1  0x000000000040c8d6 in ?? ()
#2  0x00007fffba829100 in ?? ()
#3  0x0000000000000000 in ??
() (gdb) The last few lines of strace read(8, "+2+1+0+0+6+0", 262144) = 12 write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}]) read(8, "+2+1+0+0+6+0", 262144) = 12 write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}]) read(8, "+2+1+0+0+6+0", 262144) = 12 write(8, "+2+12+51+9Scheduler+12+252+11des"..., 62) = 62 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}]) read(8, "+2+1+0+0+6+0", 262144) = 12 write(8, "+2+12+51+9Scheduler+12+272+11des"..., 64) = 64 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}]) read(8, "+2+1+0+0+6+0", 262144) = 12 write(8, "+2+12+51+9Scheduler+12+192+11des"..., 54) = 54 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}]) read(8, "+2+1+0+0+6+0", 262144) = 12 write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}]) read(8, "+2+1+0+0+6+0", 262144) = 12 write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}]) read(8, "+2+1+0+0+6+0", 262144) = 12 write(8, "+2+12+51+9Scheduler+12+232+11des"..., 60) = 60 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}]) read(8, "+2+1+0+0+6+0", 262144) = 12 write(8, "+2+12+24+9Scheduler+0+12+13nodes"..., 42) = 42 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}]) read(8, "+2+1+0+0+9+1+1+0+0+0", 262144) = 20 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 write(8, "+2+12+11+9Scheduler+2+22+3830255"..., 124) = 124 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}]) read(8, "+2+1+0+0+1", 262144) = 10 write(8, "+2+12+15+9Scheduler2+38302551.cl"..., 67) = 67 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 0 
(Timeout) write(8, "+2+12+11+9Scheduler+2+22+3830255"..., 139) = 139 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 0 (Timeout) write(8, "+2+12+24+9Scheduler+0+12+13nodes"..., 42) = 42 poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}]) read(8, "+2+1+0+0+1", 262144) = 10 --- SIGSEGV (Segmentation fault) @ 0 (0) --- If there is any other information I can provide, please let me know as this is reproducible. -- Mike Stevens Senior UNIX Administrator Affymetrix | 3420 Central Expressway | Santa Clara, CA 95051 Tel: 408-731-5804 | Cell: 408-507-5738 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120419/2f50cdc3/attachment-0001.html From knielson at adaptivecomputing.com Fri Apr 20 09:02:42 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Fri, 20 Apr 2012 09:02:42 -0600 Subject: [torqueusers] can user limit max jobs running In-Reply-To: <4F8EEE6B.2020503@ldeo.columbia.edu> References: <4F8EEE6B.2020503@ldeo.columbia.edu> Message-ID: On Wed, Apr 18, 2012 at 10:40 AM, Gus Correa wrote: > On 04/18/2012 12:20 PM, Ashish Agarwal wrote: > > The slot limit is exactly what I'm looking for, but: > > > > * It is not mentioned in our man pages. I read now the man page on > > clusterresources.com , which is quite > > different than our installed man page. > > > > * It works on one of our clusters, but on a second one I get "qsub: Bad > > Job Array Request." Does this feature have to be enabled? Or is it only > > available from a certain version? The cluster it works on is qsub > > version: 2.5.10, and the one it does not work on is version: 2.4.12. > > > > > > As far as I can tell, by looking at 2.4.11 and 2.5.9, > this feature is absent in 2.4 and isn't mentioned on > its man pages, whereas the 2.5 man page has the > "An optional slot limit ... " additional text > under '-t array_request'. 
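For reference, the slot limit discussed here is written as a percent sign
plus a number at the end of the array request: per the Torque 2.5 qsub man
page, "-t 0-299%5" lets at most five members of the array run at once. A
small sketch that only builds the argument list; the helper names
array_request and qsub_command are hypothetical, and nothing is submitted:

```python
def array_request(first, last, slot_limit=None):
    """Build the value passed to qsub -t, e.g. '0-99' or '0-99%5'."""
    if first < 0 or last < first:
        raise ValueError("need 0 <= first <= last")
    spec = f"{first}-{last}"
    if slot_limit is not None:
        if slot_limit < 1:
            raise ValueError("slot_limit must be at least 1")
        # The slot limit must come last, separated by a percent sign.
        spec += f"%{slot_limit}"
    return spec


def qsub_command(script, first, last, slot_limit=None):
    """Return the argument list for a job-array submission (not executed)."""
    return ["qsub", "-t", array_request(first, last, slot_limit), script]
```

On a 2.4.x server the form without the slot limit is the one to use, since
the "%N" suffix is what triggers "qsub: Bad Job Array Request" there.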

Gus, 2.4.11 does not have the slot limit capability. It was added in 2.5.

Ken
From dbeer at adaptivecomputing.com Fri Apr 20 09:04:56 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 20 Apr 2012 09:04:56 -0600
Subject: [torqueusers] pbs_sched cores - repost
In-Reply-To:
References:
Message-ID:

Mike,

So pbs_sched is crashing for you? Have you considered setting up Maui? There
are very few shops that run pbs_sched and probably very few people that have
experience to help you with problems you might encounter using it. Maui has
a fairly large user base, and a lot of things are similar in Maui and Moab,
giving you two large user bases that can potentially help you out. I
personally know almost nothing about pbs_sched and can be of almost no help.

David

On Thu, Apr 19, 2012 at 10:43 AM, Stevens, Michael <
Michael_Stevens at affymetrix.com> wrote:

> I had posted this question a few weeks ago, and received no response.
> Would it be more appropriate to post this to -dev?
>
> I am running a 115 node cluster using torque 2.5.7 under CentOS 6.2. This
> cluster is in turn running on a Vmware ESX 4.0 cluster; the idea here being
> that we can use the physical resources of the torque cluster when no jobs
> are running.
>
> I am seeing crashes of pbs_sched when the cluster gets busy. Following is
> some data I've been able to assemble thus far:
>
> [...]
>
> If there is any other information I can provide, please let me know as
> this is reproducible.
>
> --
> Mike Stevens
> Senior UNIX Administrator
> Affymetrix | 3420 Central Expressway | Santa Clara, CA 95051
> Tel: 408-731-5804 | Cell: 408-507-5738
>

-- 
David Beer | Software Engineer
Adaptive Computing
From Michael_Stevens at affymetrix.com Fri Apr 20 09:11:26 2012
From: Michael_Stevens at affymetrix.com (Stevens, Michael)
Date: Fri, 20 Apr 2012 08:11:26 -0700
Subject: [torqueusers] pbs_sched cores - repost
In-Reply-To: <4F917AD2.10208@scss.tcd.ie>
References: <4F917AD2.10208@scss.tcd.ie>
Message-ID:

> Have you investigated if this is an SELinux policy issue

No, I had not, since the crash only seems to happen under load. SELinux is
disabled on all the cluster systems.

--
Mike
From Michael_Stevens at affymetrix.com Fri Apr 20 09:18:18 2012
From: Michael_Stevens at affymetrix.com (Stevens, Michael)
Date: Fri, 20 Apr 2012 08:18:18 -0700
Subject: [torqueusers] pbs_sched cores - repost
In-Reply-To:
References:
Message-ID:

> Mike,
> So pbs_sched is crashing for you? Have you considered setting up Maui?
> There are very few shops that run pbs_sched
> and probably very few people that have experience to help you with
> problems you might encounter using it.

I had considered it, but I thought that I'd try to solve this issue before
migrating to a new scheduler. I guess that Maui moves up the to-do list a
bit ;).

--
Mike
From rsvancara at wsu.edu Fri Apr 20 09:38:37 2012
From: rsvancara at wsu.edu (Svancara, Randall)
Date: Fri, 20 Apr 2012 15:38:37 +0000
Subject: [torqueusers] NUMA -- A first try
In-Reply-To: <20120419210119.GI9750@lbl.gov>
References: <1F880D7A2494B346B5AB96481EAE704A15A536@EXMB-03.ad.wsu.edu>
 <1F880D7A2494B346B5AB96481EAE704A15A744@EXMB-03.ad.wsu.edu>
 <1F880D7A2494B346B5AB96481EAE704A15B97C@EXMB-03.ad.wsu.edu>
 <1F880D7A2494B346B5AB96481EAE704A15BA3F@EXMB-03.ad.wsu.edu>
 <20120419210119.GI9750@lbl.gov>
Message-ID: <1F880D7A2494B346B5AB96481EAE704A15C540@EXMB-03.ad.wsu.edu>

Thanks for the information. I will apply the "Michael Jennings" methodology
to see what I can come up with.
Thanks,

Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Michael Jennings
Sent: Thursday, April 19, 2012 2:01 PM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] NUMA -- A first try

On Wednesday, 18 April 2012, at 23:29:59 (+0000), Svancara, Randall wrote:

> Just a heads up, the reason NUMA was not being built in the RPM is
> because the buildutils/torque.spec.in file does not include the
> %{ac_with_numa} parameter in the ./configure section. Otherwise a
> regular build would work fine.
>
> 187c184
> < --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} %{ac_with_numa} \
> ---
> > --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} \

That's one reason. The other reason is that "--with numa" isn't
supported by "make rpm" yet. Only a small subset of the available
--with and --without rpmbuild options are supported by "make rpm" as
this is not the standard, "accepted" method of building RPM packages.

The "make rpm" system is there solely for people who would, for
whatever reason, rather specify all their customizations at
./configure time instead of rpmbuild time. Its use is not recommended
in the general case. Instead, use rpmbuild directly, e.g.:

    ./configure
    make dist
    rpmbuild -ta --with numa --with blcr torque-4.0.1.tar.gz

If you really do want to use "make rpm" anyway, support will need to
be added to configure.ac for it. Look for occurrences of RPM_AC_OPTS
in configure.ac to see how the existing support was done.
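For anyone scripting the build, the rpmbuild invocation above can be assembled programmatically. The following is only a minimal Python sketch: it assumes nothing beyond rpmbuild's standard -ta, --with and --without options, and the helper name rpmbuild_command is made up for illustration (it is not part of TORQUE or rpm).

```python
def rpmbuild_command(tarball, with_features=(), without_features=()):
    """Return the argv list for building an RPM straight from a tarball.

    Hypothetical helper: emits one --with/--without pair per feature,
    mirroring how the options are passed on the rpmbuild command line.
    """
    argv = ["rpmbuild", "-ta"]
    for feature in with_features:
        argv += ["--with", feature]
    for feature in without_features:
        argv += ["--without", feature]
    argv.append(tarball)
    return argv


# Reproduces the command line quoted in the message.
print(" ".join(rpmbuild_command("torque-4.0.1.tar.gz",
                                with_features=["numa", "blcr"])))
```

The resulting list can be handed to subprocess.run() as-is, which avoids shell quoting problems if a tarball path contains spaces.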
> I couldn't exactly figure it out as I am still learning about spec files,
> but these lines may also need to change:
>
> 19c19
> < #%bcond_with blcr
> ---
> > %bcond_with blcr
> 25c25
> < #%bcond_with numa
> ---
> > %bcond_with numa
> 35,37d34
> < %bcond_without bclr
> < %bcond_without numa
> <

No, you don't want to do that. Those should not be on by default.

And you don't want to comment out macros using #. While some macros
can be disabled this way, many cannot (the most common troublemaker
being %patch). Either double the percent sign or remove it before
adding the pound sign: #%%macro or #macro

HTH,
Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From naveed at caltech.edu Fri Apr 20 15:37:43 2012
From: naveed at caltech.edu (Naveed Near-Ansari)
Date: Fri, 20 Apr 2012 14:37:43 -0700
Subject: [torqueusers] priority job failing to get reservation
Message-ID: <4F91D727.3080409@caltech.edu>

I know this isn't technically Torque, but I haven't seen any activity
on the Maui list and I thought there might be some overlap in users here.

Torque 2.5.9
Maui 3.3.1

I am having an issue with a priority job not getting a reservation.
When I set the reservation depth to 2, the second-priority job does get
a reservation, though.

The cluster has 3552 cores available to the queue the job was submitted
to; at the moment they are all in use. Since the job has the highest
priority, it should start reserving nodes (and it does try). When I
change RESERVATIONDEPTH to 2, the second-highest-priority job does get a
reservation, though this is a much smaller job.

Perhaps I am misunderstanding how these reservations work. Is there a
timeframe in which it has to reserve nodes?
We don't have a size limit on jobs and the cluster does have the
resources for this job. Does anyone know what may be going on here? We
have a workflow in which some people send in very large jobs and some
send small ones, so I would like to figure out what is happening. Do you
have any good strategies for dealing with this type of workflow?

Here is the checkjob output; as you can see, it isn't requesting any
resources other than cores. I have no idea where it is getting the idle
procs figure from, since none are actually idle. Perhaps it has to do
with reservable nodes? The idle procs figure tends to fluctuate over time.

checking job 213152

State: Idle
Creds: user:user group:group class:default qos:dedicated
WallTime: 00:00:00 of 1:12:00:00
SubmitTime: Fri Apr 6 03:35:23
  (Time Queued Total: 7:45:59 Eligible: 1:30:06)

Total Tasks: 1501

Req[0] TaskCount: 1501 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [default]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE PREEMPTEE DEDICATEDNODE
Attr: PREEMPTEE
PE: 1501.00 StartPriority: 144235
job cannot run in partition DEFAULT (insufficient idle procs available: 1056 < 1501)

Here are the relevant log entries:

04/06 03:35:24 MJobPReserve(213152,DEFAULT,ResCount,ResCountRej)
04/06 03:35:24 INFO: 3552 feasible tasks found for job 213152:0 in partition DEFAULT (1501 Needed)
04/06 03:35:24 ALERT: job 213152 cannot run in any partition
04/06 03:35:24 ALERT: cannot create new reservation for job 213152 (shape[1] 1501)
04/06 03:35:24 ALERT: cannot create new reservation for job 213152
04/06 03:35:24 ALERT: job '213152' cannot run (deferring job for 3600 seconds)
04/06 03:35:24 WARNING: cannot reserve priority job '213152'

--
Naveed Near-Ansari
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s Type: application/pkcs7-signature Size: 4887 bytes Desc: S/MIME Cryptographic Signature Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120420/e683b49d/attachment.bin From naveed at caltech.edu Fri Apr 20 18:13:01 2012 From: naveed at caltech.edu (Naveed Near-Ansari) Date: Fri, 20 Apr 2012 17:13:01 -0700 Subject: [torqueusers] [Mauiusers] priority job failing to get reservation In-Reply-To: References: <4F91D4DD.2010204@caltech.edu> Message-ID: <4F91FB8D.2020902@caltech.edu> On 04/20/2012 04:23 PM, Lyn Gerner wrote: > Naveed, > > It looks like your setup is only showing 1056 procs, not 3552: >> PE: 1501.00 StartPriority: 144235 >> job cannot run in partition DEFAULT (insufficient idle procs available: >> 1056 < 1501) > You might play w/diagnose -t (partition) and diagnose -j (job) to see > what they tell you. Also, you could try to explicitly make a > reservation for the job, and maybe then you could get info from > diagnose -r (though attempting the setres may give enough error info). > > Good luck, > Lyn Thanks for looking. I think it is configured for 3768 (i said 3552 because the queue it was sent to has that many available to it). i didn't see anything clear in either diagnose command. I attempted to create a reservation, but it failed. 
# setres -u ortega -d 4:00:00:00 TASKS==1501 ERROR: 'setres' failed ERROR: cannot select 1501 tasks for reservation for 3:13:33:56 ERROR: cannot select requested tasks for 'TASKS==1501' #diagnose -t Displaying Partition Status System Partition Settings: PList: DEFAULT PDef: DEFAULT Name Procs DEFAULT 3768 Partition Configured Up U/C Dedicated D/U Active A/U NODE---------------------------------------------------------------------------- DEFAULT 314 313 99.68% 297 94.89% 297 94.89% PROC---------------------------------------------------------------------------- DEFAULT 3768 3756 99.68% 3564 94.89% 3000 79.87% MEM---------------------------------------------------------------------------- DEFAULT 15156264 15107978 99.68% 14335282 94.89% 0 0.00% SWAP---------------------------------------------------------------------------- DEFAULT 30227950 30131665 99.68% 28590985 94.89% 1400704 4.65% DISK---------------------------------------------------------------------------- DEFAULT 314 313 99.68% 297 94.89% 0 0.00% Class/Queue State [ :]... DEFAULT [shared 3756:3756][debug 3756:3756][default 477:3756][gpu 3756:3756] #diagnose -j 220559 Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features 220559 Idle ALL 1501 ded 4:00:00:00 0 1501 ortega simons - 1:23:34:41 [NONE] [NONE] [NONE] >=0 >=0 NC0 [default:1] [default] -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4887 bytes Desc: S/MIME Cryptographic Signature Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120420/963ffcc5/attachment-0001.bin From naveed at caltech.edu Fri Apr 20 18:59:56 2012 From: naveed at caltech.edu (Naveed Near-Ansari) Date: Fri, 20 Apr 2012 17:59:56 -0700 Subject: [torqueusers] [Mauiusers] priority job failing to get reservation In-Reply-To: References: <4F91D4DD.2010204@caltech.edu> <4F91FB8D.2020902@caltech.edu> Message-ID: Thanks. 
The idle procs actually fluctuates: job cannot run in partition DEFAULT (insufficient idle procs available: 744 < 1501) I don't think it is mapping to procs since there are 628 procs on the system (314 nodes * 2 procs) The QOS does request dedicated nodes. I have seen no issue with this on all other jobs. When someone requests 12 tasks they get 1 12 core machine. I think i may be misunderstanding how priority reservations work. Does it try to find available nodes to reserve within a timeframe and no procs will be availble within that time frame, or is it supposed to look out forever to find the procs available. We have a lot of long running processes, so if it is looking within a time frame (say a month), it may not be able to find the resources. If this is the case, is it possible to change how far ahead it looks? I couldn't find anything in the documentation that describes specifically how it finds resources for priority based reservations. Naveed Near-Ansari On Apr 20, 2012, at 5:50 PM, Lyn Gerner wrote: > So does the checkjob for 220559 still show the "insufficient idle > procs available: 1056 < 1501" msg? > > Seems like somehow the TASKS request is not mapping to cores (of which > I surmise you have 3576) but rather procs (which in the above you have > 1056). > > I am really grasping at straws on this: is the "ded" QOS requesting > dedicated nodes, and you don't have enough? > > Not sure where else to tell you to look. > > Best of luck, > Lyn > > On 4/20/12, Naveed Near-Ansari wrote: >> >> >> On 04/20/2012 04:23 PM, Lyn Gerner wrote: >>> Naveed, >>> >>> It looks like your setup is only showing 1056 procs, not 3552: >>>> PE: 1501.00 StartPriority: 144235 >>>> job cannot run in partition DEFAULT (insufficient idle procs available: >>>> 1056 < 1501) >>> You might play w/diagnose -t (partition) and diagnose -j (job) to see >>> what they tell you. 
Also, you could try to explicitly make a >>> reservation for the job, and maybe then you could get info from >>> diagnose -r (though attempting the setres may give enough error info). >>> >>> Good luck, >>> Lyn >> >> Thanks for looking. >> >> I think it is configured for 3768 (i said 3552 because the queue it was >> sent to has that many available to it). i didn't see anything clear in >> either diagnose command. I attempted to create a reservation, but it >> failed. >> >> # setres -u ortega -d 4:00:00:00 TASKS==1501 >> ERROR: 'setres' failed >> ERROR: cannot select 1501 tasks for reservation for 3:13:33:56 >> ERROR: cannot select requested tasks for 'TASKS==1501' >> >> >> >> #diagnose -t >> Displaying Partition Status >> >> System Partition Settings: PList: DEFAULT PDef: DEFAULT >> >> Name Procs >> >> DEFAULT 3768 >> >> Partition Configured Up U/C Dedicated D/U >> Active A/U >> >> NODE---------------------------------------------------------------------------- >> DEFAULT 314 313 99.68% 297 94.89% >> 297 94.89% >> PROC---------------------------------------------------------------------------- >> DEFAULT 3768 3756 99.68% 3564 94.89% >> 3000 79.87% >> MEM---------------------------------------------------------------------------- >> DEFAULT 15156264 15107978 99.68% 14335282 94.89% >> 0 0.00% >> SWAP---------------------------------------------------------------------------- >> DEFAULT 30227950 30131665 99.68% 28590985 94.89% >> 1400704 4.65% >> DISK---------------------------------------------------------------------------- >> DEFAULT 314 313 99.68% 297 94.89% >> 0 0.00% >> >> Class/Queue State >> >> [ :]... 
>> >> DEFAULT [shared 3756:3756][debug 3756:3756][default 477:3756][gpu >> 3756:3756] >> >> >> >> #diagnose -j 220559 >> Name State Par Proc QOS WCLimit R Min User >> Group Account QueuedTime Network Opsys Arch Mem Disk >> Procs Class Features >> >> 220559 Idle ALL 1501 ded 4:00:00:00 0 1501 ortega >> simons - 1:23:34:41 [NONE] [NONE] [NONE] >=0 >=0 NC0 >> [default:1] [default] >> >> >> >> > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 203 bytes Desc: Message signed with OpenPGP using GPGMail Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120420/432609ea/attachment.bin From naveed at caltech.edu Fri Apr 20 21:16:30 2012 From: naveed at caltech.edu (Naveed Near-Ansari) Date: Fri, 20 Apr 2012 20:16:30 -0700 Subject: [torqueusers] [Mauiusers] priority job failing to get reservation In-Reply-To: References: <4F91D4DD.2010204@caltech.edu> <4F91FB8D.2020902@caltech.edu> Message-ID: Thanks, no recurring reservations at all. reservation policy is already set that way. RESERVATIONPOLICY CURRENTHIGHEST I have been having a dicksens of a time figuring out the best policy for our cluster. Lots of long jobs, some small some large. Naveed Near-Ansari On Apr 20, 2012, at 7:41 PM, Lyn Gerner wrote: > Yes, the idle nodes do fluctuate. If you have any recurring > reservations (say, for a weekly maintenance window), then it may not > be able to find a big enough window to run a large, 4-day job, on > dedicated nodes. > > You might also want to check to see if RESERVATIONPOLICY is set to > HIGHEST, to make sure that the job keeps its priority reservation, if > it ever gets to the top of the queue. > > Good luck, > Lyn > > On 4/20/12, Naveed Near-Ansari wrote: >> Thanks. 
>> >> The idle procs actually fluctuates: >> >> job cannot run in partition DEFAULT (insufficient idle procs available: 744 >> < 1501) >> >> I don't think it is mapping to procs since there are 628 procs on the system >> (314 nodes * 2 procs) >> >> The QOS does request dedicated nodes. I have seen no issue with this on all >> other jobs. When someone requests 12 tasks they get 1 12 core machine. >> >> I think i may be misunderstanding how priority reservations work. Does it >> try to find available nodes to reserve within a timeframe and no procs will >> be availble within that time frame, or is it supposed to look out forever >> to find the procs available. We have a lot of long running processes, so if >> it is looking within a time frame (say a month), it may not be able to find >> the resources. If this is the case, is it possible to change how far ahead >> it looks? >> >> I couldn't find anything in the documentation that describes specifically >> how it finds resources for priority based reservations. >> >> Naveed Near-Ansari >> >> >> On Apr 20, 2012, at 5:50 PM, Lyn Gerner wrote: >> >>> So does the checkjob for 220559 still show the "insufficient idle >>> procs available: 1056 < 1501" msg? >>> >>> Seems like somehow the TASKS request is not mapping to cores (of which >>> I surmise you have 3576) but rather procs (which in the above you have >>> 1056). >>> >>> I am really grasping at straws on this: is the "ded" QOS requesting >>> dedicated nodes, and you don't have enough? >>> >>> Not sure where else to tell you to look. 
>>> >>> Best of luck, >>> Lyn >>> >>> On 4/20/12, Naveed Near-Ansari wrote: >>>> >>>> >>>> On 04/20/2012 04:23 PM, Lyn Gerner wrote: >>>>> Naveed, >>>>> >>>>> It looks like your setup is only showing 1056 procs, not 3552: >>>>>> PE: 1501.00 StartPriority: 144235 >>>>>> job cannot run in partition DEFAULT (insufficient idle procs available: >>>>>> 1056 < 1501) >>>>> You might play w/diagnose -t (partition) and diagnose -j (job) to see >>>>> what they tell you. Also, you could try to explicitly make a >>>>> reservation for the job, and maybe then you could get info from >>>>> diagnose -r (though attempting the setres may give enough error info). >>>>> >>>>> Good luck, >>>>> Lyn >>>> >>>> Thanks for looking. >>>> >>>> I think it is configured for 3768 (i said 3552 because the queue it was >>>> sent to has that many available to it). i didn't see anything clear in >>>> either diagnose command. I attempted to create a reservation, but it >>>> failed. >>>> >>>> # setres -u ortega -d 4:00:00:00 TASKS==1501 >>>> ERROR: 'setres' failed >>>> ERROR: cannot select 1501 tasks for reservation for 3:13:33:56 >>>> ERROR: cannot select requested tasks for 'TASKS==1501' >>>> >>>> >>>> >>>> #diagnose -t >>>> Displaying Partition Status >>>> >>>> System Partition Settings: PList: DEFAULT PDef: DEFAULT >>>> >>>> Name Procs >>>> >>>> DEFAULT 3768 >>>> >>>> Partition Configured Up U/C Dedicated D/U >>>> Active A/U >>>> >>>> NODE---------------------------------------------------------------------------- >>>> DEFAULT 314 313 99.68% 297 94.89% >>>> 297 94.89% >>>> PROC---------------------------------------------------------------------------- >>>> DEFAULT 3768 3756 99.68% 3564 94.89% >>>> 3000 79.87% >>>> MEM---------------------------------------------------------------------------- >>>> DEFAULT 15156264 15107978 99.68% 14335282 94.89% >>>> 0 0.00% >>>> SWAP---------------------------------------------------------------------------- >>>> DEFAULT 30227950 30131665 99.68% 28590985 94.89% 
>>>> 1400704 4.65% >>>> DISK---------------------------------------------------------------------------- >>>> DEFAULT 314 313 99.68% 297 94.89% >>>> 0 0.00% >>>> >>>> Class/Queue State >>>> >>>> [ :]... >>>> >>>> DEFAULT [shared 3756:3756][debug 3756:3756][default 477:3756][gpu >>>> 3756:3756] >>>> >>>> >>>> >>>> #diagnose -j 220559 >>>> Name State Par Proc QOS WCLimit R Min User >>>> Group Account QueuedTime Network Opsys Arch Mem Disk >>>> Procs Class Features >>>> >>>> 220559 Idle ALL 1501 ded 4:00:00:00 0 1501 ortega >>>> simons - 1:23:34:41 [NONE] [NONE] [NONE] >=0 >=0 NC0 >>>> [default:1] [default] >>>> >>>> >>>> >>>> >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120420/f4fa1c22/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 203 bytes Desc: Message signed with OpenPGP using GPGMail Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120420/f4fa1c22/attachment-0001.bin From Gareth.Williams at csiro.au Sat Apr 21 04:49:26 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Sat, 21 Apr 2012 20:49:26 +1000 Subject: [torqueusers] can user limit max jobs running In-Reply-To: References: <4F8EEE6B.2020503@ldeo.columbia.edu> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102DCC7433B@exvic-mbx04.nexus.csiro.au> On a related note, there is a utility script in the contrib directory called 'qpool'. It loops through a cycle of sleeping and topping up a queue to a given number of jobs from a given directory. qpool was not written with array jobs in mind - you would need to explicitly split the array into separate jobs. I think it's a pretty neat utility as it is all in user space to solve a user problem. 
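As a rough illustration of the sleep-and-top-up idea (this is not the qpool script from contrib, just a sketch in Python), the logic might look like the code below. The names jobs_to_submit and top_up are made up, and count_queued and submit are hypothetical callbacks that would in practice wrap qstat and qsub.

```python
import time


def jobs_to_submit(pending, queued_now, target):
    """Return the scripts to submit so the queue is topped up to `target` jobs."""
    shortfall = max(0, target - queued_now)
    return pending[:shortfall]


def top_up(pending, target, count_queued, submit, interval=60, sleep=time.sleep):
    """Keep the queue topped up until every pending script has been submitted.

    pending       -- ordered list of job scripts still waiting to be submitted
    count_queued  -- hypothetical callback returning the user's current job count
    submit        -- hypothetical callback that submits one script
    """
    pending = list(pending)
    while pending:
        batch = jobs_to_submit(pending, count_queued(), target)
        for script in batch:
            submit(script)
        pending = pending[len(batch):]
        if pending:
            # Queue is full for now; wait before checking again.
            sleep(interval)


# With 3 jobs already queued and a target of 5, only two more are submitted.
print(jobs_to_submit(["a.sh", "b.sh", "c.sh"], queued_now=3, target=5))
# -> ['a.sh', 'b.sh']
```

The list of pending scripts could be built from a directory with something like sorted(pathlib.Path(job_dir).glob("*.sh")). Keeping the decision in a small pure function makes it easy to check without a running pbs_server.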
Gareth

From nt_mahmood at yahoo.com Mon Apr 23 06:41:30 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Mon, 23 Apr 2012 05:41:30 -0700 (PDT)
Subject: [torqueusers] torque for windows
Message-ID: <1335184890.17121.YahooMailNeo@web111709.mail.gq1.yahoo.com>

Dear All,
Is there a Windows-based version of Torque? If not, do you have any
experience managing a Windows-based cluster? Please let me know.

// Naderan *Mahmood;

From dbeer at adaptivecomputing.com Mon Apr 23 08:45:36 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 23 Apr 2012 08:45:36 -0600
Subject: [torqueusers] torque for windows
In-Reply-To: <1335184890.17121.YahooMailNeo@web111709.mail.gq1.yahoo.com>
References: <1335184890.17121.YahooMailNeo@web111709.mail.gq1.yahoo.com>
Message-ID:

These are the two closest things I know of:

1. TORQUE can run on Cygwin thanks to Vikentsi Lapa and his colleagues at
UIIP Minsk. I don't know where Maui (or whatever scheduler is used) runs
in this setup, whether on a Linux box or under Cygwin.
2. Moab is able to manage a Windows cluster, but with the help of Microsoft
Cluster Server.

HTH,

David

On Mon, Apr 23, 2012 at 6:41 AM, Mahmood Naderan wrote:

>
> Dear All,
> I want to know is there a windows based version of torque? If the answer
> is no, do you have any experience on managing a windows based cluster?
> please let me know?
>
>
> // Naderan *Mahmood;
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120423/c9621b41/attachment.html

From rhys.hill at adelaide.edu.au Sun Apr 22 21:31:53 2012
From: rhys.hill at adelaide.edu.au (Rhys Hill)
Date: Mon, 23 Apr 2012 03:31:53 +0000
Subject: [torqueusers] Torque 4.0 and job arrays
Message-ID:

Hi everyone,

I recently upgraded to Torque 4.0 alongside Moab 7.0, mostly because we'd
been having some trouble with cpusets and I'd hoped that the support for
hwloc would resolve the problem. cpusets are now working very well, but I'm
having a lot of trouble with job arrays, which form a very large part of our
workload. Torque 4.0.0 would regularly lock up when processing job arrays, so
I upgraded to the most recent 4.0.1 snapshot. That behaves much better, but
still seems unstable compared to 2.5.9.

One concrete issue is that many of our jobs that worked fine with 2.5.9 are
now stalling with 4.0.1 with the following message:

"Arrays may only be given array dependencies"

which only seems to appear in the server logs and is otherwise invisible.
This was certainly never true before, and doesn't really make sense. We
frequently use array->single job dependencies for scatter-gather type
operations.

Once the above message has been printed, the job arrays sit in a hold state
forever. They can't be removed using qdel, and if I try to break the hold
using qrls or mjobctl, they move into the queued state, but they disappear
from Moab, never actually start, and still can't be removed. The only way I
can get rid of them is to bring down pbs_server, which has to be killed via
`killall -QUIT pbs_server` since the init script cannot stop the process
properly, and delete the job files manually.

I'm currently thinking of just reverting to the old, working version of
Torque, but has anyone else had trouble with job arrays, and can the above
problems be fixed somehow?
Thanks, -------------------------------------------------------------------------------- Rhys Hill, Senior Research Associate Australian Centre for Visual Technologies University of Adelaide Phone: +61 8 8313 6197 Mail: Fax: +61 8 8313 4366 School of Computer Science University of Adelaide Adelaide, Australia http://www.cs.adelaide.edu.au/~rhys/ 5005 -------------------------------------------------------------------------------- From laotsao at gmail.com Mon Apr 23 07:09:28 2012 From: laotsao at gmail.com (=?UTF-8?B?Ikh1bmctU2hlbmcgVHNhbyAoTGFvIFRzYW8g6ICB5pu5KSBQaC5ELiI=?=) Date: Mon, 23 Apr 2012 09:09:28 -0400 Subject: [torqueusers] torque for windows In-Reply-To: <1335184890.17121.YahooMailNeo@web111709.mail.gq1.yahoo.com> References: <1335184890.17121.YahooMailNeo@web111709.mail.gq1.yahoo.com> Message-ID: <4F955488.1040205@gmail.com> hi PBSpro support window W2008R2 HPC server has its own workload manager regards On 4/23/2012 8:41 AM, Mahmood Naderan wrote: > Dear All, > I want to know is there a windows based version of torque? If the answer is no, do you have any experience on managing a windows based cluster? please let me know? > > > // Naderan *Mahmood; > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- Hung-Sheng Tsao Ph D. Founder& Principal HopBit GridComputing LLC cell: 9734950840 http://laotsao.wordpress.com/ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: laotsao.vcf Type: text/x-vcard Size: 568 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120423/56c77a7e/attachment.vcf From acaird at umich.edu Mon Apr 23 09:16:55 2012 From: acaird at umich.edu (Andrew Caird) Date: Mon, 23 Apr 2012 11:16:55 -0400 Subject: [torqueusers] torque for windows In-Reply-To: References: <1335184890.17121.YahooMailNeo@web111709.mail.gq1.yahoo.com> Message-ID: On #2, there is an excellent video describing this at http://www.adaptivecomputing.com/videos/hybrid There is also other information at www.adaptivecomputing.com about doing this. --andy On Mon, Apr 23, 2012 at 10:45 AM, David Beer wrote: > These are the two closest things I know of - > > 1. TORQUE can run on cygwin thanks to Vikentsi Lapa and his colleagues at > UIIP Minsk. I don't know where Maui or whatever scheduler is run in this > setup, whether on a linux box or cygwin. > 2. Moab is able to manage a windows cluster but with the help of Microsoft > Cluster Server. > > HTH, > > David > > On Mon, Apr 23, 2012 at 6:41 AM, Mahmood Naderan > wrote: >> >> >> Dear All, >> I want to know is there a windows based version of torque? If the answer >> is no, do you have any experience on managing a windows based cluster? >> please let me know? 
>>
>>
>> // Naderan *Mahmood;
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> --
> David Beer | Software Engineer
> Adaptive Computing
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From jascha.wang at gmail.com Mon Apr 23 09:27:30 2012
From: jascha.wang at gmail.com (Xiangqian Wang)
Date: Mon, 23 Apr 2012 23:27:30 +0800
Subject: [torqueusers] doubt of running job on infiniband network
Message-ID:

Assume my cluster has nodes with both Ethernet and InfiniBand devices
installed, say 192.168.0.8 for the Ethernet network and 10.10.0.* for the
InfiniBand network. What should I set if I want to submit my jobs to run on
the InfiniBand network?

xiangqian

From jjc at iastate.edu Mon Apr 23 10:36:53 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Mon, 23 Apr 2012 16:36:53 +0000
Subject: [torqueusers] pbs_sched cores - repost
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E2210908F705@ITSDAG1D.its.iastate.edu>

Michael,

Two possibilities worth exploring:

1) It seems like you must be using different ports. I see references to
pbs_mom actions on nodes 58, 40 and 2 related to ports 707, 726 and 746;
normally I'd expect TORQUE port numbers in the 15000+ range. Is there an
issue with a mismatch of ports between nodes?

2) The following two URLs relate to the resolution of an issue that seems
similar to yours (recent upgrade, Torque having problems some of the time):
http://www.clusterresources.com/pipermail/torqueusers/2011-March/012540.html
http://serverfault.com/questions/253932/torque-works-half-of-the-time-fails-no-permission-the-other-half

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/ From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Stevens, Michael Sent: Thursday, April 19, 2012 11:43 AM To: torqueusers at supercluster.org Subject: [torqueusers] pbs_sched cores - repost I had posted this question a few weeks ago, and received no response. Would it be more appropriate to post this to -dev? I am running a 115 node cluster using torque 2.5.7 under CentOS 6.2. This cluster is in turn running on a Vmware ESX 4.0 cluster; the idea here being that we can use the physical resources of the torque cluster when no jobs are running. I am seeing crashes of pbs_sched when the cluster gets busy. Following is some data I've been able to assemble thus far: /var/log/messages Apr 19 08:53:31 node103 dhclient[1540]: DHCPREQUEST on eth0 to 10.80.101.10 port 67 (xid=0x6c91cbd5) Apr 19 08:53:31 cluster1 dhcpd: DHCPREQUEST for 10.80.101.123 from 00:50:56:b4:7b:a3 via eth0 Apr 19 08:53:31 cluster1 dhcpd: DHCPACK on 10.80.101.123 to 00:50:56:b4:7b:a3 via eth0 Apr 19 08:53:31 node103 dhclient[1540]: DHCPACK from 10.80.101.10 (xid=0x6c91cbd5) Apr 19 08:53:33 node103 ypbind: NIS domain: affymetrix.com, NIS server: nis2 Apr 19 08:53:33 node103 dhclient[1540]: bound to 10.80.101.123 -- renewal in 16219 seconds. 
Apr 19 08:54:09 node58 pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 707 in client_to_svr - errno:115 Operation now in progress
Apr 19 08:54:15 cluster1 abrt[3555]: saved core dump of pid 2911 (/usr/sbin/pbs_sched) to /var/spool/abrt/ccpp-2012-04-19-08:54:14-2911.new/coredump (3100672 bytes)
Apr 19 08:54:15 cluster1 abrtd: Directory 'ccpp-2012-04-19-08:54:14-2911' creation detected
Apr 19 08:54:21 cluster1 abrtd: Package 'torque-scheduler' isn't signed with proper key
Apr 19 08:54:21 cluster1 abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2012-04-19-08:54:14-2911 (res:2), deleting
Apr 19 08:55:46 node40 pbs_mom: LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 726 in client_to_svr - errno:115 Operation now in progress
Apr 19 08:55:47 node2 pbs_mom: LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 746 in client_to_svr - errno:115 Operation now in progress

scheduler log

04/19/2012 08:52:55;0040; pbs_sched;Job;302539.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:52:55;0040; pbs_sched;Job;302540.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:52:55;0040; pbs_sched;Job;302541.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:52:55;0040; pbs_sched;Job;302542.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:52:55;0040; pbs_sched;Job;302543.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:52:55;0040; pbs_sched;Job;302544.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:52:55;0080; pbs_sched;Svr;main;brk point 38178816
04/19/2012 08:52:58;0040; pbs_sched;Job;302545.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:53:08;0040; pbs_sched;Job;302546.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:53:10;0040; pbs_sched;Job;302547.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:53:14;0040; pbs_sched;Job;302548.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:53:18;0040; pbs_sched;Job;302549.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:53:24;0040; pbs_sched;Job;302550.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 09:14:01;0002; pbs_sched;Svr;Log;Log opened
04/19/2012 09:14:01;0002; pbs_sched;Svr;TokenAct;Account file /var/lib/torque/sched_priv/accounting/20120419 opened
04/19/2012 09:14:01;0002; pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 4707
04/19/2012 09:14:54;0040; pbs_sched;Job;302552.cluster1.cluster.affymetrix.com;Job Run

gdb of the crash file

[root at cluster1 sched_priv]# gdb -e /usr/sbin/pbs_sched -c core.2911
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
.
[New Thread 2911]
Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/23/1bd9599ad974226f19adfdc4dae3691396c81d
Reading symbols from /usr/lib64/libtorque.so.2.0.0...Reading symbols from /usr/lib/debug/usr/lib64/libtorque.so.2.0.0.debug...done.
done.
Loaded symbols for /usr/lib64/libtorque.so.2.0.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_dns.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Core was generated by `/usr/sbin/pbs_sched -d /var/lib/torque -a 600'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003ff4a13c44 in pbs_rescquery (c=0, resclist=<value optimized out>, num_resc=<value optimized out>,
    available=0x7fffba82910c, allocated=0x7fffba829108, reserved=0x7fffba829104, down=0x7fffba829100)
    at ../Libifl/pbsD_resc.c:215
215       *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64
(gdb) bt
#0  0x0000003ff4a13c44 in pbs_rescquery (c=0, resclist=<value optimized out>, num_resc=<value optimized out>,
    available=0x7fffba82910c, allocated=0x7fffba829108, reserved=0x7fffba829104, down=0x7fffba829100)
    at ../Libifl/pbsD_resc.c:215
#1  0x000000000040c8d6 in ?? ()
#2  0x00007fffba829100 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb)

The last few lines of strace

read(8, "+2+1+0+0+6+0", 262144) = 12
write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144) = 12
write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144) = 12
write(8, "+2+12+51+9Scheduler+12+252+11des"..., 62) = 62
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144) = 12
write(8, "+2+12+51+9Scheduler+12+272+11des"..., 64) = 64
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144) = 12
write(8, "+2+12+51+9Scheduler+12+192+11des"..., 54) = 54
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144) = 12
write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144) = 12
write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144) = 12
write(8, "+2+12+51+9Scheduler+12+232+11des"..., 60) = 60
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144) = 12
write(8, "+2+12+24+9Scheduler+0+12+13nodes"..., 42) = 42
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+9+1+1+0+0+0", 262144) = 20
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(8, "+2+12+11+9Scheduler+2+22+3830255"..., 124) = 124
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+1", 262144) = 10
write(8, "+2+12+15+9Scheduler2+38302551.cl"..., 67) = 67
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 0 (Timeout)
write(8, "+2+12+11+9Scheduler+2+22+3830255"..., 139) = 139
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 0 (Timeout)
write(8, "+2+12+24+9Scheduler+0+12+13nodes"..., 42) = 42
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+1", 262144) = 10
--- SIGSEGV (Segmentation fault) @ 0 (0) ---

If there is any other information I can provide, please let me know as this is reproducible.

--
Mike Stevens
Senior UNIX Administrator
Affymetrix | 3420 Central Expressway | Santa Clara, CA 95051
Tel: 408-731-5804 | Cell: 408-507-5738
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120423/8f4024b5/attachment-0001.html

From knielson at adaptivecomputing.com Mon Apr 23 10:59:52 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Mon, 23 Apr 2012 10:59:52 -0600
Subject: [torqueusers] pbs_sched cores - repost
In-Reply-To: <242421BFAF465844BE24EB90BB97E2210908F705@ITSDAG1D.its.iastate.edu>
References: <242421BFAF465844BE24EB90BB97E2210908F705@ITSDAG1D.its.iastate.edu>
Message-ID:

On Mon, Apr 23, 2012 at 10:36 AM, Coyle, James J [ITACD] wrote:

> Michael,
>
> Two possibilities worth exploring:
>
> 1) It seems like you must be using different ports. I see references
> to pbs_mom actions on nodes 58, 40 and 2 related to ports 707, 726
> and 746; normally I'd expect TORQUE port numbers in the 15000+ range.

The ports 707, 726 and 746 are privileged ports (below 1024). If this is from the server to the mom, they will be connecting on port 15002. If it is from a client app to the server, they will be hitting port 15001. That is the default, at least.

Ken

> Is there an issue with a mismatch of ports between nodes?
>
> 2) The following two URLs relate to the resolution of an issue
> that seems similar to yours (recent upgrade, Torque having problems
> some of the time).
>
> http://www.clusterresources.com/pipermail/torqueusers/2011-March/012540.html
>
> http://serverfault.com/questions/253932/torque-works-half-of-the-time-fails-no-permission-the-other-half
>
> James Coyle, PhD
> High Performance Computing Group
> Iowa State Univ.
> web: http://jjc.public.iastate.edu/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120423/5af9b909/attachment-0001.html

From jjc at iastate.edu Mon Apr 23 11:04:33 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Mon, 23 Apr 2012 17:04:33 +0000
Subject: [torqueusers] doubt of running job on infiniband network
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E2210908F73A@ITSDAG1D.its.iastate.edu>

Preface: I assume that you mean that you have a "question" about running on an Infiniband network, rather than being "skeptical" about whether you can run a job on an Infiniband network.

Whatever application you want to use the Infiniband connection with has to have Infiniband support compiled in. Typically, the application is MPI. If so, you will need to install MPI (usually openmpi) with Infiniband support.

For openmpi compiled this way, communication uses IB if it is available, otherwise it will use IP (Ethernet) communication.

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/

>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>bounces at supercluster.org] On Behalf Of Xiangqian Wang
>Sent: Monday, April 23, 2012 10:28 AM
>To: Torque Users Mailing List
>Subject: [torqueusers] doubt of running job on infiniband network
>
>assume my cluster has nodes with ethernet and infiniband device
>installed, say, 192.168.0.8 for ethernet network and 10.10.0.* with
>infiniband network. what should i set if i want to submit my jobs to
>run on the infiniband network?
>
>xiangqian
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers

From knielson at adaptivecomputing.com Mon Apr 23 11:17:34 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Mon, 23 Apr 2012 11:17:34 -0600
Subject: [torqueusers] Torque 4.0 and job arrays
In-Reply-To:
References:
Message-ID:

On Sun, Apr 22, 2012 at 9:31 PM, Rhys Hill wrote:

> Hi everyone,
>
> I recently upgraded to torque 4.0 alongside moab 7.0, mostly because
> we'd been having some trouble with cpusets and I'd hoped that the
> support for hwloc would resolve the problem. cpusets are now working
> very well, but I'm having a lot of trouble with job arrays, which form
> a very large part of our workload.
>
> Torque 4.0.0 would regularly lock up when processing job arrays, so I
> upgraded to the most recent 4.0.1 snapshot, and that behaves much
> better, but still seems unstable compared to 2.5.9.
>
> One concrete issue is that many of our jobs that worked fine with
> 2.5.9 are now stalling with 4.0.1 with the following message:
>
> "Arrays may only be given array dependencies"
>
> which only seems to appear in the server logs and is otherwise
> invisible. This was certainly never true before, and doesn't really
> make sense. We frequently use array->single job dependencies for
> scatter-gather type operations.
>
> Once the above message has been printed, the job arrays sit in a hold
> state forever. They can't be removed using qdel and if I try to break
> the hold using qrls or mjobctl, they move into the queued state, but
> they disappear from moab and never actually start, and still can't be
> removed. The only way I can get rid of them is to bring down
> pbs_server, which has to be killed via `killall -QUIT pbs_server`
> since the init script cannot stop the process properly, and delete
> the job files manually.
>
> I'm currently thinking of just reverting to the old, working version
> of torque, but has anyone else had trouble with job arrays and can the
> above problems be fixed somehow?
>
> Thanks,
>
> --------------------------------------------------------------------------------
> Rhys Hill, Senior Research Associate
> Australian Centre for Visual Technologies    University of Adelaide

Rhys,

Thanks for the information. We will look at this in TORQUE 4.0. In the meantime you may want to create a ticket in bugzilla.

http://www.clusterresources.com/bugzilla/

Regards

Ken
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120423/83717aef/attachment.html

From dbeer at adaptivecomputing.com Mon Apr 23 11:22:27 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 23 Apr 2012 11:22:27 -0600
Subject: [torqueusers] Torque 4.0 and job arrays
In-Reply-To:
References:
Message-ID:

Rhys,

I'm surprised that you say you haven't seen this message before, as the check exists in both places and has been there since 2.5 was released. There must've been a bug that allowed it before. In this case, please try the attached patch to see if it resolves your problem for 4.0. This patch only requires you to rebuild and restart the server (dependencies are unknown to pbs_moms).

David

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120423/c55502b2/attachment.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ArrayDeps.patch
Type: application/octet-stream
Size: 754 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120423/c55502b2/attachment.obj

From nt_mahmood at yahoo.com Mon Apr 23 11:52:50 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Mon, 23 Apr 2012 10:52:50 -0700 (PDT)
Subject: [torqueusers] torque for windows
In-Reply-To:
References: <1335184890.17121.YahooMailNeo@web111709.mail.gq1.yahoo.com>
Message-ID: <1335203570.35219.YahooMailNeo@web111722.mail.gq1.yahoo.com>

Thanks guys for the info

// Naderan *Mahmood;

----- Original Message -----
From: Andrew Caird
To: Torque Users Mailing List
Cc: Mahmood Naderan
Sent: Monday, April 23, 2012 7:46 PM
Subject: Re: [torqueusers] torque for windows

On #2, there is an excellent video describing this at
http://www.adaptivecomputing.com/videos/hybrid

There is also other information at www.adaptivecomputing.com about doing this.

--andy

On Mon, Apr 23, 2012 at 10:45 AM, David Beer wrote:
> These are the two closest things I know of -
>
> 1. TORQUE can run on cygwin thanks to Vikentsi Lapa and his colleagues at
> UIIP Minsk. I don't know where Maui or whatever scheduler is run in this
> setup, whether on a linux box or cygwin.
> 2. Moab is able to manage a windows cluster but with the help of Microsoft
> Cluster Server.
>
> HTH,
>
> David
>
> On Mon, Apr 23, 2012 at 6:41 AM, Mahmood Naderan wrote:
>>
>> Dear All,
>> I want to know whether there is a Windows-based version of torque. If
>> the answer is no, do you have any experience managing a Windows-based
>> cluster? Please let me know.
>>
>> // Naderan *Mahmood;
>
> --
> David Beer | Software Engineer
> Adaptive Computing

From naveed at caltech.edu Mon Apr 23 12:49:39 2012
From: naveed at caltech.edu (Naveed Near-Ansari)
Date: Mon, 23 Apr 2012 11:49:39 -0700
Subject: [torqueusers] [Mauiusers] priority job failing to get reservation
In-Reply-To:
References: <4F91D4DD.2010204@caltech.edu> <4F91FB8D.2020902@caltech.edu>
Message-ID: <4F95A443.5000704@caltech.edu>

Thanks,

I have changed it to HIGHEST and will see what happens. Reading the docs suggests to me that it probably won't help, since this particular job has been waiting so long that it has by far the highest priority and has probably never dropped out of the reservation depth, but it is possible that it did and lost its reservation.

On 04/20/2012 08:29 PM, Lyn Gerner wrote:
> No, I'm suggesting you change the RESERVATIONPOLICY to HIGHEST.
> Completely different effect than CURRENTHIGHEST.
>
> On 4/20/12, Naveed Near-Ansari wrote:
>> Thanks,
>>
>> no recurring reservations at all. reservation policy is already set
>> that way.
>>
>> RESERVATIONPOLICY CURRENTHIGHEST
>>
>> I have been having a dickens of a time figuring out the best policy
>> for our cluster. Lots of long jobs, some small some large.
>>
>> Naveed Near-Ansari
>>
>> On Apr 20, 2012, at 7:41 PM, Lyn Gerner wrote:
>>
>>> Yes, the idle nodes do fluctuate. If you have any recurring
>>> reservations (say, for a weekly maintenance window), then it may not
>>> be able to find a big enough window to run a large, 4-day job, on
>>> dedicated nodes.
>>>
>>> You might also want to check to see if RESERVATIONPOLICY is set to
>>> HIGHEST, to make sure that the job keeps its priority reservation, if
>>> it ever gets to the top of the queue.
>>>
>>> Good luck,
>>> Lyn
>>>
>>> On 4/20/12, Naveed Near-Ansari wrote:
>>>> Thanks.
>>>>
>>>> The idle procs actually fluctuates:
>>>>
>>>> job cannot run in partition DEFAULT (insufficient idle procs
>>>> available: 744 < 1501)
>>>>
>>>> I don't think it is mapping to procs since there are 628 procs on
>>>> the system (314 nodes * 2 procs)
>>>>
>>>> The QOS does request dedicated nodes. I have seen no issue with this
>>>> on all other jobs. When someone requests 12 tasks they get 1 12 core
>>>> machine.
>>>>
>>>> I think i may be misunderstanding how priority reservations work.
>>>> Does it try to find available nodes to reserve within a timeframe
>>>> and no procs will be available within that time frame, or is it
>>>> supposed to look out forever to find the procs available. We have a
>>>> lot of long running processes, so if it is looking within a time
>>>> frame (say a month), it may not be able to find the resources. If
>>>> this is the case, is it possible to change how far ahead it looks?
>>>>
>>>> I couldn't find anything in the documentation that describes
>>>> specifically how it finds resources for priority based reservations.
>>>>
>>>> Naveed Near-Ansari
>>>>
>>>> On Apr 20, 2012, at 5:50 PM, Lyn Gerner wrote:
>>>>
>>>>> So does the checkjob for 220559 still show the "insufficient idle
>>>>> procs available: 1056 < 1501" msg?
>>>>>
>>>>> Seems like somehow the TASKS request is not mapping to cores (of
>>>>> which I surmise you have 3576) but rather procs (which in the above
>>>>> you have 1056).
>>>>>
>>>>> I am really grasping at straws on this: is the "ded" QOS requesting
>>>>> dedicated nodes, and you don't have enough?
>>>>>
>>>>> Not sure where else to tell you to look.
>>>>>
>>>>> Best of luck,
>>>>> Lyn
>>>>>
>>>>> On 4/20/12, Naveed Near-Ansari wrote:
>>>>>>
>>>>>> On 04/20/2012 04:23 PM, Lyn Gerner wrote:
>>>>>>> Naveed,
>>>>>>>
>>>>>>> It looks like your setup is only showing 1056 procs, not 3552:
>>>>>>>> PE: 1501.00 StartPriority: 144235
>>>>>>>> job cannot run in partition DEFAULT (insufficient idle procs
>>>>>>>> available: 1056 < 1501)
>>>>>>> You might play w/diagnose -t (partition) and diagnose -j (job) to
>>>>>>> see what they tell you. Also, you could try to explicitly make a
>>>>>>> reservation for the job, and maybe then you could get info from
>>>>>>> diagnose -r (though attempting the setres may give enough error
>>>>>>> info).
>>>>>>>
>>>>>>> Good luck,
>>>>>>> Lyn
>>>>>> Thanks for looking.
>>>>>>
>>>>>> I think it is configured for 3768 (i said 3552 because the queue it
>>>>>> was sent to has that many available to it). i didn't see anything
>>>>>> clear in either diagnose command. I attempted to create a
>>>>>> reservation, but it failed.
>>>>>>
>>>>>> # setres -u ortega -d 4:00:00:00 TASKS==1501
>>>>>> ERROR: 'setres' failed
>>>>>> ERROR: cannot select 1501 tasks for reservation for 3:13:33:56
>>>>>> ERROR: cannot select requested tasks for 'TASKS==1501'
>>>>>>
>>>>>> #diagnose -t
>>>>>> Displaying Partition Status
>>>>>>
>>>>>> System Partition Settings: PList: DEFAULT PDef: DEFAULT
>>>>>>
>>>>>> Name     Procs
>>>>>>
>>>>>> DEFAULT   3768
>>>>>>
>>>>>> Partition  Configured        Up     U/C  Dedicated     D/U    Active     A/U
>>>>>>
>>>>>> NODE----------------------------------------------------------------------------
>>>>>> DEFAULT           314       313  99.68%        297  94.89%       297  94.89%
>>>>>> PROC----------------------------------------------------------------------------
>>>>>> DEFAULT          3768      3756  99.68%       3564  94.89%      3000  79.87%
>>>>>> MEM-----------------------------------------------------------------------------
>>>>>> DEFAULT      15156264  15107978  99.68%   14335282  94.89%         0   0.00%
>>>>>> SWAP----------------------------------------------------------------------------
>>>>>> DEFAULT      30227950  30131665  99.68%   28590985  94.89%   1400704   4.65%
>>>>>> DISK----------------------------------------------------------------------------
>>>>>> DEFAULT           314       313  99.68%        297  94.89%         0   0.00%
>>>>>>
>>>>>> Class/Queue State
>>>>>>
>>>>>> [ :]...
>>>>>>
>>>>>> DEFAULT [shared 3756:3756][debug 3756:3756][default 477:3756][gpu 3756:3756]
>>>>>>
>>>>>> #diagnose -j 220559
>>>>>> Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features
>>>>>>
>>>>>> 220559 Idle ALL 1501 ded 4:00:00:00 0 1501 ortega simons - 1:23:34:41 [NONE] [NONE] [NONE] >=0 >=0 NC0 [default:1] [default]

--
Naveed Near-Ansari
E: naveed at caltech.edu
O: 626-395-2212
M: 626-394-3845
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4887 bytes
Desc: S/MIME Cryptographic Signature
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120423/12621bf7/attachment.bin

From samuel at unimelb.edu.au Mon Apr 23 21:12:49 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 24 Apr 2012 13:12:49 +1000
Subject: [torqueusers] doubt of running job on infiniband network
In-Reply-To: <242421BFAF465844BE24EB90BB97E2210908F73A@ITSDAG1D.its.iastate.edu>
References: <242421BFAF465844BE24EB90BB97E2210908F73A@ITSDAG1D.its.iastate.edu>
Message-ID: <4F961A31.8090304@unimelb.edu.au>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 24/04/12 03:04, Coyle, James J [ITACD] wrote:

> For openmpi compiled this way, communication uses IB if it is
> available, otherwise it will use IP (Ethernet) communication.

You can verify that it is doing so by using lsof on the process and
looking for it accessing the uverbs device, thus:

[root at merri001 ~]# lsof -p 22234 | grep uverbs
namd2 22234 samuel mem CHR 231,192 6424 /dev/infiniband/uverbs0
namd2 22234 samuel 12u CHR 231,192 0t0 6424 /dev/infiniband/uverbs0

- --
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk+WGjEACgkQO2KABBYQAh9SuACbBmkw92AlhCo8NVlTMKK2HUo4
XMgAn2rIPfvqpzF1OG6GrQwx5U1r1SDy
=VyZH
-----END PGP SIGNATURE-----

From jascha.wang at gmail.com Mon Apr 23 23:17:03 2012
From: jascha.wang at gmail.com (Xiangqian Wang)
Date: Tue, 24 Apr 2012 13:17:03 +0800
Subject: [torqueusers] doubt of running job on infiniband network
In-Reply-To: <4F961A31.8090304@unimelb.edu.au>
References: <242421BFAF465844BE24EB90BB97E2210908F73A@ITSDAG1D.its.iastate.edu> <4F961A31.8090304@unimelb.edu.au>
Message-ID:

Thank you, James and Christopher. It seems that I need no further
configuration for torque.

On 2012-4-24 at 11:13 AM, "Christopher Samuel" wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 24/04/12 03:04, Coyle, James J [ITACD] wrote:
>
> > For openmpi compiled this way, communication uses IB if it is
> > available, otherwise it will use IP (Ethernet) communication.
>
> You can verify that it is doing so by using lsof on the process and
> looking for it accessing the uverbs device, thus:
>
> [root at merri001 ~]# lsof -p 22234 | grep uverbs
> namd2 22234 samuel mem CHR 231,192 6424
> /dev/infiniband/uverbs0
> namd2 22234 samuel 12u CHR 231,192 0t0 6424
> /dev/infiniband/uverbs0
>
>
> - --
> Christopher Samuel - Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.unimelb.edu.au/
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk+WGjEACgkQO2KABBYQAh9SuACbBmkw92AlhCo8NVlTMKK2HUo4
> XMgAn2rIPfvqpzF1OG6GrQwx5U1r1SDy
> =VyZH
> -----END PGP SIGNATURE-----
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/e06b9ddf/attachment.html

From rhys.hill at adelaide.edu.au Tue Apr 24 00:23:55 2012
From: rhys.hill at adelaide.edu.au (Rhys Hill)
Date: Tue, 24 Apr 2012 06:23:55 +0000
Subject: [torqueusers] Torque 4.0 and job arrays
In-Reply-To:
References:
Message-ID: <127C9691-E776-409C-8305-E5EDF8B47440@adelaide.edu.au>

Hi David,

Thanks for that. I've just found and fixed some other bugs which I've added to
bugzilla. The one issue that remains is odd.
It seems that we have a situation
where an array is stuck, when all of its component jobs are finished.

For instance, qstat -f says this:

Job Id: 678[].moby.cs.adelaide.edu.au
    Job_Name = YZ_Oxford_group
    Job_Owner = yanzhichen at moby.cs.adelaide.edu.au
    job_state = Q
    queue = batch
    server = moby.cs.adelaide.edu.au
    Checkpoint = u
    ctime = Tue Apr 24 09:26:10 2012
    Error_Path = moby.cs.adelaide.edu.au:/home/yanzhichen/moby/oxbuilding_vocabulary/out.e.txt
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Tue Apr 24 09:26:10 2012
    Output_Path = moby.cs.adelaide.edu.au:/home/yanzhichen/moby/oxbuilding_vocabulary/out.o.txt
    Priority = 0
    qtime = Tue Apr 24 09:26:10 2012
    Rerunable = True
    Resource_List.mem = 5gb
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=1
    Resource_List.pmem = 5gb
    Resource_List.pvmem = 8gb
    Resource_List.walltime = 48:00:00
    etime = Tue Apr 24 09:26:10 2012
    submit_args = -t 2-11 ./job_dogroup
    job_array_request = 2-11
    fault_tolerant = False
    job_radix = 0
    submit_host = moby.cs.adelaide.edu.au
    init_work_dir = /home/yanzhichen/moby/oxbuilding_vocabulary

whereas qstat -ft has no mention of 678[x] at all. qdel and qdel -p have no effect
on jobs like these. I think I've submitted a fix for the problem that leads to the
job getting into this state, but it would be handy if qdel could remove it.

Thanks,

On 24/04/2012, at 2:52 AM, David Beer wrote:

> Rhys,
>
> I'm surprised that you say you haven't seen this message before, as the check exists in both places and has been there since 2.5 was released. There must've been a bug that allowed it before. In this case, please try the attached patch to see if it resolves your problem for 4.0. This patch only requires you to rebuild and restart the server (dependencies are unknown to pbs_moms).
> > David > > On Sun, Apr 22, 2012 at 9:31 PM, Rhys Hill wrote: > Hi everyone, > > I recently upgraded to torque 4.0 alongside moab 7.0, mostly because we'd been > having some trouble with cpusets and I'd hoped that the support for hwloc would > resolve the problem. cpusets are now working very well, but I'm having a lot of > trouble with job arrays, which form a very large part of our workload. > > Torque 4.0.0 would regularly lock-up when processing job arrays, so I upgraded to > the most recent 4.0.1 snapshot, and that behaves much better, but still seems > unstable compared to 2.5.9. > > One concrete issue is that many of our jobs that worked fine with 2.5.9 are now > stalling with 4.0.1 with the following message: > > "Arrays may only be given array dependencies" > > which only seems to appear in the server logs and is otherwise invisible. This > was certainly never true before, and doesn't really make sense. We frequently > use array->single job dependencies for scatter-gather type operations. > > Once the above message has been printed, the job arrays sit in a hold state forever. > They can't be removed using qdel and if I try to break the hold using qrls or > mjobctl, they move into the queued state, but they disappear from moab and never > actually start, and still can't be removed. The only way I can get rid of them > is to bring down pbs_server, which has to killed via `killall -QUIT pbs_server` > since the init script cannot stop the process properly, and delete the job > files manually. > > I'm currently thinking of just reverting to the old, working version of torque, > but has anyone else had trouble with job arrays and can the above problems be > fixed somehow? 
>
> Thanks,
>
> --------------------------------------------------------------------------------
> Rhys Hill, Senior Research Associate
> Australian Centre for Visual Technologies          University of Adelaide
>
> Phone: +61 8 8313 6197          Mail:
> Fax:   +61 8 8313 4366          School of Computer Science
>                                 University of Adelaide
>                                 Adelaide, Australia
> http://www.cs.adelaide.edu.au/~rhys/               5005
> --------------------------------------------------------------------------------
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
> --
> David Beer | Software Engineer
> Adaptive Computing
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

--------------------------------------------------------------------------------
Rhys Hill, Senior Research Associate
Australian Centre for Visual Technologies          University of Adelaide

Phone: +61 8 8313 6197          Mail:
Fax:   +61 8 8313 4366          School of Computer Science
                                University of Adelaide
                                Adelaide, Australia
http://www.cs.adelaide.edu.au/~rhys/               5005
--------------------------------------------------------------------------------

From stijn.deweirdt at ugent.be Tue Apr 24 03:44:32 2012
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Tue, 24 Apr 2012 11:44:32 +0200
Subject: [torqueusers] torque 4.0.1 snapshot build issues
Message-ID: <4F967600.8080606@ugent.be>

hi all,

i'm trying to get started with torque 4, and i am starting with the
latest snapshot (to avoid some of the known issues of the 4.0.0 release).

however, i'm stuck building the rpms:

a. the torque.spec file isn't that snapshot friendly (you can't override
tarversion from the command line, etc.). the attached patch makes it
possible to do
  rpmbuild -ba --define 'snapversion 201204031702' torque.spec

b.
building the spec file with drmaa support fails when making the
src/drmaa/test with
  ../../../src/drmaa/src/.libs/libdrmaa.so: undefined reference to `pbs_statjob_err'
the makefile doesn't link with ../../../src/lib/Libpbs/.libs/libtorque.so
patch in attachment

stijn
-------------- next part --------------
A non-text attachment was scrubbed...
Name: torque_4_snap.patch
Type: text/x-patch
Size: 635 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/e0cbdc3d/attachment.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: torque_4_drmaa_test.patch
Type: text/x-patch
Size: 2355 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/e0cbdc3d/attachment-0001.bin

From roy.dragseth at cc.uit.no Tue Apr 24 06:34:00 2012
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Tue, 24 Apr 2012 14:34 +0200
Subject: [torqueusers] pbs_server crashes when deleting and adding nodes.
Message-ID: <1460284.hyFAUKa0Pm@newton.cc.uit.no>

System specs: CentOS 6.2, torque v3.0.4

pbs_server segfaults with the backtrace listed below when deleting and
re-adding nodes with qmgr:

[root at hpc1 ~]# qmgr < /tmp/addnodes.sh
Max open servers: 10239
delete node compute-0-0
qmgr obj=compute-0-0 svr=default: Unknown node
create node compute-0-0 np=2,ntype=cluster
delete node compute-0-1
create node compute-0-1 np=2,ntype=cluster
qmgr obj=compute-0-1 svr=default: End of File
delete node compute-0-2

[root at hpc1 ~]# cat /tmp/addnodes.sh
delete node compute-0-0
create node compute-0-0 np=2,ntype=cluster
delete node compute-0-1
create node compute-0-1 np=2,ntype=cluster
delete node compute-0-2
create node compute-0-2 np=2,ntype=cluster

Backtrace

[root at hpc1 torque-3.0.4]# gdb /opt/torque/sbin/pbs_server
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /opt/torque/sbin/pbs_server...done.
(gdb) run
Starting program: /opt/torque/sbin/pbs_server
pbs_server is up

Program received signal SIGSEGV, Segmentation fault.
0x0000000000412a8c in update_nodes_file () at node_func.c:1314
1314        if (np->nd_state & INUSE_DELETED)
Missing separate debuginfos, use: debuginfo-install torque-3.0.4-1.x86_64
(gdb) bt
#0 0x0000000000412a8c in update_nodes_file () at node_func.c:1314
#1 0x0000000000430328 in mgr_node_create (preq=0x11c1e30) at req_manager.c:2209
#2 0x0000000000430420 in req_manager (preq=0x11c1e30) at req_manager.c:2271
#3 0x0000000000424010 in dispatch_request (sfds=14, request=0x11c1e30) at process_request.c:862
#4 0x0000000000423e65 in process_request (sfds=14) at process_request.c:734
#5 0x00002aaaaaad6164 in wait_request (waittime=32, SState=0x7409d8) at ../Libnet/net_server.c:507
#6 0x0000000000420f66 in main_loop () at pbsd_main.c:1238
#7 0x0000000000421c8d in main (argc=1, argv=0x7fffffffe2d8) at pbsd_main.c:1793
(gdb) list
1309
1310      for (i = 0;i < svr_totnodes;++i)
1311        {
1312        np = pbsndmast[i];
1313
1314        if (np->nd_state & INUSE_DELETED)
1315          continue;
1316
1317        /* ... write its name, and if time-shared, append :ts */
1318
(gdb) cont
Continuing.

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.

In Rocks this means that the pbs_server crashes every time one runs
rocks sync config
which refreshes the cluster config files.

Any clues?

r.

--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56.
email: roy.dragseth at uit.no From roy.dragseth at cc.uit.no Tue Apr 24 07:19:06 2012 From: roy.dragseth at cc.uit.no (Roy Dragseth) Date: Tue, 24 Apr 2012 15:19:06 +0200 Subject: [torqueusers] pbs_server crashes when deleting and adding nodes. In-Reply-To: <1460284.hyFAUKa0Pm@newton.cc.uit.no> References: <1460284.hyFAUKa0Pm@newton.cc.uit.no> Message-ID: <1403691.J1VpU34d66@newton.cc.uit.no> Hmm, the problem seems to be solved in the 3.0.5 release candidate announced earlier. Any projections on the release date of 3.0.5? r. From christina.salls at noaa.gov Tue Apr 24 07:35:49 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Tue, 24 Apr 2012 09:35:49 -0400 Subject: [torqueusers] assigning resources to a queue Message-ID: Hi all, I am new to torque and have implemented a very simple configuration which has been working just fine so far. I have torque installed on a single 20 node cluster using the built in scheduler. At this point I have a few users that are submitting jobs to a single default queue. I have a new user that is going to have one node "assigned" to him. I was planning to configure a separate queue for his jobs, and ask the users to request the queue as part of their queue submission script. Is there a way to assign a particular node to a queue? And the rest of the nodes to the other queue? Would this be the simplest way to accomplish the objective? Thanks in advance. Christina ps. [root at wings rpy2-2.2.0]# qmgr Max open servers: 10239 Qmgr: print server # # Create queues and set their attributes. # # # Create and define queue hydrology # create queue hydrology set queue hydrology queue_type = Execution set queue hydrology enabled = True set queue hydrology started = True # # Create and define queue batch # create queue batch set queue batch queue_type = Execution set queue batch enabled = True set queue batch started = True # # Set server attributes. 
# set server scheduling = True set server acl_hosts = admin.default.domain set server acl_hosts += wings.glerl.noaa.gov set server managers = root at wings.glerl.noaa.gov set server managers += root@* set server managers += salls at wings.glerl.noaa.gov set server operators = root at wings.glerl.noaa.gov set server operators += salls at wings.glerl.noaa.gov set server default_queue = batch set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 100 set server node_check_rate = 150 set server tcp_timeout = 6 set server mom_job_sync = True set server keep_completed = 0 set server auto_node_np = True set server next_job_number = 760 set server np_default = 12 set server record_job_info = True set server record_job_script = True set server job_log_keep_days = 10 [root at wings rpy2-2.2.0]# pbsnodes -a n001 state = free np = 12 ntype = cluster status = rectime=1335274459,varattr=,jobs=,state=free,netload=586401282,gres=,loadave=0.00,ncpus=12,physmem=20463136kb,availmem=27850272kb,totmem=28655128kb,idletime=510345,nusers=1,nsessions=1,sessions=30936,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n002 state = free np = 12 ntype = cluster status = rectime=1335274453,varattr=,jobs=,state=free,netload=766967700,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31910444kb,totmem=32792076kb,idletime=510298,nusers=1,nsessions=1,sessions=29658,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n003 state = free np = 12 ntype = cluster status = rectime=1335274478,varattr=,jobs=,state=free,netload=123926745,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31938444kb,totmem=32792076kb,idletime=510352,nusers=1,nsessions=1,sessions=29463,uname=Linux n003 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n004 state = free np = 12 ntype = cluster status = 
rectime=1335274486,varattr=,jobs=,state=free,netload=5979387196,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31914772kb,totmem=32792076kb,idletime=510379,nusers=1,nsessions=1,sessions=29625,uname=Linux n004 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n005 state = free np = 12 ntype = cluster status = rectime=1335274467,varattr=,jobs=,state=free,netload=123646737,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31920932kb,totmem=32792076kb,idletime=510320,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux n005 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n006 state = free np = 12 ntype = cluster status = rectime=1335274483,varattr=,jobs=,state=free,netload=123497908,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31958328kb,totmem=32792076kb,idletime=510349,nusers=1,nsessions=1,sessions=29461,uname=Linux n006 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n007 state = free np = 12 ntype = cluster status = rectime=1335274454,varattr=,jobs=,state=free,netload=124286477,gres=,loadave=0.09,ncpus=12,physmem=24600084kb,availmem=31923372kb,totmem=32792076kb,idletime=510340,nusers=1,nsessions=1,sessions=29488,uname=Linux n007 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n008 state = free np = 12 ntype = cluster status = rectime=1335274460,varattr=,jobs=,state=free,netload=123313535,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31924724kb,totmem=32792076kb,idletime=510321,nusers=1,nsessions=1,sessions=29439,uname=Linux n008 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n009 state = free np = 12 ntype = cluster status = 
rectime=1335274450,varattr=,jobs=,state=free,netload=117452022,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31925280kb,totmem=32792076kb,idletime=510335,nusers=1,nsessions=1,sessions=28949,uname=Linux n009 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n010 state = free np = 12 ntype = cluster status = rectime=1335274466,varattr=,jobs=,state=free,netload=117146413,gres=,loadave=0.07,ncpus=12,physmem=24600084kb,availmem=31920892kb,totmem=32792076kb,idletime=510339,nusers=1,nsessions=1,sessions=28894,uname=Linux n010 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n011 state = free np = 12 ntype = cluster status = rectime=1335274484,varattr=,jobs=,state=free,netload=116981665,gres=,loadave=0.03,ncpus=12,physmem=24600084kb,availmem=31918964kb,totmem=32792076kb,idletime=510345,nusers=1,nsessions=1,sessions=28907,uname=Linux n011 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n012 state = free np = 12 ntype = cluster status = rectime=1335274474,varattr=,jobs=,state=free,netload=58328745076,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31890504kb,totmem=32792076kb,idletime=510341,nusers=1,nsessions=1,sessions=29498,uname=Linux n012 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n013 state = free np = 12 ntype = cluster status = rectime=1335274461,varattr=,jobs=,state=free,netload=115120292,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31919392kb,totmem=32792076kb,idletime=510344,nusers=1,nsessions=1,sessions=28869,uname=Linux n013 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n014 state = free np = 12 ntype = cluster status = rectime=1335274471,varattr=,jobs=,state=free,netload=114635782,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31921548kb,totmem=32792076kb,idletime=510327,nusers=0,nsessions=? 15201,sessions=? 
15201,uname=Linux n014 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n015 state = free np = 12 ntype = cluster status = rectime=1335274481,varattr=,jobs=,state=free,netload=114077034,gres=,loadave=0.02,ncpus=12,physmem=24600084kb,availmem=31921576kb,totmem=32792076kb,idletime=510350,nusers=1,nsessions=1,sessions=28897,uname=Linux n015 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n016 state = free np = 12 ntype = cluster status = rectime=1335274470,varattr=,jobs=,state=free,netload=114293556,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31926780kb,totmem=32792076kb,idletime=510351,nusers=1,nsessions=1,sessions=28918,uname=Linux n016 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n017 state = free np = 12 ntype = cluster status = rectime=1335274484,varattr=,jobs=,state=free,netload=114418798,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31919716kb,totmem=32792076kb,idletime=510376,nusers=1,nsessions=1,sessions=28922,uname=Linux n017 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n018 state = free np = 12 ntype = cluster status = rectime=1335274470,varattr=,jobs=,state=free,netload=257640357666,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31861476kb,totmem=32792076kb,idletime=510314,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux n018 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n019 state = free np = 12 ntype = cluster status = rectime=1335274475,varattr=,jobs=,state=free,netload=109792676,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31924640kb,totmem=32792076kb,idletime=510345,nusers=0,nsessions=? 15201,sessions=? 
15201,uname=Linux n019 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n020 state = free np = 12 ntype = cluster status = rectime=1335274469,varattr=,jobs=,state=free,netload=58239290441,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31868680kb,totmem=32792076kb,idletime=510354,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux n020 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/7a502bce/attachment-0001.html From knielson at adaptivecomputing.com Tue Apr 24 09:07:24 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 24 Apr 2012 09:07:24 -0600 Subject: [torqueusers] Torque 4.0 and job arrays In-Reply-To: <127C9691-E776-409C-8305-E5EDF8B47440@adelaide.edu.au> References: <127C9691-E776-409C-8305-E5EDF8B47440@adelaide.edu.au> Message-ID: On Tue, Apr 24, 2012 at 12:23 AM, Rhys Hill wrote: > Hi David, > > Thanks for that. I've just found and fixed some other bugs which I've > added to > bugzilla. The one issue that remains is odd. It seems that we have a > situation > where an array is stuck, when all of it's component jobs are finished. 
> > For instance, qstat -f says this: > > Job Id: 678[].moby.cs.adelaide.edu.au > Job_Name = YZ_Oxford_group > Job_Owner = yanzhichen at moby.cs.adelaide.edu.au > job_state = Q > queue = batch > server = moby.cs.adelaide.edu.au > Checkpoint = u > ctime = Tue Apr 24 09:26:10 2012 > Error_Path = moby.cs.adelaide.edu.au: > /home/yanzhichen/moby/oxbuilding_voca > bulary/out.e.txt > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = a > mtime = Tue Apr 24 09:26:10 2012 > Output_Path = moby.cs.adelaide.edu.au: > /home/yanzhichen/moby/oxbuilding_voc > abulary/out.o.txt > Priority = 0 > qtime = Tue Apr 24 09:26:10 2012 > Rerunable = True > Resource_List.mem = 5gb > Resource_List.nodect = 1 > Resource_List.nodes = 1:ppn=1 > Resource_List.pmem = 5gb > Resource_List.pvmem = 8gb > Resource_List.walltime = 48:00:00 > etime = Tue Apr 24 09:26:10 2012 > submit_args = -t 2-11 ./job_dogroup > job_array_request = 2-11 > fault_tolerant = False > job_radix = 0 > submit_host = moby.cs.adelaide.edu.au > init_work_dir = /home/yanzhichen/moby/oxbuilding_vocabulary > > whereas qstat -ft has no mention of 678[x] at all. qdel and qdel -p have > no effect > on jobs like these. I think I've submitted a fix for the problem that > leads to the > job getting into this state, but it would be handy if qdel could remove it. > > Rhys, To delete an element of the array or list all of the elements in an array you need to use the -t option. For example qstat -t will not only list the array master but every job in the array and its current state. qdel is the same. You need to use qdel -t to delete an individual job in the array. Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/2420e2b5/attachment.html

From dbeer at adaptivecomputing.com Tue Apr 24 09:38:21 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 24 Apr 2012 09:38:21 -0600
Subject: [torqueusers] assigning resources to a queue
In-Reply-To:
References:
Message-ID:

Christina

To make the hydrology queue request by default - 'set queue hydrology resources_default.need_nodes='

David

On Tue, Apr 24, 2012 at 7:35 AM, Christina Salls wrote:

> Hi all,
>
> I am new to torque and have implemented a very simple configuration
> which has been working just fine so far. I have torque installed on a
> single 20 node cluster using the built in scheduler. At this point I have
> a few users that are submitting jobs to a single default queue. I have a
> new user that is going to have one node "assigned" to him. I was planning
> to configure a separate queue for his jobs, and ask the users to request
> the queue as part of their queue submission script. Is there a way to
> assign a particular node to a queue? And the rest of the nodes to the
> other queue? Would this be the simplest way to accomplish the objective?
> Thanks in advance.
>
> Christina

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/ebf85aeb/attachment-0001.html

From dbeer at adaptivecomputing.com Tue Apr 24 09:46:34 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 24 Apr 2012 09:46:34 -0600
Subject: [torqueusers] Torque 4.0 and job arrays
In-Reply-To: <127C9691-E776-409C-8305-E5EDF8B47440@adelaide.edu.au>
References: <127C9691-E776-409C-8305-E5EDF8B47440@adelaide.edu.au>
Message-ID:

Rhys,

Just to confirm - that patch fixed your problem? If so I will see that it
gets checked in. We will look at these other bugzilla issues that you created.
Thanks for taking the time to report them and in some cases offer solutions. We really appreciate the effort to help make TORQUE better. David On Tue, Apr 24, 2012 at 12:23 AM, Rhys Hill wrote: > Hi David, > > Thanks for that. I've just found and fixed some other bugs which I've > added to > bugzilla. The one issue that remains is odd. It seems that we have a > situation > where an array is stuck, when all of it's component jobs are finished. > > For instance, qstat -f says this: > > Job Id: 678[].moby.cs.adelaide.edu.au > Job_Name = YZ_Oxford_group > Job_Owner = yanzhichen at moby.cs.adelaide.edu.au > job_state = Q > queue = batch > server = moby.cs.adelaide.edu.au > Checkpoint = u > ctime = Tue Apr 24 09:26:10 2012 > Error_Path = moby.cs.adelaide.edu.au: > /home/yanzhichen/moby/oxbuilding_voca > bulary/out.e.txt > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = a > mtime = Tue Apr 24 09:26:10 2012 > Output_Path = moby.cs.adelaide.edu.au: > /home/yanzhichen/moby/oxbuilding_voc > abulary/out.o.txt > Priority = 0 > qtime = Tue Apr 24 09:26:10 2012 > Rerunable = True > Resource_List.mem = 5gb > Resource_List.nodect = 1 > Resource_List.nodes = 1:ppn=1 > Resource_List.pmem = 5gb > Resource_List.pvmem = 8gb > Resource_List.walltime = 48:00:00 > etime = Tue Apr 24 09:26:10 2012 > submit_args = -t 2-11 ./job_dogroup > job_array_request = 2-11 > fault_tolerant = False > job_radix = 0 > submit_host = moby.cs.adelaide.edu.au > init_work_dir = /home/yanzhichen/moby/oxbuilding_vocabulary > > whereas qstat -ft has no mention of 678[x] at all. qdel and qdel -p have > no effect > on jobs like these. I think I've submitted a fix for the problem that > leads to the > job getting into this state, but it would be handy if qdel could remove it. > > Thanks, > > On 24/04/2012, at 2:52 AM, David Beer wrote: > > > Rhys, > > > > I'm surprised that you say you haven't seen this message before, as the > check exists in both places and has been there since 2.5 was released. 
> There must've been a bug that allowed it before. In this case, please try > the attached patch to see if it resolves your problem for 4.0. This patch > only requires you to rebuild and restart the server (dependencies are > unknown to pbs_moms). > > > > David > > > > On Sun, Apr 22, 2012 at 9:31 PM, Rhys Hill > wrote: > > Hi everyone, > > > > I recently upgraded to torque 4.0 alongside moab 7.0, mostly because > we'd been > > having some trouble with cpusets and I'd hoped that the support for > hwloc would > > resolve the problem. cpusets are now working very well, but I'm having a > lot of > > trouble with job arrays, which form a very large part of our workload. > > > > Torque 4.0.0 would regularly lock-up when processing job arrays, so I > upgraded to > > the most recent 4.0.1 snapshot, and that behaves much better, but still > seems > > unstable compared to 2.5.9. > > > > One concrete issue is that many of our jobs that worked fine with 2.5.9 > are now > > stalling with 4.0.1 with the following message: > > > > "Arrays may only be given array dependencies" > > > > which only seems to appear in the server logs and is otherwise > invisible. This > > was certainly never true before, and doesn't really make sense. We > frequently > > use array->single job dependencies for scatter-gather type operations. > > > > Once the above message has been printed, the job arrays sit in a hold > state forever. > > They can't be removed using qdel and if I try to break the hold using > qrls or > > mjobctl, they move into the queued state, but they disappear from moab > and never > > actually start, and still can't be removed. The only way I can get rid > of them > > is to bring down pbs_server, which has to killed via `killall -QUIT > pbs_server` > > since the init script cannot stop the process properly, and delete the > job > > files manually. 
> > > > I'm currently thinking of just reverting to the old, working version of > torque, > > but has anyone else had trouble with job arrays and can the above > problems be > > fixed somehow? > > > > Thanks, > > > > > -------------------------------------------------------------------------------- > > Rhys Hill, Senior Research > Associate > > Australian Centre for Visual Technologies University of > Adelaide > > > > Phone: +61 8 8313 6197 Mail: > > Fax: +61 8 8313 4366 School of Computer > Science > > University of Adelaide > > Adelaide, Australia > > http://www.cs.adelaide.edu.au/~rhys/ 5005 > > > -------------------------------------------------------------------------------- > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > > > -- > > David Beer | Software Engineer > > Adaptive Computing > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > -------------------------------------------------------------------------------- > Rhys Hill, Senior Research > Associate > Australian Centre for Visual Technologies University of > Adelaide > > Phone: +61 8 8313 6197 Mail: > Fax: +61 8 8313 4366 School of Computer > Science > University of Adelaide > Adelaide, Australia > http://www.cs.adelaide.edu.au/~rhys/ 5005 > > -------------------------------------------------------------------------------- > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/09b8476f/attachment.html From christina.salls at noaa.gov Tue Apr 24 09:50:39 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Tue, 24 Apr 2012 11:50:39 -0400 Subject: [torqueusers] assigning resources to a queue In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 11:38 AM, David Beer wrote: > Christina > > To make the hydrology queue request by default - > > 'set queue hydrology resources_default.need_nodes=' > > David > > Thanks David!! -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/9e4c919b/attachment.html From christina.salls at noaa.gov Tue Apr 24 10:00:16 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Tue, 24 Apr 2012 12:00:16 -0400 Subject: [torqueusers] assigning resources to a queue In-Reply-To: References: Message-ID: Hi David, I tried using that synax and had a problem: Qmgr: set queue hydrology resources_default.need_nodes= qmgr obj=hydrology svr=default: Unknown resource type resources_default.need_nodes This is what ended up working for me. Thanks again for your help! Qmgr: set queue hydrology resources_default.neednodes= Qmgr: print queue hydrology # # Create queues and set their attributes. # # # Create and define queue hydrology # create queue hydrology set queue hydrology queue_type = Execution set queue hydrology resources_default.neednodes = set queue hydrology enabled = True set queue hydrology started = True Qmgr: On Tue, Apr 24, 2012 at 11:50 AM, Christina Salls wrote: > > > On Tue, Apr 24, 2012 at 11:38 AM, David Beer wrote: > >> Christina >> >> To make the hydrology queue request by default - >> >> 'set queue hydrology resources_default.need_nodes=' >> >> David >> >> > Thanks David!! > > > -- Christina A. 
Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/656b8f6c/attachment-0001.html From gus at ldeo.columbia.edu Tue Apr 24 10:10:16 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 24 Apr 2012 12:10:16 -0400 Subject: [torqueusers] assigning resources to a queue In-Reply-To: References: Message-ID: <4F96D068.5050401@ldeo.columbia.edu> On 04/24/2012 12:00 PM, Christina Salls wrote: > Hi David, > > I tried using that synax and had a problem: > > Qmgr: set queue hydrology resources_default.need_nodes= The syntax takes just the node name, I believe: set queue hydrology resources_default.need_nodes=n020 > qmgr obj=hydrology svr=default: Unknown resource type > resources_default.need_nodes > > This is what ended up working for me. Thanks again for your help! > > Qmgr: set queue hydrology resources_default.neednodes= > Qmgr: print queue hydrology > # > # Create queues and set their attributes. > # > # > # Create and define queue hydrology > # > create queue hydrology > set queue hydrology queue_type = Execution > set queue hydrology resources_default.neednodes = > set queue hydrology enabled = True > set queue hydrology started = True > Qmgr: > > > > On Tue, Apr 24, 2012 at 11:50 AM, Christina Salls > > wrote: > > > > On Tue, Apr 24, 2012 at 11:38 AM, David Beer > > > wrote: > > Christina > > To make the hydrology queue request by default - > > 'set queue hydrology resources_default.need_nodes=' > > David > > > Thanks David!! > > > > > > -- > Christina A. 
Salls > GLERL Computer Group > help.glerl at noaa.gov > Help Desk x2127 > Christina.Salls at noaa.gov > Voice Mail 734-741-2446 > > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From christina.salls at noaa.gov Tue Apr 24 10:21:41 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Tue, 24 Apr 2012 12:21:41 -0400 Subject: [torqueusers] assigning resources to a queue In-Reply-To: <4F96D068.5050401@ldeo.columbia.edu> References: <4F96D068.5050401@ldeo.columbia.edu> Message-ID: On Tue, Apr 24, 2012 at 12:10 PM, Gus Correa wrote: > On 04/24/2012 12:00 PM, Christina Salls wrote: > > Hi David, > > > > I tried using that synax and had a problem: > > > > Qmgr: set queue hydrology resources_default.need_nodes= > > The syntax takes just the node name, I believe: > > set queue hydrology resources_default.need_nodes=n020 > > > Thanks Gus, I unset the resource and reset it without the brackets. Qmgr: unset queue hydrology resources_default.neednodes Qmgr: set queue hydrology resources_default.neednodes= n020 Qmgr: print queue hydrology # # Create queues and set their attributes. # # # Create and define queue hydrology # create queue hydrology set queue hydrology queue_type = Execution set queue hydrology resources_default.neednodes = n020 set queue hydrology enabled = True set queue hydrology started = True -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/bb4959b0/attachment.html From dbeer at adaptivecomputing.com Tue Apr 24 10:59:04 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Tue, 24 Apr 2012 10:59:04 -0600 Subject: [torqueusers] assigning resources to a queue In-Reply-To: References: <4F96D068.5050401@ldeo.columbia.edu> Message-ID: Sorry for the confusion, I meant that you should replace with the node's name. Thanks for clarifying Gus. 
David -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/82bd5c0e/attachment-0001.html From christina.salls at noaa.gov Tue Apr 24 11:09:13 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Tue, 24 Apr 2012 13:09:13 -0400 Subject: [torqueusers] assigning resources to a queue In-Reply-To: References: <4F96D068.5050401@ldeo.columbia.edu> Message-ID: On Tue, Apr 24, 2012 at 12:59 PM, David Beer wrote: > Sorry for the confusion, I meant that you should replace > with the node's name. Thanks for clarifying Gus. > > Of course, I should have realized that! Thanks again. Now that I have node n020 assigned to the hydrology queue, will that make it unavailable to the batch queue? 
-- Christina A. 
Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/85b309ba/attachment.html From gus at ldeo.columbia.edu Tue Apr 24 11:28:48 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 24 Apr 2012 13:28:48 -0400 Subject: [torqueusers] assigning resources to a queue In-Reply-To: References: <4F96D068.5050401@ldeo.columbia.edu> Message-ID: <4F96E2D0.7000201@ldeo.columbia.edu> On 04/24/2012 01:09 PM, Christina Salls wrote: > > > On Tue, Apr 24, 2012 at 12:59 PM, David Beer > > wrote: > > Sorry for the confusion, I meant that you should replace node> with the node's name. Thanks for clarifying Gus. > > Of course, I should have realized that! Thanks again. > > Now that I have node n020 assigned to the hydrology queue, will that > make it unavailable to the batch queue? > In maui.cfg ENABLEMULTIREQJOBS TRUE See the Maui Admin Guide for the meaning: http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php I hope it helps, Gus Correa 
From Michael_Stevens at affymetrix.com Tue Apr 24 11:29:30 2012 From: Michael_Stevens at affymetrix.com (Stevens, Michael) Date: Tue, 24 Apr 2012 10:29:30 -0700 Subject: [torqueusers] pbs_sched cores - repost In-Reply-To: <242421BFAF465844BE24EB90BB97E2210908F705@ITSDAG1D.its.iastate.edu> References: <242421BFAF465844BE24EB90BB97E2210908F705@ITSDAG1D.its.iastate.edu> Message-ID: > Two possibilities worth exploring: [snip] > http://serverfault.com/questions/253932/torque-works-half-of-the-time-fails-no-permission-the-other-half This thread led me to discover some old .BK files in ./server_priv/jobs from back in February. I've backed up and deleted those files, and restarted the server and scheduler. Hopefully, that will help. 
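For anyone wanting to repeat the cleanup described above, here is a minimal sketch of the "back up, then remove" step for stale .BK files. It is only an illustration, not something taken from the thread: the helper name archive_stale_bk, the 30-day threshold and the example paths are all assumptions, and the TORQUE home directory differs between installations. It relies on find's -maxdepth option (available in GNU and BSD find), and it is probably safest to stop pbs_server before touching anything under server_priv/jobs.

```shell
#!/bin/sh
# Sketch: move stale *.BK job files out of a TORQUE jobs directory
# into a backup directory instead of deleting them outright.

# archive_stale_bk JOBS_DIR BACKUP_DIR MIN_AGE_DAYS
# (hypothetical helper; not part of TORQUE)
archive_stale_bk() {
    jobs_dir=$1
    backup_dir=$2
    min_age_days=$3

    # Create the backup directory if it does not exist yet.
    mkdir -p "$backup_dir" || return 1

    # -maxdepth 1 : only look at files directly inside JOBS_DIR
    # -mtime +N   : last modified more than N days ago
    find "$jobs_dir" -maxdepth 1 -type f -name '*.BK' \
        -mtime +"$min_age_days" -exec mv -- {} "$backup_dir"/ \;
}

# Example call (both paths are assumptions; adjust to your site):
# archive_stale_bk /var/spool/torque/server_priv/jobs /root/torque-bk-backup 30
```

Moving the files rather than deleting them keeps a copy around in case one of them turns out to belong to a job that is still of interest; the server and scheduler would then be restarted as described in the message.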
From christina.salls at noaa.gov Tue Apr 24 12:31:04 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Tue, 24 Apr 2012 14:31:04 -0400 Subject: [torqueusers] assigning resources to a queue In-Reply-To: <4F96E2D0.7000201@ldeo.columbia.edu> References: <4F96D068.5050401@ldeo.columbia.edu> <4F96E2D0.7000201@ldeo.columbia.edu> Message-ID: On Tue, Apr 24, 2012 at 1:28 PM, Gus Correa wrote: > > On 04/24/2012 01:09 PM, Christina Salls wrote: > > > > > > On Tue, Apr 24, 2012 at 12:59 PM, David Beer > > > > wrote: > > > > Sorry for the confusion, I meant that you should replace > node> with the node's name. Thanks for clarifying Gus. > > > > Of course, I should have realized that! Thanks again. > > > > Now that I have node n020 assigned to the hydrology queue, will that > > make it unavailable to the batch queue? > > > > > In maui.cfg > > ENABLEMULTIREQJOBS TRUE > > See the Maui Admin Guide for the meaning: > http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php > > I am still using the built in scheduler in torque. I haven't installed maui yet. > I hope it helps, > Gus Correa -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/abe974dd/attachment-0001.html

From gus at ldeo.columbia.edu Tue Apr 24 12:56:38 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 24 Apr 2012 14:56:38 -0400
Subject: [torqueusers] assigning resources to a queue
In-Reply-To:
References: <4F96D068.5050401@ldeo.columbia.edu> <4F96E2D0.7000201@ldeo.columbia.edu>
Message-ID: <4F96F766.7050607@ldeo.columbia.edu>

On 04/24/2012 02:31 PM, Christina Salls wrote:
> On Tue, Apr 24, 2012 at 1:28 PM, Gus Correa wrote:
>
> On 04/24/2012 01:09 PM, Christina Salls wrote:
> >
> > On Tue, Apr 24, 2012 at 12:59 PM, David Beer wrote:
> >
> > Sorry for the confusion, I meant that you should replace
> > node> with the node's name. Thanks for clarifying Gus.
> >
> > Of course, I should have realized that! Thanks again.
> >
> > Now that I have node n020 assigned to the hydrology queue, will that
> > make it unavailable to the batch queue?
>
> In maui.cfg
>
> ENABLEMULTIREQJOBS TRUE
>
> See the Maui Admin Guide for the meaning:
> http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php
>
> I am still using the built in scheduler in torque. I haven't installed
> maui yet.

Not sure if this will work with pbs_sched,
but may be worth trying.

1) In server_priv/nodes add some 'properties' to your nodes,
something like this:

n001 np=8 generic
n002 np=8 generic
...
n019 np=8 generic
n020 np=8 hydro

'generic' and 'hydro' are node 'properties',
you can call them any string you want, and use as many
'properties' you want/need for each node.
http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/3.2nodeproperties.php
http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/4.1queueconfig.php#mapping
http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/2.1jobsubmission.php

2) Reconfigure slightly your queues to default
to 'generic' and 'hydro' nodes, say:

set queue batch resources_default.neednodes = generic
set queue hydrology resources_default.neednodes = hydro

[unset the previous neednodes = n020 in the hydrology queue].

I hope this helps,
Gus Correa

From arkaaloke at gmail.com Tue Apr 24 13:02:20 2012
From: arkaaloke at gmail.com (Arka Aloke Bhattacharya)
Date: Tue, 24 Apr 2012 12:02:20 -0700
Subject: [torqueusers] Is there any maui command that allows me to view possible node allocation of a job when it is queued ?
Message-ID:

Hi,

Is there any maui command/interface which allows me to view possible node
allocations of a job, when the cluster is full and the job gets queued up?
Does this possible allocation also include nodes that are currently
offline/down?

Arka.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/7ad95718/attachment.html

From christina.salls at noaa.gov Tue Apr 24 14:03:51 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Tue, 24 Apr 2012 16:03:51 -0400
Subject: [torqueusers] assigning resources to a queue
In-Reply-To: <4F96F766.7050607@ldeo.columbia.edu>
References: <4F96D068.5050401@ldeo.columbia.edu> <4F96E2D0.7000201@ldeo.columbia.edu> <4F96F766.7050607@ldeo.columbia.edu>
Message-ID:

On Tue, Apr 24, 2012 at 2:56 PM, Gus Correa wrote:
> Not sure if this will work with pbs_sched,
> but may be worth trying.
>
> 1) In server_priv/nodes add some 'properties' to your nodes,
> something like this:
>
> n001 np=8 generic
> n002 np=8 generic
> ...
> n019 np=8 generic
> n020 np=8 hydro
>
> 'generic' and 'hydro' are node 'properties',
> you can call them any string you want, and use as many
> 'properties' you want/need for each node.
>
> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/3.2nodeproperties.php
> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/4.1queueconfig.php#mapping
> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/2.1jobsubmission.php
>
> 2) Reconfigure slightly your queues to default
> to 'generic' and 'hydro' nodes, say:
>
> set queue batch resources_default.neednodes = generic
> set queue hydrology resources_default.neednodes = hydro
>
> [unset the previous neednodes = n020 in the hydrology queue].
>
> I hope this helps,
> Gus Correa

Definitely worth a try - it makes perfect sense. Thanks for the suggestion.
The only roadblock I am concerned with is the fact that something within
torque dynamically modifies the nodes file. I am hoping it will leave my
modifications intact. I will let you know what happens.

Thanks!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/12335255/attachment-0001.html

From akohlmey at cmm.chem.upenn.edu Tue Apr 24 14:23:28 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Tue, 24 Apr 2012 16:23:28 -0400
Subject: [torqueusers] assigning resources to a queue
In-Reply-To:
References: <4F96D068.5050401@ldeo.columbia.edu> <4F96E2D0.7000201@ldeo.columbia.edu> <4F96F766.7050607@ldeo.columbia.edu>
Message-ID:

On Tue, Apr 24, 2012 at 4:03 PM, Christina Salls wrote:
> Definitely worth a try - it makes perfect sense. Thanks for the suggestion.
> The only roadblock I am concerned with is the fact that something within
> torque dynamically modifies the nodes file. I am hoping it will leave my
> modifications intact. I will let you know what happens.

you can do this change dynamically from qmgr.

qmgr -c 'set node mynode properties += myprop'

it will rewrite your nodes file automatically
and you don't have to restart torque.

axel.

> Thanks!

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
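
The qmgr approach above can be scripted when many nodes need the same
property. What follows is only a sketch under assumptions: it uses the node
names n001 through n020 and the 'generic' and 'hydro' properties from this
thread, gen_node_props is a made-up helper name, and the script merely prints
qmgr directives without contacting pbs_server.

```shell
#!/bin/sh
# Sketch: print the qmgr directives that tag nodes with properties.
# Node names (n001..n020) and the property names 'generic' and 'hydro'
# come from the thread; gen_node_props is a hypothetical helper.

gen_node_props() {
    # $1 = first node number, $2 = last node number, $3 = property name
    i=$1
    while [ "$i" -le "$2" ]; do
        # %03d zero-pads the number, so 1 becomes n001
        printf 'set node n%03d properties += %s\n' "$i" "$3"
        i=$((i + 1))
    done
}

gen_node_props 1 19 generic
gen_node_props 20 20 hydro
# Queue defaults that steer jobs to nodes carrying each property
echo 'set queue batch resources_default.neednodes = generic'
echo 'set queue hydrology resources_default.neednodes = hydro'
```

Once the printed directives look right, they can be fed to qmgr on its
standard input on the server host, for example by piping the script output
into qmgr as a manager user. Review the output first, since "+=" appends to
whatever properties a node already has.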
From gus at ldeo.columbia.edu Tue Apr 24 14:48:57 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Tue, 24 Apr 2012 16:48:57 -0400
Subject: [torqueusers] assigning resources to a queue
In-Reply-To:
References: <4F96D068.5050401@ldeo.columbia.edu> <4F96E2D0.7000201@ldeo.columbia.edu> <4F96F766.7050607@ldeo.columbia.edu>
Message-ID: <4F9711B9.4050002@ldeo.columbia.edu>

On 04/24/2012 04:03 PM, Christina Salls wrote:
> Definitely worth a try - it makes perfect sense. Thanks for the
> suggestion. The only roadblock I am concerned with is the fact that
> something within torque dynamically modifies the nodes file. I am
> hoping it will leave my modifications intact. I will let you know what
> happens.
>
> Thanks!

I edit my nodes file by hand, not with qmgr,
and never had this problem.
After editing it, you may need to restart pbs_server,
and perhaps pbs_sched too.

I hope it helps,
Gus Correa

From roy.dragseth at cc.uit.no Tue Apr 24 15:14:55 2012
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Tue, 24 Apr 2012 23:14:55 +0200
Subject: [torqueusers] pbs_server crashes when deleting and adding nodes.
In-Reply-To: <1403691.J1VpU34d66@newton.cc.uit.no>
References: <1460284.hyFAUKa0Pm@newton.cc.uit.no> <1403691.J1VpU34d66@newton.cc.uit.no>
Message-ID: <3057408.slio2xVQKQ@lux>

On Tuesday 24. April 2012 15.19.06 Roy Dragseth wrote:
> Hmm, the problem seems to be solved in the 3.0.5 release candidate announced
> earlier. Any projections on the release date of 3.0.5?

Seems like I jumped the gun here. I'm able to reproduce the same problem on
torque-3.0.5-snap.201204051313

Further testing seems to indicate that it works better if one deletes all the
nodes first and then adds all the nodes afterwards instead of alternating
between deleting and adding nodes.

r.

From christina.salls at noaa.gov Tue Apr 24 15:28:54 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Tue, 24 Apr 2012 17:28:54 -0400
Subject: [torqueusers] assigning resources to a queue
In-Reply-To:
References: <4F96D068.5050401@ldeo.columbia.edu> <4F96E2D0.7000201@ldeo.columbia.edu> <4F96F766.7050607@ldeo.columbia.edu>
Message-ID:

On Tue, Apr 24, 2012 at 4:23 PM, Axel Kohlmeyer wrote:
> On Tue, Apr 24, 2012 at 4:03 PM, Christina Salls wrote:
> > Definitely worth a try - it makes perfect sense. Thanks for the suggestion.
> > The only roadblock I am concerned with is the fact that something within
> > torque dynamically modifies the nodes file. I am hoping it will leave my
> > modifications intact. I will let you know what happens.
>
> you can do this change dynamically from qmgr.
>
> qmgr -c 'set node mynode properties += myprop'
>
> it will rewrite your nodes file automatically
> and you don't have to restart torque.
>
> axel.

Ok. I did this by hand and it worked perfectly. I just tested it.
[root at wings ~]# qmgr Max open servers: 10239 Qmgr: set node n001 properties += generic Qmgr: set node n002 properties += generic Qmgr: set node n003 properties += generic Qmgr: set node n004 properties += generic Qmgr: set node n005 properties += generic Qmgr: set node n006 properties += generic Qmgr: set node n007 properties += generic Qmgr: set node n008 properties += generic Qmgr: set node n009 properties += generic Qmgr: set node n010 properties += generic Qmgr: set node n011 properties += generic Qmgr: set node n012 properties += generic Qmgr: set node n013 properties += generic Qmgr: set node n014 properties += generic Qmgr: set node n015 properties += generic Qmgr: set node n016 properties += generic Qmgr: set node n017 properties += generic Qmgr: set node n018 properties += generic Qmgr: set node n019 properties += generic Qmgr: set node n020 properties += hydro Qmgr: set queue batch resources_default.neednodes = generic Qmgr: set queue hydrology resources_default.neednodes = hydro Qmgr: print server # # Create queues and set their attributes. # # # Create and define queue hydrology # create queue hydrology set queue hydrology queue_type = Execution set queue hydrology resources_default.neednodes = hydro set queue hydrology enabled = True set queue hydrology started = True # # Create and define queue batch # create queue batch set queue batch queue_type = Execution set queue batch resources_default.neednodes = generic set queue batch enabled = True set queue batch started = True # # Set server attributes. 
# set server scheduling = True set server acl_hosts = admin.default.domain set server acl_hosts += wings.glerl.noaa.gov set server managers = root at wings.glerl.noaa.gov set server managers += root@* set server managers += salls at wings.glerl.noaa.gov set server operators = root at wings.glerl.noaa.gov set server operators += salls at wings.glerl.noaa.gov set server default_queue = batch set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 100 set server node_check_rate = 150 set server tcp_timeout = 6 set server mom_job_sync = True set server keep_completed = 0 set server auto_node_np = True set server next_job_number = 760 set server np_default = 12 set server record_job_info = True set server record_job_script = True set server job_log_keep_days = 10 Qmgr: quit [root at wings ~]# pbsnodes -a n001 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300427,varattr=,jobs=,state=free,netload=590944113,gres=,loadave=0.00,ncpus=12,physmem=20463136kb,availmem=27849964kb,totmem=28655128kb,idletime=536312,nusers=1,nsessions=1,sessions=30936,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n002 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300419,varattr=,jobs=,state=free,netload=771508829,gres=,loadave=0.03,ncpus=12,physmem=24600084kb,availmem=31908192kb,totmem=32792076kb,idletime=536263,nusers=1,nsessions=1,sessions=29658,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n003 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300399,varattr=,jobs=,state=free,netload=128461870,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31938400kb,totmem=32792076kb,idletime=536273,nusers=1,nsessions=1,sessions=29463,uname=Linux n003 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n004 state = 
free np = 12 properties = generic ntype = cluster status = rectime=1335300407,varattr=,jobs=,state=free,netload=5983889356,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31914592kb,totmem=32792076kb,idletime=536300,nusers=1,nsessions=1,sessions=29625,uname=Linux n004 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n005 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300387,varattr=,jobs=,state=free,netload=128206689,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31921432kb,totmem=32792076kb,idletime=536240,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux n005 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n006 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300404,varattr=,jobs=,state=free,netload=128036017,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31958100kb,totmem=32792076kb,idletime=536270,nusers=1,nsessions=1,sessions=29461,uname=Linux n006 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n007 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300422,varattr=,jobs=,state=free,netload=128890681,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31921664kb,totmem=32792076kb,idletime=536307,nusers=1,nsessions=1,sessions=29488,uname=Linux n007 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n008 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300426,varattr=,jobs=,state=free,netload=127780009,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31924172kb,totmem=32792076kb,idletime=536287,nusers=1,nsessions=1,sessions=29439,uname=Linux n008 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n009 state = free np = 12 properties = generic ntype = cluster status = 
rectime=1335300416,varattr=,jobs=,state=free,netload=121987956,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31923464kb,totmem=32792076kb,idletime=536300,nusers=1,nsessions=1,sessions=28949,uname=Linux n009 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n010 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300386,varattr=,jobs=,state=free,netload=121643821,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31919024kb,totmem=32792076kb,idletime=536259,nusers=1,nsessions=1,sessions=28894,uname=Linux n010 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n011 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300403,varattr=,jobs=,state=free,netload=121509658,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31918700kb,totmem=32792076kb,idletime=536265,nusers=1,nsessions=1,sessions=28907,uname=Linux n011 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n012 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300394,varattr=,jobs=,state=free,netload=58333295055,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31890172kb,totmem=32792076kb,idletime=536261,nusers=1,nsessions=1,sessions=29498,uname=Linux n012 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n013 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300428,varattr=,jobs=,state=free,netload=119704698,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31918100kb,totmem=32792076kb,idletime=536310,nusers=1,nsessions=1,sessions=28869,uname=Linux n013 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n014 state = free np = 12 properties = generic ntype = cluster status = 
rectime=1335300391,varattr=,jobs=,state=free,netload=119141203,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31921004kb,totmem=32792076kb,idletime=536248,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux n014 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n015 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300401,varattr=,jobs=,state=free,netload=118595178,gres=,loadave=0.06,ncpus=12,physmem=24600084kb,availmem=31920968kb,totmem=32792076kb,idletime=536270,nusers=1,nsessions=1,sessions=28897,uname=Linux n015 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n016 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300390,varattr=,jobs=,state=free,netload=118831227,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31926336kb,totmem=32792076kb,idletime=536271,nusers=1,nsessions=1,sessions=28918,uname=Linux n016 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n017 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300405,varattr=,jobs=,state=free,netload=118973541,gres=,loadave=0.14,ncpus=12,physmem=24600084kb,availmem=31919772kb,totmem=32792076kb,idletime=536297,nusers=1,nsessions=1,sessions=28922,uname=Linux n017 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n018 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300392,varattr=,jobs=,state=free,netload=257644933486,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31858760kb,totmem=32792076kb,idletime=536236,nusers=0,nsessions=? 15201,sessions=? 
15201,uname=Linux n018 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n019 state = free np = 12 properties = generic ntype = cluster status = rectime=1335300395,varattr=,jobs=,state=free,netload=114280868,gres=,loadave=0.00,ncpus=12,physmem=24600084kb,availmem=31924588kb,totmem=32792076kb,idletime=536265,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux n019 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n020 state = free np = 12 properties = hydro ntype = cluster status = rectime=1335300389,varattr=,jobs=,state=free,netload=58243822999,gres=,loadave=0.02,ncpus=12,physmem=24600084kb,availmem=31868648kb,totmem=32792076kb,idletime=536274,nusers=0,nsessions=? 15201,sessions=? 15201,uname=Linux n020 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 > > > > Thanks! > > -- > Dr. Axel Kohlmeyer akohlmey at gmail.com > http://sites.google.com/site/akohlmey/ > > Institute for Computational Molecular Science > Temple University, Philadelphia PA, USA. > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/3cb89c75/attachment-0001.html

From christina.salls at noaa.gov Tue Apr 24 15:30:25 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Tue, 24 Apr 2012 17:30:25 -0400
Subject: [torqueusers] assigning resources to a queue
In-Reply-To: <4F9711B9.4050002@ldeo.columbia.edu>
References: <4F96D068.5050401@ldeo.columbia.edu> <4F96E2D0.7000201@ldeo.columbia.edu> <4F96F766.7050607@ldeo.columbia.edu> <4F9711B9.4050002@ldeo.columbia.edu>
Message-ID:

On Tue, Apr 24, 2012 at 4:48 PM, Gus Correa wrote:
> I edit my nodes file by hand, not with qmgr,
> and never had this problem.
> After editing it, you may need to restart pbs_server,
> and perhaps pbs_sched too.

Thanks Gus!! You guys were all great. I really appreciate the help.

> I hope it helps,
> Gus Correa
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/b26d4759/attachment.html From gus at ldeo.columbia.edu Tue Apr 24 16:25:18 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 24 Apr 2012 18:25:18 -0400 Subject: [torqueusers] assigning resources to a queue In-Reply-To: References: <4F96D068.5050401@ldeo.columbia.edu> <4F96E2D0.7000201@ldeo.columbia.edu> <4F96F766.7050607@ldeo.columbia.edu> <4F9711B9.4050002@ldeo.columbia.edu> Message-ID: <4F97284E.4020302@ldeo.columbia.edu> On 04/24/2012 05:30 PM, Christina Salls wrote: > > > On Tue, Apr 24, 2012 at 4:48 PM, Gus Correa > wrote: > > On 04/24/2012 04:03 PM, Christina Salls wrote: > > > > > > On Tue, Apr 24, 2012 at 2:56 PM, Gus Correa > > > >> wrote: > > > > On 04/24/2012 02:31 PM, Christina Salls wrote: > > > > > > > > > On Tue, Apr 24, 2012 at 1:28 PM, Gus Correa > > > > > > > > >>> wrote: > > > > > > > > > On 04/24/2012 01:09 PM, Christina Salls wrote: > > > > > > > > > > > > On Tue, Apr 24, 2012 at 12:59 PM, David Beer > > > > > > > > > > > >> > > > > > > > > > > > >>>> wrote: > > > > > > > > Sorry for the confusion, I meant that you should replace > > > > > > node> with the node's name. Thanks for clarifying Gus. > > > > > > > > Of course, I should have realized that! Thanks again. > > > > > > > > Now that I have node n020 assigned to the hydrology queue, will > > that > > > > make it unavailable to the batch queue? > > > > > > > > > > > > > In maui.cfg > > > > > > ENABLEMULTIREQJOBS TRUE > > > > > > See the Maui Admin Guide for the meaning: > > > > > > http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php > > > > > > I am still using the built in scheduler in torque. I haven't > > installed > > > maui yet. > > > > > > > Not sure if this will work with pbs_sched, > > but may be worth trying. > > > > 1) In server_priv/nodes add some 'properties' to your nodes, > > something like this: > > > > n001 np=8 generic > > n002 np=8 generic > > ... 
> > n019 np=8 generic
> > n020 np=8 hydro
> >
> > 'generic' and 'hydro' are node 'properties',
> > you can call them any string you want, and use as many
> > 'properties' you want/need for each node.
> >
> > > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/3.2nodeproperties.php
> > > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/4.1queueconfig.php#mapping
> > > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/2.1jobsubmission.php
> >
> > 2) Reconfigure slightly your queues to default
> > to 'generic' and 'hydro' nodes, say:
> >
> > set queue batch resources_default.neednodes = generic
> > set queue hydrology resources_default.neednodes = hydro
> >
> > [unset the previous neednodes = 020 in the hydrology queue].
> >
> > I hope this helps,
> > Gus Correa
> >
> >
> > Definitely worth a try - it makes perfect sense. Thanks for the
> > suggestion. The only roadblock I am concerned with is the fact that
> > something within torque dynamically modifies the nodes file. I am
> > hoping it will leave my modifications intact. I will let you know what
> > happens.
> >
> > Thanks!
> >
>
> I edit my nodes file by hand, not with qmgr,
> and never had this problem.
> After editing it, you may need to restart pbs_server,
> and perhaps pbs_sched too.
>
> Thanks Gus!!

Hi Christina

Glad that it worked.
As Axel pointed out, if you want to change the nodes file dynamically,
qmgr is the way to do it.

> You guys were all great. I really appreciate the help.
>

Hmmm ...
If your users learn about the node properties
that you use to assign different node groups
to the hydrology and batch queues,
some creative fellow may try something like this:

#PBS -q batch
#PBS -l nodes=19:generic+1:hydro
...

which may work. :)

Anyway, you can teach the users a standard style:

#PBS -q batch
#PBS -l nodes=N [or perhaps nodes=N:ppn=P]
#PBS -N myjob
...

or

#PBS -q hydrology
#PBS -l nodes=1
#PBS -N myjob
...
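
The property-based layout quoted above could be generated with a small shell sketch like the one below. It is only an illustration: the node names n001 to n020, np=8, and the 'generic' and 'hydro' properties come from the example in this thread, while the helper name gen_nodes is made up here. The qmgr lines are left as comments because they change live server settings.

```shell
#!/bin/sh
# Sketch only: prints a server_priv/nodes layout like the one in this thread.
# gen_nodes is a hypothetical helper name; n001..n020, np=8 and the
# property names 'generic' and 'hydro' are taken from the example above.
gen_nodes() {
    i=1
    while [ "$i" -le 20 ]; do
        # Only the last node gets the 'hydro' property.
        if [ "$i" -eq 20 ]; then prop=hydro; else prop=generic; fi
        printf 'n%03d np=8 %s\n' "$i" "$prop"
        i=$((i + 1))
    done
}

# Print the list for review; nothing is written to the real nodes file.
gen_nodes

# Queue defaults as quoted in the thread, to be run by hand by a manager:
#   qmgr -c 'set queue batch resources_default.neednodes = generic'
#   qmgr -c 'set queue hydrology resources_default.neednodes = hydro'
```

It is probably wise to compare the printed list with the existing server_priv/nodes before replacing anything. The pbsnodes output earlier in this thread reports np = 12 on these nodes, so the np=8 from the example may need adjusting.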
Gus Correa

> I hope it helps,
> Gus Correa
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From rhys.hill at adelaide.edu.au Tue Apr 24 17:33:35 2012
From: rhys.hill at adelaide.edu.au (Rhys Hill)
Date: Tue, 24 Apr 2012 23:33:35 +0000
Subject: [torqueusers] Torque 4.0 and job arrays
In-Reply-To:
References: <127C9691-E776-409C-8305-E5EDF8B47440@adelaide.edu.au>,
Message-ID:

Hi David,

I'm not sure - the user who was having trouble hasn't yet tried again. I'll put a note in bugzilla either way when we've tried again - I've been more focussed on getting our normal jobs working!

With the changes I suggested in bugzilla, 4.0.1 is working well for me, except that most or all job arrays aren't being cleaned up. It seems like there must be some code somewhere that looks for all the jobs in an array to have finished, then cleans up the array structures themselves. I've had a look, but cannot find where this should happen. Can you tell me where that is? If I can fix this issue, then I think 4.0.1 will be back to the same level of reliability as 2.5.9 for us (with more reliable cpusets as well!)

Cheers,

Rhys

----

Senior Research Associate,
Australian Centre for Visual Technologies

On 25/04/2012, at 1:16 AM, "David Beer" wrote:

Rhys,

Just to confirm - that patch fixed your problem? If so I will see that it gets checked in. We will look at these other bugzilla issues that you created. Thanks for taking the time to report them and in some cases offer solutions. We really appreciate the effort to help make TORQUE better.

David

On Tue, Apr 24, 2012 at 12:23 AM, Rhys Hill wrote:

Hi David,

Thanks for that. I've just found and fixed some other bugs which I've added to bugzilla. The one issue that remains is odd. It seems that we have a situation where an array is stuck, when all of its component jobs are finished.
For instance, qstat -f says this: Job Id: 678[].moby.cs.adelaide.edu.au Job_Name = YZ_Oxford_group Job_Owner = yanzhichen at moby.cs.adelaide.edu.au job_state = Q queue = batch server = moby.cs.adelaide.edu.au Checkpoint = u ctime = Tue Apr 24 09:26:10 2012 Error_Path = moby.cs.adelaide.edu.au:/home/yanzhichen/moby/oxbuilding_voca bulary/out.e.txt Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = a mtime = Tue Apr 24 09:26:10 2012 Output_Path = moby.cs.adelaide.edu.au:/home/yanzhichen/moby/oxbuilding_voc abulary/out.o.txt Priority = 0 qtime = Tue Apr 24 09:26:10 2012 Rerunable = True Resource_List.mem = 5gb Resource_List.nodect = 1 Resource_List.nodes = 1:ppn=1 Resource_List.pmem = 5gb Resource_List.pvmem = 8gb Resource_List.walltime = 48:00:00 etime = Tue Apr 24 09:26:10 2012 submit_args = -t 2-11 ./job_dogroup job_array_request = 2-11 fault_tolerant = False job_radix = 0 submit_host = moby.cs.adelaide.edu.au init_work_dir = /home/yanzhichen/moby/oxbuilding_vocabulary whereas qstat -ft has no mention of 678[x] at all. qdel and qdel -p have no effect on jobs like these. I think I've submitted a fix for the problem that leads to the job getting into this state, but it would be handy if qdel could remove it. Thanks, On 24/04/2012, at 2:52 AM, David Beer wrote: > Rhys, > > I'm surprised that you say you haven't seen this message before, as the check exists in both places and has been there since 2.5 was released. There must've been a bug that allowed it before. In this case, please try the attached patch to see if it resolves your problem for 4.0. This patch only requires you to rebuild and restart the server (dependencies are unknown to pbs_moms). > > David > > On Sun, Apr 22, 2012 at 9:31 PM, Rhys Hill > wrote: > Hi everyone, > > I recently upgraded to torque 4.0 alongside moab 7.0, mostly because we'd been > having some trouble with cpusets and I'd hoped that the support for hwloc would > resolve the problem. 
cpusets are now working very well, but I'm having a lot of > trouble with job arrays, which form a very large part of our workload. > > Torque 4.0.0 would regularly lock-up when processing job arrays, so I upgraded to > the most recent 4.0.1 snapshot, and that behaves much better, but still seems > unstable compared to 2.5.9. > > One concrete issue is that many of our jobs that worked fine with 2.5.9 are now > stalling with 4.0.1 with the following message: > > "Arrays may only be given array dependencies" > > which only seems to appear in the server logs and is otherwise invisible. This > was certainly never true before, and doesn't really make sense. We frequently > use array->single job dependencies for scatter-gather type operations. > > Once the above message has been printed, the job arrays sit in a hold state forever. > They can't be removed using qdel and if I try to break the hold using qrls or > mjobctl, they move into the queued state, but they disappear from moab and never > actually start, and still can't be removed. The only way I can get rid of them > is to bring down pbs_server, which has to killed via `killall -QUIT pbs_server` > since the init script cannot stop the process properly, and delete the job > files manually. > > I'm currently thinking of just reverting to the old, working version of torque, > but has anyone else had trouble with job arrays and can the above problems be > fixed somehow? 
> > Thanks, > > -------------------------------------------------------------------------------- > Rhys Hill, Senior Research Associate > Australian Centre for Visual Technologies University of Adelaide > > Phone: +61 8 8313 6197 Mail: > Fax: +61 8 8313 4366 School of Computer Science > University of Adelaide > Adelaide, Australia > http://www.cs.adelaide.edu.au/~rhys/ 5005 > -------------------------------------------------------------------------------- > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > -- > David Beer | Software Engineer > Adaptive Computing > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------------------------------------------------------------------------- Rhys Hill, Senior Research Associate Australian Centre for Visual Technologies University of Adelaide Phone: +61 8 8313 6197 Mail: Fax: +61 8 8313 4366 School of Computer Science University of Adelaide Adelaide, Australia http://www.cs.adelaide.edu.au/~rhys/ 5005 -------------------------------------------------------------------------------- _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Software Engineer Adaptive Computing _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/3db9f18f/attachment-0001.html
From dbeer at adaptivecomputing.com Tue Apr 24 18:03:35 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 24 Apr 2012 18:03:35 -0600
Subject: [torqueusers] Torque 4.0 and job arrays
In-Reply-To:
References: <127C9691-E776-409C-8305-E5EDF8B47440@adelaide.edu.au>
Message-ID:

Rhys,

One such place is job_purge in src/server/job_func.c: if all of the jobs have been purged, the array is then also purged. If you search the code for the places that call array_delete, then you'll see all of the conditions under which it is called. Most of them are error conditions, but I figure you might want to check them all.

David

On Tue, Apr 24, 2012 at 5:33 PM, Rhys Hill wrote:

> Hi David,
>
> I'm not sure - the user who was having trouble hasn't yet tried again.
> I'll put a note in bugzilla either way when we've tried again - I've been
> more focussed on getting our normal jobs working!
>
> With the changes I suggested in bugzilla, 4.0.1 is working well for me,
> except that most or all job arrays aren't being cleaned up. It seems like
> there must be some code somewhere that looks for all the jobs in an array
> to have finished, then cleans up the array structures themselves. I've had
> a look, but cannot find where this should happen. Can you tell me where
> that is? If I can fix this issue, then I think 4.0.1 will be back to the
> same level of reliability as 2.5.9 for us (with more reliable cpusets as
> well!)
>
> Cheers, Rhys
>
> ----
>
> Senior Research Associate,
> Australian Centre for Visual Technologies
>
> On 25/04/2012, at 1:16 AM, "David Beer" wrote:
>
> Rhys,
>
> Just to confirm - that patch fixed your problem? If so I will see that
> it gets checked in. We will look at these other bugzilla issues that you
> created. Thanks for taking the time to report them and in some cases offer
> solutions.
We really appreciate the effort to help make TORQUE better. > > David > > On Tue, Apr 24, 2012 at 12:23 AM, Rhys Hill wrote: > >> Hi David, >> >> Thanks for that. I've just found and fixed some other bugs which I've >> added to >> bugzilla. The one issue that remains is odd. It seems that we have a >> situation >> where an array is stuck, when all of it's component jobs are finished. >> >> For instance, qstat -f says this: >> >> Job Id: 678[].moby.cs.adelaide.edu.au >> Job_Name = YZ_Oxford_group >> Job_Owner = yanzhichen at moby.cs.adelaide.edu.au >> job_state = Q >> queue = batch >> server = moby.cs.adelaide.edu.au >> Checkpoint = u >> ctime = Tue Apr 24 09:26:10 2012 >> Error_Path = moby.cs.adelaide.edu.au: >> /home/yanzhichen/moby/oxbuilding_voca >> bulary/out.e.txt >> Hold_Types = n >> Join_Path = n >> Keep_Files = n >> Mail_Points = a >> mtime = Tue Apr 24 09:26:10 2012 >> Output_Path = moby.cs.adelaide.edu.au: >> /home/yanzhichen/moby/oxbuilding_voc >> abulary/out.o.txt >> Priority = 0 >> qtime = Tue Apr 24 09:26:10 2012 >> Rerunable = True >> Resource_List.mem = 5gb >> Resource_List.nodect = 1 >> Resource_List.nodes = 1:ppn=1 >> Resource_List.pmem = 5gb >> Resource_List.pvmem = 8gb >> Resource_List.walltime = 48:00:00 >> etime = Tue Apr 24 09:26:10 2012 >> submit_args = -t 2-11 ./job_dogroup >> job_array_request = 2-11 >> fault_tolerant = False >> job_radix = 0 >> submit_host = moby.cs.adelaide.edu.au >> init_work_dir = /home/yanzhichen/moby/oxbuilding_vocabulary >> >> whereas qstat -ft has no mention of 678[x] at all. qdel and qdel -p have >> no effect >> on jobs like these. I think I've submitted a fix for the problem that >> leads to the >> job getting into this state, but it would be handy if qdel could remove >> it. >> >> Thanks, >> >> On 24/04/2012, at 2:52 AM, David Beer wrote: >> >> > Rhys, >> > >> > I'm surprised that you say you haven't seen this message before, as the >> check exists in both places and has been there since 2.5 was released. 
>> There must've been a bug that allowed it before. In this case, please try >> the attached patch to see if it resolves your problem for 4.0. This patch >> only requires you to rebuild and restart the server (dependencies are >> unknown to pbs_moms). >> > >> > David >> > >> > On Sun, Apr 22, 2012 at 9:31 PM, Rhys Hill >> wrote: >> > Hi everyone, >> > >> > I recently upgraded to torque 4.0 alongside moab 7.0, mostly because >> we'd been >> > having some trouble with cpusets and I'd hoped that the support for >> hwloc would >> > resolve the problem. cpusets are now working very well, but I'm having >> a lot of >> > trouble with job arrays, which form a very large part of our workload. >> > >> > Torque 4.0.0 would regularly lock-up when processing job arrays, so I >> upgraded to >> > the most recent 4.0.1 snapshot, and that behaves much better, but still >> seems >> > unstable compared to 2.5.9. >> > >> > One concrete issue is that many of our jobs that worked fine with 2.5.9 >> are now >> > stalling with 4.0.1 with the following message: >> > >> > "Arrays may only be given array dependencies" >> > >> > which only seems to appear in the server logs and is otherwise >> invisible. This >> > was certainly never true before, and doesn't really make sense. We >> frequently >> > use array->single job dependencies for scatter-gather type operations. >> > >> > Once the above message has been printed, the job arrays sit in a hold >> state forever. >> > They can't be removed using qdel and if I try to break the hold using >> qrls or >> > mjobctl, they move into the queued state, but they disappear from moab >> and never >> > actually start, and still can't be removed. The only way I can get rid >> of them >> > is to bring down pbs_server, which has to killed via `killall -QUIT >> pbs_server` >> > since the init script cannot stop the process properly, and delete the >> job >> > files manually. 
>> > >> > I'm currently thinking of just reverting to the old, working version of >> torque, >> > but has anyone else had trouble with job arrays and can the above >> problems be >> > fixed somehow? >> > >> > Thanks, >> > >> > >> -------------------------------------------------------------------------------- >> > Rhys Hill, Senior Research >> Associate >> > Australian Centre for Visual Technologies University of >> Adelaide >> > >> > Phone: +61 8 8313 6197 Mail: >> > Fax: +61 8 8313 4366 School of Computer >> Science >> > University of Adelaide >> > Adelaide, Australia >> > http://www.cs.adelaide.edu.au/~rhys/ 5005 >> > >> -------------------------------------------------------------------------------- >> > >> > _______________________________________________ >> > torqueusers mailing list >> > torqueusers at supercluster.org >> > http://www.supercluster.org/mailman/listinfo/torqueusers >> > >> > >> > >> > -- >> > David Beer | Software Engineer >> > Adaptive Computing >> > >> > _______________________________________________ >> > torqueusers mailing list >> > torqueusers at supercluster.org >> > http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> -------------------------------------------------------------------------------- >> Rhys Hill, Senior Research >> Associate >> Australian Centre for Visual Technologies University of >> Adelaide >> >> Phone: +61 8 8313 6197 Mail: >> Fax: +61 8 8313 4366 School of Computer >> Science >> University of Adelaide >> Adelaide, Australia >> http://www.cs.adelaide.edu.au/~rhys/ 5005 >> >> -------------------------------------------------------------------------------- >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > > > -- > David Beer | Software Engineer > Adaptive Computing > > _______________________________________________ > torqueusers mailing list > torqueusers at 
supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120424/5b48ce66/attachment.html

From chindw at wfu.edu Fri Apr 27 13:09:01 2012
From: chindw at wfu.edu (David Chin)
Date: Fri, 27 Apr 2012 15:09:01 -0400
Subject: [torqueusers] openmpi, pbsdsh, and iptables
Message-ID:

I am having some trouble getting openmpi-1.5.5 working with Torque
integration, and I have narrowed down the issue to firewall rules.
However, I am not sure which ports to open up.

I compiled openmpi-1.5.5 on RHEL6 against the libtorque that EPEL
provides -- version 2.5.7-9.el6.

I found that openmpi jobs cannot communicate between nodes if iptables
is running. I have also found, by starting an interactive job, that
pbsdsh cannot connect to other nodes allocated for the job. Once I
turn the firewall off on compute nodes, openmpi jobs run fine.

Our firewall rules *do* allow inter-node ssh, and I have verified that
this is so.

What port range can I open to allow pbsdsh to work?

Thanks in advance,
Dave

--
David Chin, Ph.D. chindw at wfu.edu      High Performance Computing Systems Analyst
Office: +1.336.758.2964                  Wake Forest University
Mobile: +1.336.608.0793                  Winston-Salem, NC
Email-to-txt: 3366080793 at mms.att.net  Google Talk: chindw at wfu.edu
Web: http://www.wfu.edu/~chindw          https://plus.google.com/108169173177119739731/about

From kenneth at sdsc.edu Fri Apr 27 15:28:14 2012
From: kenneth at sdsc.edu (Kenneth Yoshimoto)
Date: Fri, 27 Apr 2012 14:28:14 -0700 (PDT)
Subject: [torqueusers] interactive qsub failure
Message-ID:

I'm seeing an intermittent failure with qsub -I

The message in /var/log/messages is:
Apr 27 14:07:27 gcn-17-71 pbs_mom: LOG_ERROR::Operation now in progress (115) in TMomFinalizeChild, cannot open interactive qsub socket to host gordon-ln4.local:50620 - 'cannot connect to port 1023 in client_to_svr - connection refused' - check routing tables/multi-homed host issues

I think my routing is okay, as I can telnet to the login node
port from the compute node. I also see some packet exchange to
the port with tcpdump. Could the mom be attempting the connection
before qsub starts listening? I would have thought qsub would
start listening before sending the job to pbs_server. Any ideas
on what might cause this?

Thanks,
Kenneth

From gabe at msi.umn.edu Fri Apr 27 15:55:43 2012
From: gabe at msi.umn.edu (Gabe Turner)
Date: Fri, 27 Apr 2012 16:55:43 -0500
Subject: [torqueusers] interactive qsub failure
In-Reply-To:
References:
Message-ID: <20120427215543.GQ3577@blackice.msi.umn.edu>

On Fri, Apr 27, 2012 at 02:28:14PM -0700, Kenneth Yoshimoto wrote:
>
> I'm seeing an intermittent failure with qsub -I
>
> The message in /var/log/messages is:
> Apr 27 14:07:27 gcn-17-71 pbs_mom: LOG_ERROR::Operation now in progress (115) in TMomFinalizeChild, cannot open interactive qsub socket to host gordon-ln4.local:50620 - 'cannot connect to port 1023 in client_to_svr - connection refused' - check routing tables/multi-homed host issues
>
> I think my routing is okay, as I can telnet to the login node
> port from the compute node. I also see some packet exchange to
> the port with tcpdump. Could the mom be attempting the connection
> before qsub starts listening? I would have thought qsub would
> start listening before sending the job to pbs_server. Any ideas
> on what might cause this?

In order for an interactive session to work, the compute node needs to
make a connection back to the submission host, so you'll want to make sure
that your firewall rules allow that.

--
Gabe Turner gabe at msi.umn.edu
HPC Systems Administrator,
University of Minnesota
Supercomputing Institute http://www.msi.umn.edu

From mej at lbl.gov Fri Apr 27 17:01:50 2012
From: mej at lbl.gov (Michael Jennings)
Date: Fri, 27 Apr 2012 16:01:50 -0700
Subject: [torqueusers] interactive qsub failure
In-Reply-To:
References:
Message-ID: <20120427230149.GX9750@lbl.gov>

On Friday, 27 April 2012, at 14:28:14 (-0700),
Kenneth Yoshimoto wrote:

>
> I'm seeing an intermittent failure with qsub -I
>
> The message in /var/log/messages is:
> Apr 27 14:07:27 gcn-17-71 pbs_mom: LOG_ERROR::Operation now in progress (115) in TMomFinalizeChild, cannot open interactive qsub socket to host gordon-ln4.local:50620 - 'cannot connect to port 1023 in client_to_svr - connection refused' - check routing tables/multi-homed host issues
>
> I think my routing is okay, as I can telnet to the login node
> port from the compute node. I also see some packet exchange to
> the port with tcpdump. Could the mom be attempting the connection
> before qsub starts listening? I would have thought qsub would
> start listening before sending the job to pbs_server. Any ideas
> on what might cause this?

Are you by any chance seeing a SYN, a SYN/ACK, and a RST? If so, try
setting $max_conn_timeout_micro_sec to 500000 in your pbs_mom config
and see if that helps.
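
A minimal sketch of that change, for anyone who wants to try it: the helper below appends the setting to a pbs_mom config file unless a line for it is already present. The function name set_conn_timeout and the default path in MOM_CONFIG are assumptions made for illustration (the path is guessed from the /var/spool/torque prefix that shows up elsewhere on the list); only the parameter name and the value 500000 come from the message above.

```shell
#!/bin/sh
# Sketch only. MOM_CONFIG is an assumed location; adjust to your install.
MOM_CONFIG="${MOM_CONFIG:-/var/spool/torque/mom_priv/config}"

# set_conn_timeout FILE
# Hypothetical helper: appends the timeout line suggested above unless
# FILE already has a $max_conn_timeout_micro_sec entry.
set_conn_timeout() {
    cfg=$1
    if grep -q '^\$max_conn_timeout_micro_sec' "$cfg" 2>/dev/null; then
        echo "max_conn_timeout_micro_sec already set in $cfg" >&2
    else
        printf '%s\n' '$max_conn_timeout_micro_sec 500000' >> "$cfg"
    fi
}

# Not called automatically. To apply it, run as root on a compute node:
#   set_conn_timeout "$MOM_CONFIG"
# and then restart pbs_mom so it rereads its config.
```

The helper never edits an existing entry, so a node that already carries a different value is left alone and has to be changed by hand.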
Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From stevenx.a.duchene at intel.com Fri Apr 27 21:21:05 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Sat, 28 Apr 2012 03:21:05 +0000
Subject: [torqueusers] pbs_mom dies on exit of interactive session
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com>

I am running torque-4.0.1 that I pulled from the svn 4.0.1 branch just today.
Earlier today I was running the 4.0-fixes tree from 04/03 and I had the same results.
I was hoping the update to current sources would fix these problems but no such luck.

If I run the following:

qsub -I -l nodes=7 -l arch=atomN570

from my pbs job submission host I get:

qsub: waiting for job 4.login2.sep.here to start
qsub: job 4.login2.sep.here ready

and then I get a shell prompt on the node 0 of this job.

If I then do:

$ echo $PBS_NODEFILE
/var/spool/torque/aux//4.login2.sep.here

And then:

$ cat /var/spool/torque/aux//4.login2.sep.here
atom255
atom255
atom255
atom255
atom254
atom254
atom254

and then I try:

$ pbsdsh -h atom254 ls /tmp
pbsdsh: error from tm_poll() 17002

Alternatively if I use the -v option it says:

$ pbsdsh -v -h atom254 /bin/ls /tmp
pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)

Then when I exit the shell I get back to my job submission node but when I then check for the pbs_mom process on that node 0 for the job that I just had the shell prompt on, the pbs_mom process is no longer running and the status message from the init script says:

pbs_mom dead but subsys locked

Also if I run any mpi jobs through torque that are run on cpu cores on one system, they seem to run just fine but for mpi jobs where the job spans across separate nodes I get the following error out each time:

$ cat script_viking_7nodes.pbs.e12
[viking12.sep.here:12360] plm:tm: failed to poll for a spawned daemon, return status = 17002
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        viking11.sep.here - daemon did not report back when launched

The PBS job script for this MPI job is very simple:

#PBS -l nodes=7:Viking:ppn=1
#PBS -l arch=xeon
mpirun --machinefile $PBS_NODEFILE /home/sadX/mpi_test/mpi_hello_hostname > hello_out_viking7nodes

Also if I ssh to any node in my cluster and create a mpi nodes file by hand and then use that to run like this:

mpirun --machinefile ../nodefile ./mpi_hello_hostname

I get back all of the expected output from all of the nodes listed in the manually created nodes file.

--
Steven DuChene
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120428/bd052e89/attachment.html
From knielson at adaptivecomputing.com Fri Apr 27 22:23:09 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 27 Apr 2012 22:23:09 -0600
Subject: [torqueusers] pbs_mom dies on exit of interactive session
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com>
Message-ID:

On Fri, Apr 27, 2012 at 9:21 PM, DuChene, StevenX A
<stevenx.a.duchene at intel.com> wrote:

> I am running torque-4.0.1 that I pulled from the svn 4.0.1 branch just
> today.
>
> Earlier today I was running the 4.0-fixes tree from 04/03 and I had the
> same results.
>
> I was hoping the update to current sources would fix these problems but no
> such luck.
>
> If I run the following:
>
> qsub -I -l nodes=7 -l arch=atomN570
>
> from my pbs job submission host I get:
>
> qsub: waiting for job 4.login2.sep.here to start
> qsub: job 4.login2.sep.here ready
>
> and then I get a shell prompt on the node 0 of this job.
>
> If I then do:
>
> $ echo $PBS_NODEFILE
> /var/spool/torque/aux//4.login2.sep.here
>
> And then:
>
> $ cat /var/spool/torque/aux//4.login2.sep.here
> atom255
> atom255
> atom255
> atom255
> atom254
> atom254
> atom254
>
> and then I try:
>
> $ pbsdsh -h atom254 ls /tmp
> pbsdsh: error from tm_poll() 17002
>
> Alternatively if I use the -v option it says:
>
> $ pbsdsh -v -h atom254 /bin/ls /tmp
> pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)
>

Steve,

I am able to reproduce the SIGABRT on the MOM. We will get this fixed.

Thanks for the help.
Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120427/416cd587/attachment-0001.html From stevenx.a.duchene at intel.com Fri Apr 27 22:29:17 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Sat, 28 Apr 2012 04:29:17 +0000 Subject: [torqueusers] pbs_mom dies on exit of interactive session In-Reply-To: References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF806059D82@ORSMSX106.amr.corp.intel.com> I don't suppose you have any idea why I am having tm connect problems in general do you? Or any ideas about what I could look at? -- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Friday, April 27, 2012 9:23 PM To: Torque Users Mailing List Cc: Brady Kimball; David Hill; Ryan Chabot Subject: Re: [torqueusers] pbs_mom dies on exit of interactive session On Fri, Apr 27, 2012 at 9:21 PM, DuChene, StevenX A > wrote: I am running torque-4.0.1 that I pulled from the svn 4.0.1 branch just today. Earlier today I was running the 4.0-fixes tree from 04/03 and I had the same results. I was hoping the update to current sources would fix these problems but no such luck. If I run the following: qsub -I -l nodes=7 -l arch=atomN570 from my pbs job submission host I get: qsub: waiting for job 4.login2.sep.here to start qsub: job 4.login2.sep.here ready and then I get a shell prompt on the node 0 of this job. 
If I then do: $ echo $PBS_NODEFILE /var/spool/torque/aux//4.login2.sep.here And then: $ cat /var/spool/torque/aux//4.login2.sep.here atom255 atom255 atom255 atom255 atom254 atom254 atom254 and then I try: $ pbsdsh -h atom254 ls /tmp pbsdsh: error from tm_poll() 17002 Alternatively if I use the -v option it says: $ pbsdsh -v -h atom254 /bin/ls /tmp pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000) Steve, I am able to reproduce the SIGABRT on the MOM. We will get this fixed. Thanks for the help. Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120428/f749a065/attachment.html From roy.dragseth at cc.uit.no Sun Apr 29 11:03:14 2012 From: roy.dragseth at cc.uit.no (Roy Dragseth) Date: Sun, 29 Apr 2012 19:03:14 +0200 Subject: [torqueusers] pbs_mom dies on exit of interactive session In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> Message-ID: <1509369.qy0g8YhCvS@lux> I can confirm this too (reported the same problem to the mailing list a month ago, but nobody seemed to care). Torque 3.0.4 works fine with exactly the same system config. r. On Saturday 28. April 2012 03.21.05 DuChene, StevenX A wrote: I am running torque-4.0.1 that I pulled from the svn 4.0.1 branch just today. Earlier today I was running the 4.0-fixes tree from 04/03 and I had the same results. I was hoping the update to current sources would fix these problems but no such luck. If I run the following: qsub -I -l nodes=7 -l arch=atomN570 from my pbs job submission host I get: qsub: waiting for job 4.login2.sep.here to start qsub: job 4.login2.sep.here ready and then I get a shell prompt on the node 0 of this job. 
If I then do: $ echo $PBS_NODEFILE /var/spool/torque/aux//4.login2.sep.here And then: $ cat /var/spool/torque/aux//4.login2.sep.here atom255 atom255 atom255 atom255 atom254 atom254 atom254 and then I try: $ pbsdsh -h atom254 ls /tmp pbsdsh: error from tm_poll() 17002 Alternatively if I use the -v option it says: $ pbsdsh -v -h atom254 /bin/ls /tmp pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000) Then when I exit the shell I get back to my job submission node but when I then check for the pbs_mom process on that node 0 for the job that I just had the shell prompt on, the pbs_mom process is no longer running and the status message from the init script says: pbs_mom dead but subsys locked Also if I run any mpi jobs through torque that are run on cpu cores on one system, they seem to run just fine but for mpi jobs where the job spans across separate nodes I get the following error out each time: $ cat script_viking_7nodes.pbs.e12 [viking12.sep.here:12360] plm:tm: failed to poll for a spawned daemon, return status = 17002 -------------------------------------------------------------------------- A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting. There may be more information reported by the environment (see above). This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes. -------------------------------------------------------------------------- -------------------------------------------------------------------------- mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
-------------------------------------------------------------------------- -------------------------------------------------------------------------- mpirun was unable to cleanly terminate the daemons on the nodes shown below. Additional manual cleanup may be required - please refer to the "orte-clean" tool for assistance. -------------------------------------------------------------------------- viking11.sep.here - daemon did not report back when launched The PBS job script for this MPI job is very simple: #PBS -l nodes=7:Viking:ppn=1 #PBS -l arch=xeon mpirun --machinefile $PBS_NODEFILE /home/sadX/mpi_test/mpi_hello_hostname > hello_out_viking7nodes Also if I ssh to any node in my cluster and create a mpi nodes file by hand and then use that to run like this: mpirun --machinefile ../nodefile ./mpi_hello_hostname I get back all of the expected output from all of the nodes listed in the manually created nodes file. -- Steven DuChene -- The Computer Center, University of Tromsø, N-9037 TROMSØ Norway. phone:+47 77 64 41 07, fax:+47 77 64 41 00 Roy Dragseth, Team Leader, High Performance Computing Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120429/4b9051a3/attachment-0001.html From sebelk at gmail.com Sun Apr 29 22:26:55 2012 From: sebelk at gmail.com (sebelk at gmail.com) Date: Mon, 30 Apr 2012 00:26:55 -0400 Subject: [torqueusers] hi List Message-ID: <4f9e6869.c585cc0a.6116.fffff91f@mx.google.com> hey List from now on ill never do anything other than this my only advice is get going on this as soon as possible http://t.co/SChsh7eC From ir_m_m_a_p at yahoo.com Mon Apr 30 06:00:58 2012 From: ir_m_m_a_p at yahoo.com (meysam miralipoor) Date: Mon, 30 Apr 2012 05:00:58 -0700 (PDT) Subject: [torqueusers] BLCR and Torque Message-ID: <1335787258.52756.YahooMailNeo@web160105.mail.bf1.yahoo.com> Hi Although integrating BLCR and Torque already documented but i didn't find a reasonable solution for my problems with check pointing. when i try to check point a job by qchkpt it seems all things are good but no checkpoint file created. I find following error in /var/log/messages Apr 30 11:01:10 node1 checkpoint_script: Invoked: /var/spool/torque/mom_priv/checkpoint_script 7366 1.server root root /var/spool/torque/checkpoint/1.server.CK ckpt.1.server.1335798070 0 - Apr 30 11:01:10 node1 kernel: blcr: Retry request on -CR_ENOSUPPORT Apr 30 11:01:10 node1 checkpoint_script: Subcommand (cr_checkpoint --tree 7366 --file ckpt.1.server.1335798070) failed with rc=52:#012- Retry request on -CR_ENOSUPPORT#012Checkpoint failed: support missing from application Apr 30 11:01:10 node1 pbs_mom: LOG_ERROR::blcr_checkpoint_job, checkpoint script returned value 52 And here i provide additional information **************PBS Script******************* #!/bin/sh # Beginning of PBS batch script. #PBS -l nodes=1:ppn=4 ##PBS -j oe #PBS -o /share/output$JOB_ID.log #PBS -e /share/error$JOB_ID.log #PBS -N NOTMPI #PBS -q batch export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/lib64 /share/ex2 # End of PBS batch script. 
**************status of relevant processes on node1****************** #ps -A|grep 7366 7366 ? 00:00:00 bash #ps -A|grep ex2 7368 ? 00:48:07 ex2 *********************************************************** It may be useful to know that I can checkpoint the running ex2 (7368) process using cr_checkpoint, but checkpointing the bash (7366) process returns the same error message. Any help is appreciated Meysam miralipoor at ipm.ir -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120430/cf186098/attachment.html From knielson at adaptivecomputing.com Mon Apr 30 09:06:24 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 30 Apr 2012 09:06:24 -0600 Subject: [torqueusers] openmpi, pbsdsh, and iptables In-Reply-To: References: Message-ID: When pbsdsh or an MPI job runs, the Mother Superior opens two dynamic ports for all of the sister nodes to connect to and send their stdout and stderr information. The sister nodes use dynamic ports as well. Ken On Fri, Apr 27, 2012 at 1:09 PM, David Chin wrote: > I am having some trouble getting openmpi-1.5.5 working with Torque > integration, and I have narrowed down the issue to firewall rules. > > However, I am not sure which ports to open up. > > I compiled openmpi-1.5.5 on RHEL6 against the libtorque that > EPEL provides -- version 2.5.7-9.el6. I found that openmpi jobs > cannot communicate between nodes if iptables is running. I have > also found, by starting an interactive job, that pbsdsh cannot > connect to other nodes allocated for the job. Once I turn the > firewall off on compute nodes openmpi jobs run fine. > > Our firewall rules *do* allow inter-node ssh, and I have verified > that this is so. > > What port range can I open to allow pbsdsh to work? > > Thanks in advance, > Dave > > -- > David Chin, Ph.D.
> chindw at wfu.edu High Performance Computing Systems Analyst > Office: +1.336.758.2964 Wake Forest University > Mobile: +1.336.608.0793 Winston-Salem, NC > Email-to-txt: 3366080793 at mms.att.net Google Talk: chindw at wfu.edu > Web: http://www.wfu.edu/~chindw > https://plus.google.com/108169173177119739731/about > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120430/f153fddf/attachment.html From dbeer at adaptivecomputing.com Mon Apr 30 09:59:39 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 30 Apr 2012 09:59:39 -0600 Subject: [torqueusers] pbs_mom dies on exit of interactive session In-Reply-To: <1509369.qy0g8YhCvS@lux> References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> <1509369.qy0g8YhCvS@lux> Message-ID: Roy, I'm sorry your message wasn't attended to - its the nature of a mailing list that sometimes messages are lost. Can you (or Steve) describe the TM connect issues in a bit more detail? We should get a bug report filed for this. David On Sun, Apr 29, 2012 at 11:03 AM, Roy Dragseth wrote: > ** > > I can confirm this too (reported the same problem to the mailing list a > month ago, but nobody seemed to care). Torque 3.0.4 works fine with exactly > the same system config. > > > > r. > > > > > > On Saturday 28. April 2012 03.21.05 DuChene, StevenX A wrote: > > I am running torque-4.0.1 that I pulled from the svn 4.0.1 branch just > today. > > Earlier today I was running the 4.0-fixes tree from 04/03 and I had the > same results. > > I was hoping the update to current sources would fix these problems but no > such luck. 
> > > > If I run the following: > > > > qsub -I -l nodes=7 -l arch=atomN570 > > > > from my pbs job submission host I get: > > > > qsub: waiting for job 4.login2.sep.here to start > > qsub: job 4.login2.sep.here ready > > > > and then I get a shell prompt on the node 0 of this job. > > > > If I then do: > > > > $ echo $PBS_NODEFILE > > /var/spool/torque/aux//4.login2.sep.here > > > > And then: > > > > $ cat /var/spool/torque/aux//4.login2.sep.here > > atom255 > > atom255 > > atom255 > > atom255 > > atom254 > > atom254 > > atom254 > > > > and then I try: > > > > $ pbsdsh -h atom254 ls /tmp > > pbsdsh: error from tm_poll() 17002 > > > > Alternatively if I use the -v option it says: > > > > $ pbsdsh -v -h atom254 /bin/ls /tmp > > pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000) > > > > Then when I exit the shell I get back to my job submission node but when I > then check for the pbs_mom process on that node 0 for the job that I just > had the shell prompt on, the pbs_mom process is no longer running and the > status message from the init script says: > > > > pbs_mom dead but subsys locked > > > > Also if I run any mpi jobs through torque that are run on cpu cores on one > system, they seem to run just fine but for mpi jobs where the job spans > across separate nodes I get the following error out each time: > > > > $ cat script_viking_7nodes.pbs.e12 > > [viking12.sep.here:12360] plm:tm: failed to poll for a spawned daemon, > return status = 17002 > > -------------------------------------------------------------------------- > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to > > launch so we are aborting. > > > > There may be more information reported by the environment (see above). > > > > This may be because the daemon was unable to find all the needed shared > > libraries on the remote node.
You may set your LD_LIBRARY_PATH to have the > > location of the shared libraries on the remote nodes and this will > > automatically be forwarded to the remote nodes. > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > > mpirun noticed that the job aborted, but has no info as to the process > > that caused that situation. > > -------------------------------------------------------------------------- > > -------------------------------------------------------------------------- > > mpirun was unable to cleanly terminate the daemons on the nodes shown > > below. Additional manual cleanup may be required - please refer to > > the "orte-clean" tool for assistance. > > -------------------------------------------------------------------------- > > viking11.sep.here - daemon did not report back when launched > > > > The PBS job script for this MPI job is very simple: > > > > #PBS -l nodes=7:Viking:ppn=1 > > #PBS -l arch=xeon > > > > mpirun --machinefile $PBS_NODEFILE /home/sadX/mpi_test/mpi_hello_hostname > > hello_out_viking7nodes > > > > Also if I ssh to any node in my cluster and create a mpi nodes file by > hand and then use that to run like this: > > > > mpirun --machinefile ../nodefile ./mpi_hello_hostname > > > > I get back all of the expected output from all of the nodes listed in the > manually created nodes file. > > -- > > Steven DuChene > > > > > > -- > > > > The Computer Center, University of Tromsø, N-9037 TROMSØ Norway. > > phone:+47 77 64 41 07, fax:+47 77 64 41 00 > > Roy Dragseth, Team Leader, High Performance Computing > > Direct call: +47 77 64 62 56.
email: roy.dragseth at uit.no > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120430/5fab8446/attachment-0001.html From stevenx.a.duchene at intel.com Mon Apr 30 10:20:51 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Mon, 30 Apr 2012 16:20:51 +0000 Subject: [torqueusers] pbs_mom dies on exit of interactive session In-Reply-To: References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> <1509369.qy0g8YhCvS@lux> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF80605DED9@ORSMSX106.amr.corp.intel.com> If you can tell me what additional information you need I can try to obtain it. -- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer Sent: Monday, April 30, 2012 9:00 AM To: Torque Users Mailing List Subject: Re: [torqueusers] pbs_mom dies on exit of interactive session Roy, I'm sorry your message wasn't attended to - its the nature of a mailing list that sometimes messages are lost. Can you (or Steve) describe the TM connect issues in a bit more detail? We should get a bug report filed for this. David On Sun, Apr 29, 2012 at 11:03 AM, Roy Dragseth wrote: I can confirm this too (reported the same problem to the mailing list a month ago, but nobody seemed to care). Torque 3.0.4 works fine with exactly the same system config. r. On Saturday 28. April 2012 03.21.05 DuChene, StevenX A wrote: I am running torque-4.0.1 that I pulled from the svn 4.0.1 branch just today. Earlier today I was running the 4.0-fixes tree from 04/03 and I had the same results.
I was hoping the update to current sources would fix these problems but no such luck. If I run the following: qsub -I -l nodes=7 -l arch=atomN570 from my pbs job submission host I get: qsub: waiting for job 4.login2.sep.here to start qsub: job 4.login2.sep.here ready and then I get a shell prompt on the node 0 of this job. If I then do: $ echo $PBS_NODEFILE /var/spool/torque/aux//4.login2.sep.here And then: $ cat /var/spool/torque/aux//4.login2.sep.here atom255 atom255 atom255 atom255 atom254 atom254 atom254 and then I try: $ pbsdsh -h atom254 ls /tmp pbsdsh: error from tm_poll() 17002 Alternatively if I use the -v option it says: $ pbsdsh -v -h atom254 /bin/ls /tmp pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000) Then when I exit the shell I get back to my job submission node but when I then check for the pbs_mom process on that node 0 for the job that I just had the shell prompt on, the pbs_mom process is no longer running and the status message from the init script says: pbs_mom dead but subsys locked Also if I run any mpi jobs through torque that are run on cpu cores on one system, they seem to run just fine but for mpi jobs where the job spans across separate nodes I get the following error out each time: $ cat script_viking_7nodes.pbs.e12 [viking12.sep.here:12360] plm:tm: failed to poll for a spawned daemon, return status = 17002 -------------------------------------------------------------------------- A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting. There may be more information reported by the environment (see above). This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
-------------------------------------------------------------------------- -------------------------------------------------------------------------- mpirun noticed that the job aborted, but has no info as to the process that caused that situation. -------------------------------------------------------------------------- -------------------------------------------------------------------------- mpirun was unable to cleanly terminate the daemons on the nodes shown below. Additional manual cleanup may be required - please refer to the "orte-clean" tool for assistance. -------------------------------------------------------------------------- viking11.sep.here - daemon did not report back when launched The PBS job script for this MPI job is very simple: #PBS -l nodes=7:Viking:ppn=1 #PBS -l arch=xeon mpirun --machinefile $PBS_NODEFILE /home/sadX/mpi_test/mpi_hello_hostname > hello_out_viking7nodes Also if I ssh to any node in my cluster and create a mpi nodes file by hand and then use that to run like this: mpirun --machinefile ../nodefile ./mpi_hello_hostname I get back all of the expected output from all of the nodes listed in the manually created nodes file. -- Steven DuChene -- The Computer Center, University of Tromsø, N-9037 TROMSØ Norway. phone:+47 77 64 41 07, fax:+47 77 64 41 00 Roy Dragseth, Team Leader, High Performance Computing Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
_______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Software Engineer Adaptive Computing From dbeer at adaptivecomputing.com Mon Apr 30 10:57:35 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 30 Apr 2012 10:57:35 -0600 Subject: [torqueusers] Call For Test Cases Message-ID: All, I don't know how much of this has been announced officially to the community, but for the past several months TORQUE has become more integrated with our company's QA processes. We have had hundreds of new regression tests written for TORQUE and as of 4.0 we have started writing unit tests for TORQUE functions (obviously the unit tests are a work in progress). We're really excited about the fact that TORQUE is getting more QA love from our company and we feel its going to help us produce far more reliable releases. However, one problem we face is that sometimes our test cases aren't matching up well with what some of our users are doing, especially when it comes to integrating with different pieces of middleware. We would love to start creating regression tests (or adapting existing ones) to match people's use cases more fully. Obviously, I can't promise that all of the test cases will be written this week, but this is something our company is committed to. We know that improving this will only make all of our lives easier. What we're hoping to receive would be detailed use cases that should be tested. Please remember that most of our QA team aren't experts in TORQUE, so the use cases should be explained in as much detail as possible - step by step instructions are excellent. If it requires using a specific library, or a specific MPI, etc., please include how to download these things. Additionally, if you have some test scripts that mirror what your users do in production, I'm sure our QA guys would appreciate it. 
Please submit these use cases through bugzilla so that we can keep them as organized as possible. We appreciate your help as we try to greatly improve the quality of TORQUE. -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120430/3bc35435/attachment.html From stevenx.a.duchene at intel.com Mon Apr 30 11:36:02 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Mon, 30 Apr 2012 17:36:02 +0000 Subject: [torqueusers] Call For Test Cases In-Reply-To: References: Message-ID: <560DBE57F33C4C4C9FBF11C662951AF80605FFA9@ORSMSX106.amr.corp.intel.com> That is great but what use cases or QA tests are you already covering? I am running torque-4.0.1 with maui-3.3.1 as the scheduler and I usually test with some simple MPI tests using openmpi-1.4.4 (what I have installed now). These MPI tests are simple things like hello world programs that run mpi tasks across several nodes and report back with task numbers and hostnames. I have some more complex MPI tests but the simple ones have to work first before I move onto them. -- Steven DuChene From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer Sent: Monday, April 30, 2012 9:58 AM To: Torque Users Mailing List Subject: [torqueusers] Call For Test Cases All, I don't know how much of this has been announced officially to the community, but for the past several months TORQUE has become more integrated with our company's QA processes. We have had hundreds of new regression tests written for TORQUE and as of 4.0 we have started writing unit tests for TORQUE functions (obviously the unit tests are a work in progress). We're really excited about the fact that TORQUE is getting more QA love from our company and we feel its going to help us produce far more reliable releases. 
However, one problem we face is that sometimes our test cases aren't matching up well with what some of our users are doing, especially when it comes to integrating with different pieces of middleware. We would love to start creating regression tests (or adapting existing ones) to match people's use cases more fully. Obviously, I can't promise that all of the test cases will be written this week, but this is something our company is committed to. We know that improving this will only make all of our lives easier. What we're hoping to receive would be detailed use cases that should be tested. Please remember that most of our QA team aren't experts in TORQUE, so the use cases should be explained in as much detail as possible - step by step instructions are excellent. If it requires using a specific library, or a specific MPI, etc., please include how to download these things. Additionally, if you have some test scripts that mirror what your users do in production, I'm sure our QA guys would appreciate it. Please submit these use cases through bugzilla so that we can keep them as organized as possible. We appreciate your help as we try to greatly improve the quality of TORQUE. -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120430/712b0008/attachment.html From dbeer at adaptivecomputing.com Mon Apr 30 11:55:42 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 30 Apr 2012 11:55:42 -0600 Subject: [torqueusers] Call For Test Cases In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF80605FFA9@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF80605FFA9@ORSMSX106.amr.corp.intel.com> Message-ID: On Mon, Apr 30, 2012 at 11:36 AM, DuChene, StevenX A <stevenx.a.duchene at intel.com> wrote: > That is great but what use cases or QA tests are you already covering? > > I am running torque-4.0.1 with maui-3.3.1 as the scheduler and I usually > test with some simple MPI tests using openmpi-1.4.4 (what I have installed > now). These MPI tests are simple things like hello world programs that run > mpi tasks across several nodes and report back with task numbers and > hostnames. I have some more complex MPI tests but the simple ones have to > work first before I move onto them. > > -- > Steven DuChene > Steve, There are currently a few hundred regression tests, but the bugs you've found have helped us realize that especially in MPI integration our tests aren't very good. Currently our test cases use pbsdsh to exercise the TM interface and verify that it is working. We recognize that this is inadequate and that's part of the reason we're doing this. You'd probably be interested to know that regression tests are being written about the cases that you've reported as part of fixing the bug (this is a prerequisite for any bug to be considered fixed). The first thing we plan to tackle in this initiative is to get some openmpi regression tests written in addition to just using pbsdsh. Specific scripts you'd like to see tested would be very helpful. Currently, there are no regression tests written around Maui, and I apologize as I know you did that testing for us.
I will try to pull this in to the initiative but I feel I should warn you that I know this will be considered a lower priority than the MPI tests. David > > From: torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] On Behalf Of David Beer > Sent: Monday, April 30, 2012 9:58 AM > To: Torque Users Mailing List > Subject: [torqueusers] Call For Test Cases > > All, > > I don't know how much of this has been announced officially to the > community, but for the past several months TORQUE has become more > integrated with our company's QA processes. We have had hundreds of new > regression tests written for TORQUE and as of 4.0 we have started writing > unit tests for TORQUE functions (obviously the unit tests are a work in > progress). We're really excited about the fact that TORQUE is getting more > QA love from our company and we feel its going to help us produce far more > reliable releases. > > However, one problem we face is that sometimes our test cases aren't > matching up well with what some of our users are doing, especially when it > comes to integrating with different pieces of middleware. We would love to > start creating regression tests (or adapting existing ones) to match > people's use cases more fully. Obviously, I can't promise that all of the > test cases will be written this week, but this is something our company is > committed to. We know that improving this will only make all of our lives > easier. > > What we're hoping to receive would be detailed use cases that should be > tested. Please remember that most of our QA team aren't experts in TORQUE, > so the use cases should be explained in as much detail as possible - step > by step instructions are excellent. If it requires using a specific > library, or a specific MPI, etc., please include how to download these > things.
Additionally, if you have some test scripts that mirror what your > users do in production, I'm sure our QA guys would appreciate it. Please > submit these use cases through bugzilla so that we can keep them as > organized as possible. We appreciate your help as we try to greatly improve > the quality of TORQUE. > > -- > David Beer | Software Engineer > Adaptive Computing > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120430/03db115e/attachment-0001.html From dbeer at adaptivecomputing.com Mon Apr 30 11:57:15 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 30 Apr 2012 11:57:15 -0600 Subject: [torqueusers] pbs_mom dies on exit of interactive session In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF80605DED9@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> <1509369.qy0g8YhCvS@lux> <560DBE57F33C4C4C9FBF11C662951AF80605DED9@ORSMSX106.amr.corp.intel.com> Message-ID: Steve, The two things that seem that they'd be the most helpful would be the text of any error messages you're seeing and steps to reproduce. Thanks, David On Mon, Apr 30, 2012 at 10:20 AM, DuChene, StevenX A <stevenx.a.duchene at intel.com> wrote: > If you can tell me what additional information you need I can try to > obtain it.
> -- > Steven DuChene > > From: torqueusers-bounces at supercluster.org [mailto: > torqueusers-bounces at supercluster.org] On Behalf Of David Beer > Sent: Monday, April 30, 2012 9:00 AM > To: Torque Users Mailing List > Subject: Re: [torqueusers] pbs_mom dies on exit of interactive session > > Roy, > > I'm sorry your message wasn't attended to - its the nature of a mailing > list that sometimes messages are lost. Can you (or Steve) describe the TM > connect issues in a bit more detail? We should get a bug report filed for > this. > > David > On Sun, Apr 29, 2012 at 11:03 AM, Roy Dragseth > wrote: > I can confirm this too (reported the same problem to the mailing list a > month ago, but nobody seemed to care). Torque 3.0.4 works fine with exactly > the same system config. > > r. > > > On Saturday 28. April 2012 03.21.05 DuChene, StevenX A wrote: > I am running torque-4.0.1 that I pulled from the svn 4.0.1 branch just > today. > Earlier today I was running the 4.0-fixes tree from 04/03 and I had the > same results. > I was hoping the update to current sources would fix these problems but no > such luck. > > If I run the following: > > qsub -I -l nodes=7 -l arch=atomN570 > > from my pbs job submission host I get: > > qsub: waiting for job 4.login2.sep.here to start > qsub: job 4.login2.sep.here ready > > and then I get a shell prompt on the node 0 of this job. 
> > If I then do: > > $ echo $PBS_NODEFILE > /var/spool/torque/aux//4.login2.sep.here > > And then: > > $ cat /var/spool/torque/aux//4.login2.sep.here > atom255 > atom255 > atom255 > atom255 > atom254 > atom254 > atom254 > > and then I try: > > $ pbsdsh -h atom254 ls /tmp > pbsdsh: error from tm_poll() 17002 > > Alternatively if I use the -v option it says: > > $ pbsdsh -v -h atom254 /bin/ls /tmp > pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000) > > Then when I exit the shell I get back to my job submission node but when I > then check for the pbs_mom process on that node 0 for the job that I just > had the shell prompt on, the pbs_mom process is no longer running and the > status message from the init script says: > > pbs_mom dead but subsys locked > > Also if I run any mpi jobs through torque that are run on cpu cores on one > system, they seem to run just fine but for mpi jobs where the job spans > across separate nodes I get the following error out each time: > > $ cat script_viking_7nodes.pbs.e12 > [viking12.sep.here:12360] plm:tm: failed to poll for a spawned daemon, > return status = 17002 > -------------------------------------------------------------------------- > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to > launch so we are aborting. > > There may be more information reported by the environment (see above). > > This may be because the daemon was unable to find all the needed shared > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the > location of the shared libraries on the remote nodes and this will > automatically be forwarded to the remote nodes. > -------------------------------------------------------------------------- > -------------------------------------------------------------------------- > mpirun noticed that the job aborted, but has no info as to the process > that caused that situation. 
> -------------------------------------------------------------------------- > -------------------------------------------------------------------------- > mpirun was unable to cleanly terminate the daemons on the nodes shown > below. Additional manual cleanup may be required - please refer to > the "orte-clean" tool for assistance. > -------------------------------------------------------------------------- > viking11.sep.here - daemon did not report back when launched > > The PBS job script for this MPI job is very simple: > > #PBS -l nodes=7:Viking:ppn=1 > #PBS -l arch=xeon > > mpirun --machinefile $PBS_NODEFILE /home/sadX/mpi_test/mpi_hello_hostname > > hello_out_viking7nodes > > Also if I ssh to any node in my cluster and create a mpi nodes file by > hand and then use that to run like this: > > mpirun --machinefile ../nodefile ./mpi_hello_hostname > > I get back all of the expected output from all of the nodes listed in the > manually created nodes file. > -- > Steven DuChene > > > -- > > The Computer Center, University of Troms?, N-9037 TROMS? Norway. > phone:+47 77 64 41 07, fax:+47 77 64 41 00 > Roy Dragseth, Team Leader, High Performance Computing > Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > David Beer | Software Engineer > Adaptive Computing > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120430/8ef95ff6/attachment.html

From stevenx.a.duchene at intel.com  Mon Apr 30 12:56:35 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Mon, 30 Apr 2012 18:56:35 +0000
Subject: [torqueusers] Call For Test Cases
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF80605FFA9@ORSMSX106.amr.corp.intel.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF8060610A5@ORSMSX106.amr.corp.intel.com>

Here is some additional MPI testing info.

Simple MPI code:

/* program hello */
/* Adapted from mpihello.f by drs */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* gethostname() */

int main(int argc, char **argv)
{
    int *buf, i, rank, nints, len;
    char hostname[256];

    MPI_Init(&argc,&argv);
    /* Each task reports its own rank and the host it landed on. */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(hostname,255);
    printf("Hello world! I am process number: %d on host %s\n", rank, hostname);
    MPI_Finalize();
    return 0;
}

After installing openmpi this is how to compile that code:

mpicc mpi_hello_hostname.c -o mpi_hello_hostname

To run this code within torque I use the following PBS job script:

#PBS -l nodes=7:Viking:ppn=1
#PBS -l arch=xeon
#PBS -N MPI_7nodes_viking
mpirun --machinefile $PBS_NODEFILE /home/sadX/mpi_test/mpi_hello_hostname > hello_out_viking7nodes

and this is run with a simple qsub command:

qsub script_viking_7nodes.pbs

The expected output should be:

Hello world! I am process number: 2 on host viking12.sep.here
Hello world! I am process number: 0 on host viking13.sep.here
Hello world! I am process number: 1 on host viking14.sep.here
Hello world! I am process number: 3 on host viking15.sep.here
Hello world! I am process number: 6 on host viking11.sep.here
Hello world! I am process number: 4 on host viking17.sep.here
Hello world! I am process number: 5 on host viking16.sep.here

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Monday, April 30, 2012 10:56 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Call For Test Cases

On Mon, Apr 30, 2012 at 11:36 AM, DuChene, StevenX A wrote:

That is great but what use cases or QA tests are you already covering?
I am running torque-4.0.1 with maui-3.3.1 as the scheduler and I usually
test with some simple MPI tests using openmpi-1.4.4 (what I have
installed now).
These MPI tests are simple things like hello world programs that run mpi
tasks across several nodes and report back with task numbers and
hostnames.
I have some more complex MPI tests but the simple ones have to work
first before I move onto them.
--
Steven DuChene

Steve,

There are currently a few hundred regression tests, but the bugs you've
found have helped us realize that especially in MPI integration our
tests aren't very good. Currently our test cases use pbsdsh to exercise
the TM interface and verify that it is working. We recognize that this
is inadequate and that's part of the reason we're doing this. You'd
probably be interested to know that regression tests are being written
about the cases that you've reported as part of fixing the bug (this is
a prerequisite for any bug to be considered fixed).

The first thing we plan to tackle in this initiative is to get some
openmpi regression tests written in addition to just using pbsdsh.
Specific scripts you'd like to see tested would be very helpful.

Currently, there are no regression tests written around Maui, and I
apologize as I know you did that testing for us. I will try to pull
this into the initiative but I feel I should warn you that I know this
will be considered a lower priority than the MPI tests.

David

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Monday, April 30, 2012 9:58 AM
To: Torque Users Mailing List
Subject: [torqueusers] Call For Test Cases

All,

I don't know how much of this has been announced officially to the
community, but for the past several months TORQUE has become more
integrated with our company's QA processes. We have had hundreds of new
regression tests written for TORQUE and as of 4.0 we have started
writing unit tests for TORQUE functions (obviously the unit tests are a
work in progress). We're really excited about the fact that TORQUE is
getting more QA love from our company and we feel it's going to help us
produce far more reliable releases.

However, one problem we face is that sometimes our test cases aren't
matching up well with what some of our users are doing, especially when
it comes to integrating with different pieces of middleware. We would
love to start creating regression tests (or adapting existing ones) to
match people's use cases more fully. Obviously, I can't promise that
all of the test cases will be written this week, but this is something
our company is committed to. We know that improving this will only make
all of our lives easier.

What we're hoping to receive would be detailed use cases that should be
tested. Please remember that most of our QA team aren't experts in
TORQUE, so the use cases should be explained in as much detail as
possible - step by step instructions are excellent. If it requires
using a specific library, or a specific MPI, etc., please include how
to download these things. Additionally, if you have some test scripts
that mirror what your users do in production, I'm sure our QA guys
would appreciate it.

Please submit these use cases through bugzilla so that we can keep them
as organized as possible.

We appreciate your help as we try to greatly improve the quality of
TORQUE.
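
As a rough illustration of the kind of test script requested above, the
following Python sketch checks the output file written by the
mpi_hello_hostname job shown earlier in this thread. It is not part of
TORQUE or its test suite. The function name check_hello_output and the
expected_ranks parameter are made up for this example, and the only thing
it assumes is the line format printed by that program.

```python
import re
from collections import Counter

# Line format printed by the mpi_hello_hostname program quoted in this thread.
LINE_RE = re.compile(r"^Hello world! I am process number: (\d+) on host (\S+)$")


def check_hello_output(text, expected_ranks):
    """Check hello-world output for a job with `expected_ranks` MPI tasks.

    Returns (problems, hosts): a list of human-readable problems (empty
    when every rank reported exactly once) and a dict mapping each host
    name to the number of tasks that reported from it.
    """
    problems = []
    seen = Counter()   # rank -> number of times it reported
    hosts = Counter()  # host -> number of tasks that ran there

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            # e.g. mpirun/ORTE error text mixed into the output
            problems.append(f"unexpected line: {line!r}")
            continue
        seen[int(match.group(1))] += 1
        hosts[match.group(2)] += 1

    for rank in range(expected_ranks):
        if seen[rank] == 0:
            problems.append(f"rank {rank} missing")
        elif seen[rank] > 1:
            problems.append(f"rank {rank} reported {seen[rank]} times")

    for rank in sorted(seen):
        if rank >= expected_ranks:
            problems.append(f"unexpected rank {rank}")

    return problems, dict(hosts)
```

For the 7-node job above one would pass the contents of
hello_out_viking7nodes with expected_ranks=7. An empty problem list means
every rank reported exactly once, and the host dictionary shows whether the
tasks were really spread across separate nodes.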
-- 
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120430/e65e61f4/attachment-0001.html

From stevenx.a.duchene at intel.com  Mon Apr 30 14:24:58 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Mon, 30 Apr 2012 20:24:58 +0000
Subject: [torqueusers] pbs_mom dies on exit of interactive session
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> <1509369.qy0g8YhCvS@lux> <560DBE57F33C4C4C9FBF11C662951AF80605DED9@ORSMSX106.amr.corp.intel.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF8060610CE@ORSMSX106.amr.corp.intel.com>

I added a unique feature to two nodes (one unique feature for each node):

qmgr -c "set node atom221 properties += yestest"
qmgr -c "set node atom231 properties += testdie"

Then as a regular user I submitted an interactive job like this:

qsub -I -l nodes=1:testdie+1:yestest

I got this back:

qsub: waiting for job 25.login2.sep.here to start
qsub: job 25.login2.sep.here ready

and then I got a shell on one of the nodes with the unique features.

I checked the contents of the $PBS_NODEFILE:

echo $PBS_NODEFILE
/var/spool/torque/aux//25.login2.sep.here
$ cat /var/spool/torque/aux//25.login2.sep.here
atom231
atom221

Using pbsdsh I tried to run a command on the other node I had been assigned:

$ pbsdsh -v -h atom221 ls /tmp
pbsdsh: rescinfo from 0: Linux atom231 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64:nodes=1:testdie+1:yestest,walltime=01:00:00
pbsdsh: error from tm_poll() 17002

If I immediately try the same command again I get:

$ pbsdsh -v -h atom221 ls /tmp
pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)

The output from "print server" is:

# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = login2
set server managers = maui at login2.sep.here
set server managers += root at login2.sep.here
set server operators = maui at login2.sep.here
set server operators += root at login2.sep.here
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 300
set server submit_hosts = elogin1
set server next_job_number = 26
set server moab_array_compatible = True

I don't have anything unusual in my maui.cfg file:

# cat /usr/local/maui/maui.cfg |grep -v ^$|grep -v ^#
SERVERHOST login2
ADMIN1 maui root
RMCFG[ELOGIN2] TYPE=PBS
AMCFG[bank] TYPE=NONE
RMPOLLINTERVAL 00:00:30
SERVERPORT 42559
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
QUEUETIMEWEIGHT 1
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY MINRESOURCE
ENABLEMULTIREQJOBS TRUE
ENABLEMULTINODEJOBS TRUE
JOBNODEMATCHPOLICY EXACTNODE

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Monday, April 30, 2012 10:57 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] pbs_mom dies on exit of interactive session

Steve,

The two things that seem that they'd be the most helpful would be the
text of any error messages you're seeing and steps to reproduce.

Thanks,

David

On Mon, Apr 30, 2012 at 10:20 AM, DuChene, StevenX A wrote:

If you can tell me what additional information you need I can try to
obtain it.
--
Steven DuChene

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120430/79878952/attachment-0001.html

From roy.dragseth at cc.uit.no  Mon Apr 30 14:39:39 2012
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Mon, 30 Apr 2012 22:39:39 +0200
Subject: [torqueusers] pbs_mom dies on exit of interactive session
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> <1509369.qy0g8YhCvS@lux>
Message-ID: <5712849.CGKjQuDxcD@lux>

No problem. I should probably have used the bugzilla for these kind of
things anyway.

I'll try to scramble together a more thorough bug report on this matter.

BTW, is 4.0.X supposed to be config compatible with 3.0.X? That is, if
3.0.X works fine, can I assume that the same config will work on 4.0.X?
Would any differences in behaviour indicate a bug is present? The
reason I'm asking is that I have a fairly solid setup for v3.0.4 in my
torque-roll for Rocks, but I'm having a hard time to get 4.0.X working
in the same environment. Being pressed for time I would like a hint on
where to look for problems.

Regards,
r.

-- 
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120430/5f5cb26a/attachment-0001.html

From roy.dragseth at cc.uit.no  Mon Apr 30 16:07:45 2012
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Tue, 01 May 2012 00:07:45 +0200
Subject: [torqueusers] pbs_mom dies on exit of interactive session
In-Reply-To: <5712849.CGKjQuDxcD@lux>
References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> <5712849.CGKjQuDxcD@lux>
Message-ID: <1907106.CD0yMpC1j1@lux>

Here is a job excerpt demonstrating the error messages when running
pbsdsh. The pbs_mom thread dies if you try to run pbsdsh -u, and dmesg
shows a segfault.

[marve at hpc1 ~]$ qsub -I -lnodes=2:ppn=2,walltime=1000
qsub: waiting for job 13.hpc1.cc.uit.no to start
qsub: job 13.hpc1.cc.uit.no ready

[marve at compute-0-2 ~]$ pbsdsh uname -a
Linux compute-0-2.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
pbsdsh: Event poll failed, error TM_ENOTCONNECTED
Linux compute-0-0.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
Linux compute-0-2.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
Linux compute-0-0.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
pbsdsh: reconnected
pbsdsh: Event poll failed, error TM_ENOTFOUND
[marve at compute-0-2 ~]$
[marve at compute-0-2 ~]$ pbsdsh -u uname -a
[marve at compute-0-2 ~]$ pbsdsh -u uname -a
[marve at compute-0-2 ~]$ dmesg | tail -n1
pbs_mom[1980]: segfault at 20 ip 000000000040b240 sp 00007fffc1853820 error 4 in pbs_mom[400000+5f000]
[marve at compute-0-2 ~]$ logout

qsub: job 13.hpc1.cc.uit.no completed

This is using torque-4.0.1-snap.201204031702.tar.gz

(the problems related to getting 4.0.X up and running seem to be related
to the fact that pbs_server now only listens on one interface, whereas
earlier it listened on all interfaces. I'll post a separate report for
this)

r.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120501/c3f9a4a0/attachment.html

From kenneth at sdsc.edu  Mon Apr 30 16:10:04 2012
From: kenneth at sdsc.edu (Kenneth Yoshimoto)
Date: Mon, 30 Apr 2012 15:10:04 -0700 (PDT)
Subject: [torqueusers] interactive qsub failure
In-Reply-To: <20120427230149.GX9750@lbl.gov>
References: <20120427230149.GX9750@lbl.gov>
Message-ID:

I do see the syn, syn/ack, rst pattern with the failed attempts.
I'll give your suggestion a try.

Thanks!
Kenneth

On Fri, 27 Apr 2012, Michael Jennings wrote:

> Date: Fri, 27 Apr 2012 16:01:50 -0700
> From: Michael Jennings
> Reply-To: Torque Users Mailing List
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] interactive qsub failure
>
> On Friday, 27 April 2012, at 14:28:14 (-0700),
> Kenneth Yoshimoto wrote:
>
>>
>> I'm seeing an intermittent failure with qsub -I
>>
>> The message in /var/log/messages is:
>> Apr 27 14:07:27 gcn-17-71 pbs_mom: LOG_ERROR::Operation now in progress (115) in TMomFinalizeChild, cannot open interactive qsub socket to host gordon-ln4.local:50620 - 'cannot connect to port 1023 in client_to_svr - connection refused' - check routing tables/multi-homed host issues
>>
>> I think my routing is okay, as I can telnet to the login node
>> port from the compute node. I also see some packet exchange to
>> the port with tcpdump. Could the mom be attempting the connection
>> before qsub starts listening? I would have thought qsub would
>> start listening before sending the job to pbs_server. Any ideas
>> on what might cause this?
>
> Are you by any chance seeing a SYN, a SYN/ACK, and a RST?
>
> If so, try setting $max_conn_timeout_micro_sec to 500000 in your
> pbs_mom config and see if that helps.
>
> Michael
>
>

From roy.dragseth at cc.uit.no  Mon Apr 30 16:24:17 2012
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Tue, 01 May 2012 00:24:17 +0200
Subject: [torqueusers] Configure pbs_server to listen to all interfaces (torque 4)?
Message-ID: <1367310.7tLNPAkoDo@lux>

On v3 and earlier pbs_server would listen on all interfaces by default,
but this seems to have changed under v4.

Version 3.0.2

# lsof -i | grep pbs
pbs_serve 17765 root 8u IPv4 383811137 TCP *:15001 (LISTEN)
pbs_serve 17765 root 10u IPv4 383811140 UDP *:15001
pbs_serve 17765 root 11u IPv4 383811141 UDP *:1023

Version 4.0.1 (snapshot)

[root at hpc1 torque]# lsof -i | grep pbs
pbs_serve 21247 root 7u IPv4 14268817 0t0 TCP *:63000 (LISTEN)
pbs_serve 21247 root 8u IPv4 14268819 0t0 TCP hpc1.cc.uit.no:15001 (LISTEN)
[root at hpc1 torque]# hostname
hpc1.cc.uit.no

Under torque 4 pbs_server listens only on the interface associated with
the hostname, which makes it ignore all internal traffic, so compute
nodes cannot contact it. Is it possible to get the old behaviour back?
Using the -H flag doesn't help as it breaks other things when
communicating with the compute nodes. The only way to get torque 4
functional was to set the frontend hostname to the same as the one on
the internal interface, hpc1.local, which isn't a good solution.

r.

-- 
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no

From stevenx.a.duchene at intel.com  Mon Apr 30 16:58:28 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Mon, 30 Apr 2012 22:58:28 +0000
Subject: [torqueusers] Configure pbs_server to listen to all interfaces (torque 4)?
In-Reply-To: <1367310.7tLNPAkoDo@lux> References: <1367310.7tLNPAkoDo@lux> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF806061188@ORSMSX106.amr.corp.intel.com> Yes, I am seeing the same thing here with my two instances of torque. With torque-2.5.7 # lsof -i -n -P|grep pbs pbs_serve 32368 root 8u IPv4 242454945 0t0 TCP *:15001 (LISTEN) pbs_serve 32368 root 9u IPv4 242454946 0t0 UDP *:15001 pbs_serve 32368 root 10u IPv4 242454947 0t0 UDP *:1023 with torque-4.0.1 from torque-4.0.1.snap201204276052.tar.gz ]# lsof -i -n -P|grep pbs pbs_serve 2553 root 7u IPv4 14851 0t0 TCP *:63000 (LISTEN) pbs_serve 2553 root 8u IPv4 14853 0t0 TCP 192.168.8.27:15001 (LISTEN) pbs_serve 2553 root 11u IPv4 2592331 0t0 TCP 192.168.8.27:15001->192.168.8.27:39678 (ESTABLISHED) -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Roy Dragseth Sent: Monday, April 30, 2012 3:24 PM To: torqueusers at supercluster.org Subject: [torqueusers] Configure pbs_server to listen to all interfaces (torque 4)? On v3 and earlier pbs_server would listen to all interfaces per default, but this seems to have changed under v4. Version 3.0.2 # lsof -i | grep pbs pbs_serve 17765 root 8u IPv4 383811137 TCP *:15001 (LISTEN) pbs_serve 17765 root 10u IPv4 383811140 UDP *:15001 pbs_serve 17765 root 11u IPv4 383811141 UDP *:1023 Version 4.0.1 (snapshot) [root at hpc1 torque]# lsof -i | grep pbs pbs_serve 21247 root 7u IPv4 14268817 0t0 TCP *:63000 (LISTEN) pbs_serve 21247 root 8u IPv4 14268819 0t0 TCP hpc1.cc.uit.no:15001 (LISTEN) [root at hpc1 torque]# hostname hpc1.cc.uit.no under torque 4 pbs_server listen only to the interface associated with the hostname which makes it ignore all internal traffic thus compute nodes cannot contact it. Is it possble to get the old behaviour back? Using the -H flag doesn't help as it breaks other things when communicating with the compute nodes. 
The only way to get torque 4 functional was to set the frontend hostname to the same as the one on the internal interface, hpc1.local, which isn't a good solution. r. -- The Computer Center, University of Troms?, N-9037 TROMS? Norway. phone:+47 77 64 41 07, fax:+47 77 64 41 00 Roy Dragseth, Team Leader, High Performance Computing Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From jouvin at lal.in2p3.fr Mon Apr 30 14:55:51 2012 From: jouvin at lal.in2p3.fr (Michel Jouvin) Date: Mon, 30 Apr 2012 22:55:51 +0200 Subject: [torqueusers] pbs_mom dies on exit of interactive session In-Reply-To: <5712849.CGKjQuDxcD@lux> References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.inte l.com> <1509369.qy0g8YhCvS@lux> <5712849.CGKjQuDxcD@lux> Message-ID: Ca me fait r?aliser que l'appel des bennes est pass? ? l'as aujourd'hui. Mon muezin n'a pas sonn?... Je fais plusieurs noeuds ? mon mouchoir pour mercredi matin... Michel --On lundi 30 avril 2012 22:39 +0200 Roy Dragseth wrote: > No, problem. I should probably have used the bugzilla for these kind of > things anyway. > > I'll try to scramble together a more thorough bug report on this matter. > > BTW, is 4.0.X supposed to be config compatible with 3.0.X? That is, if > 3.0.X works fine, can I assume that the same config will work on 4.0.X. > Would any differences in behaviour indicate a bug is present? The > reason I'm asking is that I have a fairly solid setup for v3.0.4 in my > torque-roll for Rocks, but I'm having a hard time to get 4.0.X working > in the same environment. Being pressed for time I would like a hint on > where to look for problems. > > Regards, > r. > > > On Monday 30. 
April 2012 09.59.39 David Beer wrote: > > Roy, > > > I'm sorry your message wasn't attended to - its the nature of a mailing > list that sometimes messages are lost. Can you (or Steve) describe the > TM connect issues in a bit more detail? We should get a bug report filed > for this. > > > David > > > On Sun, Apr 29, 2012 at 11:03 AM, Roy Dragseth > wrote: > > I can confirm this too (reported the same problem to the mailing list a > month ago, but nobody seemed to care). Torque 3.0.4 works fine with > exactly the same system config. > > r. > > > On Saturday 28. April 2012 03.21.05 DuChene, StevenX A wrote: > > I am running torque-4.0.1 that I pulled from the svn 4.0.1 branch just > today. Earlier today I was running the 4.0-fixes tree from 04/03 and I > had the same results. > I was hoping the update to current sources would fix these problems but > no such luck. > > If I run the following: > > qsub -I -l nodes=7 -l arch=atomN570 > > from my pbs job submission host I get: > > qsub: waiting for job 4.login2.sep.here to start > qsub: job 4.login2.sep.here ready > > and then I get a shell prompt on the node 0 of this job. 
> > If I then do: > > $ echo $PBS_NODEFILE > /var/spool/torque/aux//4.login2.sep.here > > And then: > > $ cat /var/spool/torque/aux//4.login2.sep.here > atom255 > atom255 > atom255 > atom255 > atom254 > atom254 > atom254 > > and then I try: > > $ pbsdsh -h atom254 ls /tmp > pbsdsh: error from tm_poll() 17002 > > Alternatively if I use the -v option it says: > > $ pbsdsh -v -h atom254 /bin/ls /tmp > pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000) > > Then when I exit the shell I get back to my job submission node but when > I then check for the pbs_mom process on that node 0 for the job that I > just had the shell prompt on, the pbs_mom process is no longer running > and the status message from the init script says: > > pbs_mom dead but subsys locked > > Also if I run any MPI jobs through torque that are run on cpu cores on > one system, they seem to run just fine but for MPI jobs where the job > spans across separate nodes I get the following error each time: > > $ cat script_viking_7nodes.pbs.e12 > [viking12.sep.here:12360] plm:tm: failed to poll for a spawned daemon, > return status = 17002 > -------------------------------------------------------------------------- > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to > launch so we are aborting. > > There may be more information reported by the environment (see above). > > This may be because the daemon was unable to find all the needed shared > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the > location of the shared libraries on the remote nodes and this will > automatically be forwarded to the remote nodes. > -------------------------------------------------------------------------- > -------------------------------------------------------------------------- > mpirun noticed that the job aborted, but has no info as to the process > that caused that situation. 
> -------------------------------------------------------------------------- > -------------------------------------------------------------------------- > mpirun was unable to cleanly terminate the daemons on the nodes shown > below. Additional manual cleanup may be required - please refer to > the "orte-clean" tool for assistance. > -------------------------------------------------------------------------- > viking11.sep.here - daemon did not report back when launched > > The PBS job script for this MPI job is very simple: > ># PBS -l nodes=7:Viking:ppn=1 ># PBS -l arch=xeon > > mpirun --machinefile $PBS_NODEFILE /home/sadX/mpi_test/mpi_hello_hostname > > hello_out_viking7nodes > > Also if I ssh to any node in my cluster and create a mpi nodes file by > hand and then use that to run like this: > > mpirun --machinefile ../nodefile ./mpi_hello_hostname > > I get back all of the expected output from all of the nodes listed in the > manually created nodes file. > -- > Steven DuChene > > > > > -- > > The Computer Center, University of Tromsø, N-9037 TROMSØ Norway. > phone:+47 77 64 41 07, fax:+47 77 64 41 00 > Roy Dragseth, Team Leader, High Performance Computing > Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > > > -- > > David Beer | Software Engineer > Adaptive Computing > > > > > -- > > The Computer Center, University of Tromsø, N-9037 TROMSØ Norway. > phone:+47 77 64 41 07, fax:+47 77 64 41 00 > Roy Dragseth, Team Leader, High Performance Computing > Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no ************************************************************* * Michel Jouvin Email : jouvin at lal.in2p3.fr * * LAL / CNRS Tel : +33 1 64468932 * * B.P. 
34 Fax : +33 1 69079404 * * 91898 Orsay Cedex * * France * *************************************************************
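For anyone retracing the quoted report: the listing printed from $PBS_NODEFILE has one hostname per allocated slot, so the `-l nodes=7` request shows up as seven lines spread over two hosts. The following is only a minimal Python sketch, assuming nothing beyond that one-hostname-per-line layout; the function name `slots_per_host` is made up for illustration and is not part of Torque. It tallies the slots per host from the listing in the report, which makes it easy to see that the job spans more than one node, the case where the pbsdsh and mpirun failures were reported.

```python
from collections import Counter


def slots_per_host(nodefile_text):
    """Count how many lines (slots) each host has in a PBS nodefile listing.

    Hypothetical helper for illustration; assumes one hostname per line.
    """
    hosts = [line.strip() for line in nodefile_text.splitlines() if line.strip()]
    return Counter(hosts)


# Listing copied from the quoted report (cat of the $PBS_NODEFILE path).
NODEFILE_TEXT = """\
atom255
atom255
atom255
atom255
atom254
atom254
atom254
"""

counts = slots_per_host(NODEFILE_TEXT)
for host, n in counts.items():
    print(f"{host}: {n} slot(s)")
print("total slots:", sum(counts.values()))
print("job spans multiple hosts:", len(counts) > 1)
```

Inside a real job the same function could be fed the contents of the file named by $PBS_NODEFILE instead of the pasted text.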