From dbeer at adaptivecomputing.com Tue May 1 10:15:36 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 1 May 2012 10:15:36 -0600
Subject: [torqueusers] Configure pbs_server to listen to all interfaces (torque 4)?
In-Reply-To: <1367310.7tLNPAkoDo@lux>
References: <1367310.7tLNPAkoDo@lux>
Message-ID:

Roy,

TORQUE 4 intentionally doesn't listen for UDP communication because all
communication in TORQUE 4.0 is TCP. This is done to increase reliability,
since routers are allowed to drop UDP packets when they are under high load.
By the way, listening on port 63000 was a bug introduced between 4.0.0 and
4.0.1 that has been fixed.

David

On Mon, Apr 30, 2012 at 4:24 PM, Roy Dragseth wrote:
> On v3 and earlier pbs_server would listen to all interfaces per default,
> but this seems to have changed under v4.
>
> Version 3.0.2
> # lsof -i | grep pbs
> pbs_serve 17765 root 8u IPv4 383811137 TCP *:15001 (LISTEN)
> pbs_serve 17765 root 10u IPv4 383811140 UDP *:15001
> pbs_serve 17765 root 11u IPv4 383811141 UDP *:1023
>
> Version 4.0.1 (snapshot)
> [root@hpc1 torque]# lsof -i | grep pbs
> pbs_serve 21247 root 7u IPv4 14268817 0t0 TCP *:63000 (LISTEN)
> pbs_serve 21247 root 8u IPv4 14268819 0t0 TCP hpc1.cc.uit.no:15001 (LISTEN)
> [root@hpc1 torque]# hostname
> hpc1.cc.uit.no
>
> Under torque 4, pbs_server listens only on the interface associated with
> the hostname, which makes it ignore all internal traffic, so compute nodes
> cannot contact it.
>
> Is it possible to get the old behaviour back? Using the -H flag doesn't
> help as it breaks other things when communicating with the compute nodes.
> The only way to get torque 4 functional was to set the frontend hostname
> to the same as the one on the internal interface, hpc1.local, which isn't
> a good solution.
>
> r.
> --
>
> The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
> phone:+47 77 64 41 07, fax:+47 77 64 41 00
> Roy Dragseth, Team Leader, High Performance Computing
> Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120501/927464d7/attachment-0001.html

From stevenx.a.duchene at intel.com Tue May 1 10:22:48 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 1 May 2012 16:22:48 +0000
Subject: [torqueusers] Configure pbs_server to listen to all interfaces (torque 4)?
In-Reply-To:
References: <1367310.7tLNPAkoDo@lux>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF80606129B@ORSMSX106.amr.corp.intel.com>

So torque should not be listening on TCP port 63000?

What SVN branch had this fixed? I am running branch 6052 I checked out on
4/27 and I still see it.
--
Steven DuChene

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Tuesday, May 01, 2012 9:16 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Configure pbs_server to listen to all interfaces (torque 4)?

Roy,

TORQUE 4 intentionally doesn't listen for UDP communication because all
communication in TORQUE 4.0 is TCP. This is done to increase reliability
since router protocols allow them to drop UDP packets when they have a high
load. By the way, listening on port 63000 was a bug introduced between 4.0.0
and 4.0.1 that has been fixed.

David

On Mon, Apr 30, 2012 at 4:24 PM, Roy Dragseth wrote:
On v3 and earlier pbs_server would listen to all interfaces per default,
but this seems to have changed under v4.

Version 3.0.2
# lsof -i | grep pbs
pbs_serve 17765 root 8u IPv4 383811137 TCP *:15001 (LISTEN)
pbs_serve 17765 root 10u IPv4 383811140 UDP *:15001
pbs_serve 17765 root 11u IPv4 383811141 UDP *:1023

Version 4.0.1 (snapshot)
[root@hpc1 torque]# lsof -i | grep pbs
pbs_serve 21247 root 7u IPv4 14268817 0t0 TCP *:63000 (LISTEN)
pbs_serve 21247 root 8u IPv4 14268819 0t0 TCP hpc1.cc.uit.no:15001 (LISTEN)
[root@hpc1 torque]# hostname
hpc1.cc.uit.no

under torque 4 pbs_server listen only to the interface associated with the
hostname which makes it ignore all internal traffic thus compute nodes
cannot contact it.

Is it possble to get the old behaviour back? Using the -H flag doesn't help
as it breaks other things when communicating with the compute nodes. The
only way to get torque 4 functional was to set the frontend hostname to the
same as the one on the internal interface, hpc1.local, which isn't a good
solution.

r.
--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

--
David Beer | Software Engineer
Adaptive Computing

From dbeer at adaptivecomputing.com Tue May 1 10:51:42 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 1 May 2012 10:51:42 -0600
Subject: [torqueusers] Configure pbs_server to listen to all interfaces (torque 4)?
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF80606129B@ORSMSX106.amr.corp.intel.com>
References: <1367310.7tLNPAkoDo@lux> <560DBE57F33C4C4C9FBF11C662951AF80606129B@ORSMSX106.amr.corp.intel.com>
Message-ID:

If you check out the current 4.0.1 branch of TORQUE, it is fixed there.
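
For readers comparing the two lsof listings in this thread (`*:15001` under 3.0.2 versus `hpc1.cc.uit.no:15001` under 4.0.1), here is a minimal sketch of the difference between a wildcard bind and a bind to one address. It is plain Python, not TORQUE code; the helper name `listen_on` is made up for illustration, and it assumes nothing about how pbs_server chooses its address internally.

```python
import socket


def listen_on(address, port=0):
    """Open a TCP listener (hypothetical helper, for illustration only).

    An empty address asks the kernel for the IPv4 wildcard, i.e. every
    interface; port 0 lets the kernel pick any free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((address, port))
    sock.listen(5)
    return sock


# Wildcard bind: reachable through any interface (lsof shows "*:<port>").
wildcard = listen_on("")
# Single-address bind: reachable only through the interface owning it.
specific = listen_on("127.0.0.1")

wildcard_addr = wildcard.getsockname()[0]
specific_addr = specific.getsockname()[0]
print("wildcard listener bound to", wildcard_addr)
print("specific listener bound to", specific_addr)

wildcard.close()
specific.close()
```

A daemon bound the second way to the address behind the public hostname would not accept connections arriving on a private cluster interface, which matches the symptom Roy describes.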
David

On Tue, May 1, 2012 at 10:22 AM, DuChene, StevenX A <stevenx.a.duchene at intel.com> wrote:
> So torque should not be listening on TCP port 63000 ?
>
> What SVN branch had this fixed? I am running branch 6052 I checked out on
> 4/27 and I still see it.
> --
> Steven DuChene
>
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
> Sent: Tuesday, May 01, 2012 9:16 AM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] Configure pbs_server to listen to all interfaces (torque 4)?
>
> Roy,
>
> TORQUE 4 intentionally doesn't listen for UDP communication because all
> communication in TORQUE 4.0 is TCP. This is done to increase reliability
> since router protocols allow them to drop UDP packets when they have a high
> load. By the way, listening on port 63000 was a bug introduced between
> 4.0.0 and 4.0.1 that has been fixed.
>
> David
> On Mon, Apr 30, 2012 at 4:24 PM, Roy Dragseth wrote:
> On v3 and earlier pbs_server would listen to all interfaces per default,
> but this seems to have changed under v4.
>
> Version 3.0.2
> # lsof -i | grep pbs
> pbs_serve 17765 root 8u IPv4 383811137 TCP *:15001 (LISTEN)
> pbs_serve 17765 root 10u IPv4 383811140 UDP *:15001
> pbs_serve 17765 root 11u IPv4 383811141 UDP *:1023
>
> Version 4.0.1 (snapshot)
> [root@hpc1 torque]# lsof -i | grep pbs
> pbs_serve 21247 root 7u IPv4 14268817 0t0 TCP *:63000 (LISTEN)
> pbs_serve 21247 root 8u IPv4 14268819 0t0 TCP hpc1.cc.uit.no:15001 (LISTEN)
> [root@hpc1 torque]# hostname
> hpc1.cc.uit.no
>
> under torque 4 pbs_server listen only to the interface associated with the
> hostname which makes it ignore all
> internal traffic thus compute nodes cannot contact it.
>
> Is it possble to get the old behaviour back? Using the -H flag doesn't
> help as it breaks other things when
> communicating with the compute nodes. The only way to get torque 4
> functional was to set the frontend
> hostname to the same as the one on the internal interface, hpc1.local,
> which isn't a good solution.
>
> r.
> --
>
> The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
> phone:+47 77 64 41 07, fax:+47 77 64 41 00
> Roy Dragseth, Team Leader, High Performance Computing
> Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> --
> David Beer | Software Engineer
> Adaptive Computing
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120501/8ff6c941/attachment.html

From dbeer at adaptivecomputing.com Tue May 1 14:14:50 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 1 May 2012 14:14:50 -0600
Subject: [torqueusers] pbs_mom dies on exit of interactive session
In-Reply-To: <5712849.CGKjQuDxcD@lux>
References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> <1509369.qy0g8YhCvS@lux> <5712849.CGKjQuDxcD@lux>
Message-ID:

On Mon, Apr 30, 2012 at 2:39 PM, Roy Dragseth wrote:
> No, problem. I should probably have used the bugzilla for these kind of
> things anyway.
>
> I'll try to scramble together a more thorough bug report on this matter.
>
> BTW, is 4.0.X supposed to be config compatible with 3.0.X? That is, if
> 3.0.X works fine, can I assume that the same config will work on 4.0.X.
> Would any differences in behaviour indicate a bug is present?
> The reason I'm asking is that I have a fairly solid setup for v3.0.4 in my
> torque-roll for Rocks, but I'm having a hard time getting 4.0.X working in
> the same environment. Being pressed for time I would like a hint on where
> to look for problems.

You can't assume that the configuration will work exactly the same for 4.0.X.
It depends on what your setup is. I recommend looking into the 4.0 docs on the
website for differences. We kept things the same where possible/desirable, but
where we wanted to make changes we did, because this is a major release.

David

> Regards,
>
> r.
>
> On Monday 30. April 2012 09.59.39 David Beer wrote:
>
> Roy,
>
> I'm sorry your message wasn't attended to - it's the nature of a mailing
> list that sometimes messages are lost. Can you (or Steve) describe the TM
> connect issues in a bit more detail? We should get a bug report filed for
> this.
>
> David
>
> On Sun, Apr 29, 2012 at 11:03 AM, Roy Dragseth wrote:
>
> I can confirm this too (reported the same problem to the mailing list a
> month ago, but nobody seemed to care). Torque 3.0.4 works fine with exactly
> the same system config.
>
> r.
>
> On Saturday 28. April 2012 03.21.05 DuChene, StevenX A wrote:
>
> I am running torque-4.0.1 that I pulled from the svn 4.0.1 branch just
> today.
> Earlier today I was running the 4.0-fixes tree from 04/03 and I had the
> same results.
> I was hoping the update to current sources would fix these problems but no
> such luck.
>
> If I run the following:
>
> qsub -I -l nodes=7 -l arch=atomN570
>
> from my pbs job submission host I get:
>
> qsub: waiting for job 4.login2.sep.here to start
> qsub: job 4.login2.sep.here ready
>
> and then I get a shell prompt on the node 0 of this job.
>
> If I then do:
>
> $ echo $PBS_NODEFILE
> /var/spool/torque/aux//4.login2.sep.here
>
> And then:
>
> $ cat /var/spool/torque/aux//4.login2.sep.here
> atom255
> atom255
> atom255
> atom255
> atom254
> atom254
> atom254
>
> and then I try:
>
> $ pbsdsh -h atom254 ls /tmp
> pbsdsh: error from tm_poll() 17002
>
> Alternatively if I use the -v option it says:
>
> $ pbsdsh -v -h atom254 /bin/ls /tmp
> pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)
>
> Then when I exit the shell I get back to my job submission node but when I
> then check for the pbs_mom process on that node 0 for the job that I just
> had the shell prompt on, the pbs_mom process is no longer running and the
> status message from the init script says:
>
> pbs_mom dead but subsys locked
>
> Also if I run any mpi jobs through torque that are run on cpu cores on one
> system, they seem to run just fine but for mpi jobs where the job spans
> across separate nodes I get the following error out each time:
>
> $ cat script_viking_7nodes.pbs.e12
> [viking12.sep.here:12360] plm:tm: failed to poll for a spawned daemon, return status = 17002
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> viking11.sep.here - daemon did not report back when launched
>
> The PBS job script for this MPI job is very simple:
>
> #PBS -l nodes=7:Viking:ppn=1
> #PBS -l arch=xeon
>
> mpirun --machinefile $PBS_NODEFILE /home/sadX/mpi_test/mpi_hello_hostname > hello_out_viking7nodes
>
> Also if I ssh to any node in my cluster and create a mpi nodes file by
> hand and then use that to run like this:
>
> mpirun --machinefile ../nodefile ./mpi_hello_hostname
>
> I get back all of the expected output from all of the nodes listed in the
> manually created nodes file.
> --
> Steven DuChene
>
> --
>
> The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
> phone:+47 77 64 41 07, fax:+47 77 64 41 00
> Roy Dragseth, Team Leader, High Performance Computing
> Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> --
> David Beer | Software Engineer
> Adaptive Computing
>
> --
>
> The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
> phone:+47 77 64 41 07, fax:+47 77 64 41 00
> Roy Dragseth, Team Leader, High Performance Computing
> Direct call: +47 77 64 62 56.
> email: roy.dragseth at uit.no
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120501/8f6f740d/attachment-0001.html

From dbeer at adaptivecomputing.com Tue May 1 14:15:34 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 1 May 2012 14:15:34 -0600
Subject: [torqueusers] pbs_mom dies on exit of interactive session
In-Reply-To: <1907106.CD0yMpC1j1@lux>
References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> <5712849.CGKjQuDxcD@lux> <1907106.CD0yMpC1j1@lux>
Message-ID:

Steve and Roy,

Thanks for this information. You can expect this bug to be fixed in 4.0.1.

David

On Mon, Apr 30, 2012 at 4:07 PM, Roy Dragseth wrote:
> Here is a job excerpt demonstrating error messages when running pbsdsh.
>
> The pbs_mom thread dies if you try to run pbsdsh -u.
>
> dmesg shows a segfault
>
> [marve@hpc1 ~]$ qsub -I -lnodes=2:ppn=2,walltime=1000
> qsub: waiting for job 13.hpc1.cc.uit.no to start
> qsub: job 13.hpc1.cc.uit.no ready
>
> [marve@compute-0-2 ~]$ pbsdsh uname -a
> Linux compute-0-2.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
> pbsdsh: Event poll failed, error TM_ENOTCONNECTED
> Linux compute-0-0.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
> Linux compute-0-2.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
> Linux compute-0-0.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
> pbsdsh: reconnected
> pbsdsh: Event poll failed, error TM_ENOTFOUND
> [marve@compute-0-2 ~]$
> [marve@compute-0-2 ~]$ pbsdsh -u uname -a
> [marve@compute-0-2 ~]$ pbsdsh -u uname -a
> [marve@compute-0-2 ~]$ dmesg | tail -n1
> pbs_mom[1980]: segfault at 20 ip 000000000040b240 sp 00007fffc1853820 error 4 in pbs_mom[400000+5f000]
> [marve@compute-0-2 ~]$ logout
>
> qsub: job 13.hpc1.cc.uit.no completed
>
> This is using torque-4.0.1-snap.201204031702.tar.gz
>
> (the problems related to getting 4.0.X up and running seem to be related
> to the fact that pbs_server now only listens on one interface; earlier it
> listened on all interfaces. I'll post a separate report for this)
>
> r.
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120501/b5b1ded8/attachment.html

From stevenx.a.duchene at intel.com Tue May 1 14:41:53 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 1 May 2012 20:41:53 +0000
Subject: [torqueusers] pbs_mom dies on exit of interactive session
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> <5712849.CGKjQuDxcD@lux> <1907106.CD0yMpC1j1@lux>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF8060613C1@ORSMSX106.amr.corp.intel.com>

Are you talking about the problem with pbs_mom dying or are you talking
about the issue with the tm errors when trying to run jobs between nodes?

I am setting the loglevel on my pbs_mom daemons to 7 and then resubmitting
the interactive job across those two specific nodes. The only thing I see in
the mom logs is the following after ONLY the initial pbsdsh request:

05/01/2012 13:22:19;0001; pbs_mom;Job;26.login2.sep.here;cannot tm_reply to task 1

If I continue rerunning the same pbsdsh command after that between the two
nodes, I do not see any further updates in the mom logs relating to my
communication request. I see normal log entries but nothing indicating there
were any additional requests for job communication or problems logged.

My next thought is to look back at the pbs_server logs to see if somehow the
problem is some sort of issue on the server instead of the client nodes.
--
Steven DuChene

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Tuesday, May 01, 2012 1:16 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] pbs_mom dies on exit of interactive session

Steve and Roy,

Thanks for this information. You can expect this bug to be fixed in 4.0.1.

David

On Mon, Apr 30, 2012 at 4:07 PM, Roy Dragseth wrote:
Here is a job excerpt demonstrating error messages when running pbsdsh.
The pbs_mom thread die if you try to run pbsdsh -u.
dmesg shows a segfault

[marve@hpc1 ~]$ qsub -I -lnodes=2:ppn=2,walltime=1000
qsub: waiting for job 13.hpc1.cc.uit.no to start
qsub: job 13.hpc1.cc.uit.no ready

[marve@compute-0-2 ~]$ pbsdsh uname -a
Linux compute-0-2.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
pbsdsh: Event poll failed, error TM_ENOTCONNECTED
Linux compute-0-0.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
Linux compute-0-2.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
Linux compute-0-0.local 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
pbsdsh: reconnected
pbsdsh: Event poll failed, error TM_ENOTFOUND
[marve@compute-0-2 ~]$
[marve@compute-0-2 ~]$ pbsdsh -u uname -a
[marve@compute-0-2 ~]$ pbsdsh -u uname -a
[marve@compute-0-2 ~]$ dmesg | tail -n1
pbs_mom[1980]: segfault at 20 ip 000000000040b240 sp 00007fffc1853820 error 4 in pbs_mom[400000+5f000]
[marve@compute-0-2 ~]$ logout

qsub: job 13.hpc1.cc.uit.no completed

This is using torque-4.0.1-snap.201204031702.tar.gz

(the problems related to getting 4.0.X up an running seems to be related to
the fact that pbs_server now only listens to one interface, earlier it
listened to all interfaces. I'll post a separate report for this)

r.

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120501/d19fa86b/attachment-0001.html

From bircoph at gmail.com Tue May 1 08:12:30 2012
From: bircoph at gmail.com (Andrew Savchenko)
Date: Tue, 1 May 2012 18:12:30 +0400
Subject: [torqueusers] cpuset on Torque 2.4.16, CentOS 6.2, AMD Opteron Bulldozer
Message-ID: <20120501181230.8d0592d9.bircoph@gmail.com>

Hello,

> On 31/03/12 09:28, David Gabriel Simas wrote:
>
> > mount -t cgroup -o cpuset,noprefix X /sys/fs/cgroup/cpuset
>
> Or just:
>
> mount -t cpuset - /sys/fs/cgroup/cpuset

I have the very same problem on Gentoo with a 3.2.14 vanilla kernel and
torque-3.0.4, but the solution above doesn't help. Every job fails to run
because pbs_mom is unable to create a cpuset for it. From pbs_mom.log:

05/01/2012 04:09:11;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE
05/01/2012 04:09:11;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Retry job exec failure, retry will be attempted (see syslog for more information)
05/01/2012 04:09:11;0001; pbs_mom;Job;5.master;ALERT: job failed phase 3 start
05/01/2012 04:09:11;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 5.master
05/01/2012 04:09:11;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
05/01/2012 04:09:11;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
05/01/2012 04:09:11;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
05/01/2012 04:09:11;0080; pbs_mom;Job;5.master;obit sent to server
05/01/2012 04:09:12;0080; pbs_mom;Job;5.master;removed job script

And in syslog:

May 01 04:09:11 [pbs_mom] LOG_ERROR::TMomFinalizeChild, Could not create cpuset for job 5.master

/sys/fs/cgroup/cpuset and /dev/cpuset are both mounted as cpuset filesystem
type:

$ mount | egrep "cpuset|cgroup"
cgroup_root on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,size=10240k,mode=755)
openrc on /sys/fs/cgroup/openrc type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/lib64/rc/sh/cgroup-release-agent.sh,name=openrc)
none on /dev/cpuset type cpuset (rw)
- on /sys/fs/cgroup/cpuset type cpuset (rw)

Their contents are the same, with the "cpuset." prefix. It looks like this
change was made in the 3.0 kernel; I only skimmed the changelog, so I may be
mistaken here, but at least it works on 2.6.38 and fails on the 3.2.14
kernel.

I wrote a simple patch to account for the path changes depending on the
Linux kernel version. I verified that with this patch tasks are running. I
have not yet verified that cpuset resources are allocated properly by the
scheduler.

Best regards,
Andrew Savchenko
-------------- next part --------------
A non-text attachment was scrubbed...
Name: torque-3.0.4-cpusets.patch
Type: application/octet-stream
Size: 5103 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120501/aa59c06b/attachment-0001.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120501/aa59c06b/attachment-0001.bin

From christina.salls at noaa.gov Tue May 1 11:49:02 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Tue, 1 May 2012 13:49:02 -0400
Subject: [torqueusers] python in qsub
Message-ID:

Hi all,

My configuration consists of Torque 2.5.9 running on RHEL 6 on a single
cluster. I have a user that is able to interactively run python and R, but
when his job is submitted to Torque, it fails with this message:

Traceback (most recent call last):
  File "../progs/Kriging_Batch.py", line 110, in <module>
    import rpy2.robjects as robjects
ImportError: No module named rpy2.robjects

I asked him to include a printenv in his script and compared it to the
environment variables that are inherent in his login. I could not tell from
the comparison what might be causing this problem.
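
A comparison like the one described above is often easier with a small script than with raw printenv output. The following is only a sketch (Python 3 syntax; the function name `describe_environment` and the list of variables are assumptions chosen for illustration, not anything Torque or rpy2 provides). Running it once in the login shell and once from inside the job script and diffing the two outputs shows whether the batch job picks up a different interpreter or search path.

```python
import os
import sys


def describe_environment(module_name="rpy2.robjects"):
    """Collect details that commonly differ between a login shell and a job.

    Hypothetical helper: the variables listed here are a guess at what is
    relevant for a Python + R setup, not an exhaustive list.
    """
    report = {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        "PATH": os.environ.get("PATH", ""),
        "PYTHONPATH": os.environ.get("PYTHONPATH", ""),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
        "R_HOME": os.environ.get("R_HOME", ""),
        "sys.path": list(sys.path),
    }
    try:
        __import__(module_name)
        report["import"] = "ok"
    except Exception as exc:  # ImportError, or a failure while loading R
        report["import"] = "failed: %s" % exc
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("%s = %s" % (key, value))
```

If `executable` or `sys.path` differs between the two runs, the job is most likely starting a different python than the interactive shell does.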
His script looks like this:

[root@zeus batch]# more runkrig.sh
#!/bin/sh
#submit with qsub runkrig.sh
#PBS -N GaugesJob
#PBS -l select=1:ncpus=1
#PBS -q hydrology
#PBS -M tim.hunter at noaa.gov
source /opt/intel/composer_xe_2011_sp1.7.256/bin/compilervars.sh intel64
cd /zeus/d2/users/hunter/lbrm/kriging/batch
echo 'Begin Kriging of Met Data with python and R'
date
python ../progs/Kriging_Batch.py $1 $2 $3
echo 'Kriging operations complete!'
date

The output file looks like this:

[root@zeus batch]# more GaugesJob.o786
Begin Kriging of Met Data with python and R
Fri Apr 27 13:44:42 CDT 2012
Kriging operations complete!
Fri Apr 27 13:44:42 CDT 2012

The Kriging_Batch.py script works just fine interactively. If I run the
import command interactively, it also works, e.g.

python -c "import rpy2.robjects as robjects"

I am sure there is a simple explanation, and if any of you have any clues to
lead me in the right direction, I would greatly appreciate it. Even in the
torque environment, the other python imports are working properly. It only
seems to be choking on the rpy2 import. This is the relevant portion of the
script:

#--------------------------------------------------------------------------------------
import os
from os.path import normpath
import sys
import shutil
import csv
import rpy2.robjects as robjects
#
r = robjects.r
path = os.getcwd()
print "curdir = %s" %path
#
oldPath = os.environ['PATH'].split(os.pathsep)
newPath = os.environ['PATH'].split(os.pathsep)
newPath = os.pathsep.join(newPath[len(oldPath):] + newPath[:len(oldPath)])
#
# Load the R functions
#
print "load the R functions"
r.load('/zeus/d2/users/hunter/lbrm/kriging/progs/do_krig_batch.RData')
#
# Force stdout into an unbuffered mode
#
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)

Any advice, clues, hints, moral support, etc.... welcomed and appreciated.

Thanks,

Christina

--
Christina A. Salls
GLERL Computer Group help.glerl at noaa.gov Help Desk x2127
Christina.Salls at noaa.gov Voice Mail 734-741-2446
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120501/7e945cb8/attachment-0001.html

From knielson at adaptivecomputing.com Tue May 1 16:29:54 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Tue, 1 May 2012 16:29:54 -0600
Subject: [torqueusers] pbs_mom dies on exit of interactive session
In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com>
References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com>
Message-ID:

On Fri, Apr 27, 2012 at 9:21 PM, DuChene, StevenX A <stevenx.a.duchene at intel.com> wrote:
> I am running torque-4.0.1 that I pulled from the svn 4.0.1 branch just
> today.
> Earlier today I was running the 4.0-fixes tree from 04/03 and I had the
> same results.
> I was hoping the update to current sources would fix these problems but no
> such luck.
>
> If I run the following:
>
> qsub -I -l nodes=7 -l arch=atomN570
>
> from my pbs job submission host I get:
>
> qsub: waiting for job 4.login2.sep.here to start
> qsub: job 4.login2.sep.here ready
>
> and then I get a shell prompt on the node 0 of this job.
>
> If I then do:
>
> $ echo $PBS_NODEFILE
> /var/spool/torque/aux//4.login2.sep.here
>
> And then:
>
> $ cat /var/spool/torque/aux//4.login2.sep.here
> atom255
> atom255
> atom255
> atom255
> atom254
> atom254
> atom254
>
> and then I try:
>
> $ pbsdsh -h atom254 ls /tmp
> pbsdsh: error from tm_poll() 17002
>
> Alternatively if I use the -v option it says:
>
> $ pbsdsh -v -h atom254 /bin/ls /tmp
> pbsdsh: tm_init failed, rc = TM_ESYSTEM (17000)
>
> Then when I exit the shell I get back to my job submission node but when I
> then check for the pbs_mom process on that node 0 for the job that I just
> had the shell prompt on, the pbs_mom process is no longer running and the
> status message from the init script says:
>
> pbs_mom dead but subsys locked
>
> Also if I run any mpi jobs through torque that are run on cpu cores on one
> system, they seem to run just fine but for mpi jobs where the job spans
> across separate nodes I get the following error out each time:
>
> $ cat script_viking_7nodes.pbs.e12
> [viking12.sep.here:12360] plm:tm: failed to poll for a spawned daemon, return status = 17002
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> viking11.sep.here - daemon did not report back when launched
>
> The PBS job script for this MPI job is very simple:
>
> #PBS -l nodes=7:Viking:ppn=1
> #PBS -l arch=xeon
>
> mpirun --machinefile $PBS_NODEFILE /home/sadX/mpi_test/mpi_hello_hostname > hello_out_viking7nodes
>
> Also if I ssh to any node in my cluster and create a mpi nodes file by
> hand and then use that to run like this:
>
> mpirun --machinefile ../nodefile ./mpi_hello_hostname
>
> I get back all of the expected output from all of the nodes listed in the
> manually created nodes file.
> --
> Steven DuChene
>

Steve,

These issues are fixed and checked into the 4.0.1 branch and 4.0-fixes, if
you want to try them out.

Ken
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120501/e0b938dd/attachment.html

From stevenx.a.duchene at intel.com Tue May 1 16:37:01 2012
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 1 May 2012 22:37:01 +0000
Subject: [torqueusers] pbs_mom dies on exit of interactive session
In-Reply-To:
References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com>
Message-ID: <560DBE57F33C4C4C9FBF11C662951AF806061492@ORSMSX106.amr.corp.intel.com>

From Ken Nielson:

Steve,

These issues are fixed and checked into the 4.0.1 branch and 4.0-fixes, if
you want to try them out.

Ken

Ok, I will pull the latest code and give it a shot.
I will let you know what I find out.
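
When re-testing a multi-node job, it can help to confirm first which hosts Torque actually handed to the job before running pbsdsh or mpirun. A minimal sketch in Python 3 follows; the helper name `slots_per_host` is made up for illustration, and the only thing taken from the thread is that `$PBS_NODEFILE` lists one host name per allocated slot.

```python
import os
from collections import Counter


def slots_per_host(nodefile_path):
    """Count how often each host appears in a Torque node file.

    Hypothetical helper: assumes one host name per line, repeated once
    per allocated slot, as in the `cat $PBS_NODEFILE` output in this thread.
    """
    with open(nodefile_path) as handle:
        hosts = [line.strip() for line in handle if line.strip()]
    return Counter(hosts)


if __name__ == "__main__":
    path = os.environ.get("PBS_NODEFILE")
    if path is None:
        print("PBS_NODEFILE is not set; run this inside a Torque job")
    else:
        for host, slots in sorted(slots_per_host(path).items()):
            print("%s: %d slot(s)" % (host, slots))
```

For the node file quoted earlier in the thread this would report four slots on atom255 and three on atom254, so a pbsdsh call with `-h atom254` targets a host that is part of the job.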
-- Steven DuChene From stevenx.a.duchene at intel.com Tue May 1 18:26:26 2012 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Wed, 2 May 2012 00:26:26 +0000 Subject: [torqueusers] pbs_mom dies on exit of interactive session (SOLVED!!!) In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF806061492@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF806061492@ORSMSX106.amr.corp.intel.com> Message-ID: <560DBE57F33C4C4C9FBF11C662951AF8060614EF@ORSMSX106.amr.corp.intel.com> AH HA! It WORKED!!! MPI hello world run through interactive job: mpirun --machinefile $PBS_NODEFILE ./mpi_hello_hostname Hello world! I am process number: 0 on host supatom03.sep.here Hello world! I am process number: 1 on host supatom04.sep.here Pbsdsh command run during interactive job: $ pbsdsh -v -h supatom04.sep.here ls /tmp pbsdsh(): rescinfo from 0: Linux supatom03.sep.here 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64:nodes=1:dotest+1:nottest,walltime=01:00:00 pbsdsh(): rescinfo from 1: Linux supatom04.sep.here 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64:nodes=1:dotest+1:nottest,walltime=01:00:00 pbsdsh(): spawned task 1 yum_save_tx-2012-04-27-14-03GPU64R.yumtx pbsdsh(): spawn event returned: 1 (1 spawns and 0 obits outstanding) pbsdsh(): sending obit for task 3 pbsdsh(): obit event returned: 1 (0 spawns and 1 obits outstanding) pbsdsh(): task 1 exit status 0 And my simple MPI hello world program FINALLY gave me the following output when run through torque: Hello world! I am process number: 0 on host supatom08.sep.here Hello world! I am process number: 1 on host supatom03.sep.here Hello world! I am process number: 2 on host supatom04.sep.here Hello world! I am process number: 4 on host supatom06.sep.here Hello world! I am process number: 3 on host supatom05.sep.here Hello world! I am process number: 5 on host supatom07.sep.here Hello world! 
I am process number: 6 on host supatom02.sep.here YAHOO!!! Thanks all -- Steven DuChene From knielson at adaptivecomputing.com Tue May 1 20:50:09 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 1 May 2012 20:50:09 -0600 Subject: [torqueusers] pbs_mom dies on exit of interactive session (SOLVED!!!) In-Reply-To: <560DBE57F33C4C4C9FBF11C662951AF8060614EF@ORSMSX106.amr.corp.intel.com> References: <560DBE57F33C4C4C9FBF11C662951AF806059D65@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF806061492@ORSMSX106.amr.corp.intel.com> <560DBE57F33C4C4C9FBF11C662951AF8060614EF@ORSMSX106.amr.corp.intel.com> Message-ID: On Tue, May 1, 2012 at 6:26 PM, DuChene, StevenX A < stevenx.a.duchene at intel.com> wrote: > AH HA! > > It WORKED!!! > > MPI hello world run through interactive job: > > mpirun --machinefile $PBS_NODEFILE ./mpi_hello_hostname > Hello world! I am process number: 0 on host supatom03.sep.here > Hello world! I am process number: 1 on host supatom04.sep.here > > Pbsdsh command run during interactive job: > > $ pbsdsh -v -h supatom04.sep.here ls /tmp > pbsdsh(): rescinfo from 0: Linux supatom03.sep.here > 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 > x86_64:nodes=1:dotest+1:nottest,walltime=01:00:00 > pbsdsh(): rescinfo from 1: Linux supatom04.sep.here > 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 > x86_64:nodes=1:dotest+1:nottest,walltime=01:00:00 > pbsdsh(): spawned task 1 > yum_save_tx-2012-04-27-14-03GPU64R.yumtx > pbsdsh(): spawn event returned: 1 (1 spawns and 0 obits outstanding) > pbsdsh(): sending obit for task 3 > pbsdsh(): obit event returned: 1 (0 spawns and 1 obits outstanding) > pbsdsh(): task 1 exit status 0 > > And my simple MPI hello world program FINALLY gave me the following output > when run through torque: > > Hello world! I am process number: 0 on host supatom08.sep.here > Hello world! I am process number: 1 on host supatom03.sep.here > Hello world! 
I am process number: 2 on host supatom04.sep.here > Hello world! I am process number: 4 on host supatom06.sep.here > Hello world! I am process number: 3 on host supatom05.sep.here > Hello world! I am process number: 5 on host supatom07.sep.here > Hello world! I am process number: 6 on host supatom02.sep.here > > YAHOO!!! > > Thanks all > -- > Steven DuChene > > And at Adaptive Computing there was much rejoicing. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120501/1be64e9e/attachment-0001.html From Gareth.Williams at csiro.au Tue May 1 22:12:45 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Wed, 2 May 2012 14:12:45 +1000 Subject: [torqueusers] Starting intel mpi tasks in torque - pbsssh wrapper Message-ID: <007DECE986B47F4EABF823C1FBB19C6201050094DC9A@exvic-mbx04.nexus.csiro.au> Hi All, I just figured this out after debugging some problems with left over processes from intel mpi and you may find it useful Recent versions of intel mpi use a variable I_MPI_MPD_RSH to specify the program (ssh) used to launch the mpd processes (maybe old versions have this too). The program needs to return some info on stdout. I had to modify an existing pbsdsh wrapper to add the pbsdsh -o option to make it all work. 
The wrapper is:

~/sysadmin/ascutils/common> cat pbsssh
#!/bin/bash
# $Id: pbsssh 2236 2012-05-02 03:16:17Z wil240 $
# $HeadURL: svn+ssh://stream/cs/home/svn/sysadmin/ascutils/common/pbsssh $
usage="usage: $0 "
#swallow -x -n and -q (for intel mpi)
while getopts "xqn" opt
do
  :
done
shift $((OPTIND-1))
if [ $# -lt 2 ]
then
  echo $usage
  exit
fi
node=$1
shift
exec pbsdsh -o -h $node "$@"

And the before and after behavior is illustrated here (note that the mpi tasks now run in a cpuset under the control of torque):

wil240 at n001:~> mpirun -n 2 sh -c 'echo -n `hostname` | cat - /proc/$$/cpuset'
n002/
n001/torque/1092920.burnet-srv.idpx.hpsc.csiro.au
wil240 at n001:~> export I_MPI_MPD_RSH=pbsssh
wil240 at n001:~> mpirun -n 2 sh -c 'echo -n `hostname` | cat - /proc/$$/cpuset'
n001/torque/1092920.burnet-srv.idpx.hpsc.csiro.au
n002/torque/1092920.burnet-srv.idpx.hpsc.csiro.au

We'll be updating our intel-mpi environment module to set I_MPI_MPD_RSH=pbsssh

Feedback is welcome.

Gareth
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120502/3d546a62/attachment.html
From roy.dragseth at cc.uit.no Wed May 2 01:18:22 2012
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Wed, 02 May 2012 09:18:22 +0200
Subject: [torqueusers] Configure pbs_server to listen to all interfaces (torque 4)?
In-Reply-To:
References: <1367310.7tLNPAkoDo@lux>
Message-ID: <5100794.TS57uori0o@lux>

OK, I will not shed a tear over UDP reliance going away in torque;-)

But, as you see from the lsof output, pbs_server also listens for TCP connections on port 15001 on all interfaces in versions 3 and below. Version 4 does not; is it possible to fix this? Maybe it would be possible to make -H take a list of names, or maybe accept a * argument?

r.

On Tuesday 1. May 2012 10.15.36 David Beer wrote:
Roy,
TORQUE 4 intentionally doesn't listen for UDP communication because all communication in TORQUE 4.0 is TCP.
This is done to increase reliability since router protocols allow them to drop UDP packets when they have a high load. By the way, listening on port 63000 was a bug introduced between 4.0.0 and 4.0.1 that has been fixed. David On Mon, Apr 30, 2012 at 4:24 PM, Roy Dragseth wrote: On v3 and earlier pbs_server would listen to all interfaces per default, but this seems to have changed under v4. Version 3.0.2 # lsof -i | grep pbs pbs_serve 17765 root 8u IPv4 383811137 TCP *:15001 (LISTEN) pbs_serve 17765 root 10u IPv4 383811140 UDP *:15001 pbs_serve 17765 root 11u IPv4 383811141 UDP *:1023 Version 4.0.1 (snapshot) [root at hpc1 torque]# lsof -i | grep pbs pbs_serve 21247 root 7u IPv4 14268817 0t0 TCP *:63000 (LISTEN) pbs_serve 21247 root 8u IPv4 14268819 0t0 TCP hpc1.cc.uit.no:15001 (LISTEN) [root at hpc1 torque]# hostname hpc1.cc.uit.no under torque 4 pbs_server listen only to the interface associated with the hostname which makes it ignore all internal traffic thus compute nodes cannot contact it. Is it possble to get the old behaviour back? Using the -H flag doesn't help as it breaks other things when communicating with the compute nodes. The only way to get torque 4 functional was to set the frontend hostname to the same as the one on the internal interface, hpc1.local, which isn't a good solution. r. -- The Computer Center, University of Troms?, N-9037 TROMS? Norway. phone:+47 77 64 41 07, fax:+47 77 64 41 00 Roy Dragseth, Team Leader, High Performance Computing Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Software Engineer Adaptive Computing -- The Computer Center, University of Troms?, N-9037 TROMS? Norway. phone:+47 77 64 41 07, fax:+47 77 64 41 00 Roy Dragseth, Team Leader, High Performance Computing Direct call: +47 77 64 62 56. 
email: roy.dragseth at uit.no -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120502/085deeeb/attachment-0001.html From roy.dragseth at cc.uit.no Wed May 2 03:35:17 2012 From: roy.dragseth at cc.uit.no (Roy Dragseth) Date: Wed, 02 May 2012 11:35:17 +0200 Subject: [torqueusers] Configure pbs_server to listen to all interfaces (torque 4)? In-Reply-To: <5100794.TS57uori0o@lux> References: <1367310.7tLNPAkoDo@lux> <5100794.TS57uori0o@lux> Message-ID: <1475541.MNZxbqLmuT@newton.cc.uit.no> Just to elaborate on what I'm struggling with right now. If I make pbs_server listen on the private network port, eth0, by using -H hpc1.local, then jobs get rejected unless I set the hostname to the internal name: # hostname hpc1.cc.uit.no # su - marve [marve at hpc1 ~]$ echo uname -a | qsub -lnodes=1,walltime=100 qsub: submit error (Bad UID for job execution MSG=ruserok failed validating marve/marve from hpc1.cc.uit.no) [marve at hpc1 ~]$ logout [root at hpc1 ~]# hostname hpc1.local [root at hpc1 ~]# su - marve [marve at hpc1 ~]$ echo uname -a | qsub -lnodes=1,walltime=100 2.hpc1.local So the job get rejected by pbs_mom because hpc1.cc.uit.no isn't authorized to send jobs it (or should we say her?). If I change the hostname from hpc1.cc.uit.no to hpc1.local everything is suddenly hunky-dory. Any suggestions for fixes or workarounds? Changing the frontend hostname isn't feasible on a general basis as the default in Rocks is that the hostname should be the FQDN. r. On Wednesday, May 02, 2012 09:18:22 Roy Dragseth wrote: OK, I will not shed a tear over UDP reliance going away in torque;-) But, as you see from the lsof output, pbs_server also listen for TCP connections on port 15001 on all interfaces in versions 3 and below. Version 4 does not, is it possible to fix this? Maybe it would be possible to make -H take a list of names, or maybe accept a * argument? r. On Tuesday 1. 
May 2012 10.15.36 David Beer wrote: Roy, TORQUE 4 intentionally doesn't listen for UDP communication because all communication in TORQUE 4.0 is TCP. This is done to increase reliability since router protocols allow them to drop UDP packets when they have a high load. By the way, listening on port 63000 was a bug introduced between 4.0.0 and 4.0.1 that has been fixed. David On Mon, Apr 30, 2012 at 4:24 PM, Roy Dragseth wrote: On v3 and earlier pbs_server would listen to all interfaces per default, but this seems to have changed under v4. Version 3.0.2 # lsof -i | grep pbs pbs_serve 17765 root 8u IPv4 383811137 TCP *:15001 (LISTEN) pbs_serve 17765 root 10u IPv4 383811140 UDP *:15001 pbs_serve 17765 root 11u IPv4 383811141 UDP *:1023 Version 4.0.1 (snapshot) [root at hpc1 torque]# lsof -i | grep pbs pbs_serve 21247 root 7u IPv4 14268817 0t0 TCP *:63000 (LISTEN) pbs_serve 21247 root 8u IPv4 14268819 0t0 TCP hpc1.cc.uit.no:15001 (LISTEN) [root at hpc1 torque]# hostname hpc1.cc.uit.no under torque 4 pbs_server listen only to the interface associated with the hostname which makes it ignore all internal traffic thus compute nodes cannot contact it. Is it possble to get the old behaviour back? Using the -H flag doesn't help as it breaks other things when communicating with the compute nodes. The only way to get torque 4 functional was to set the frontend hostname to the same as the one on the internal interface, hpc1.local, which isn't a good solution. r. -- The Computer Center, University of Troms?, N-9037 TROMS? Norway. phone:+47 77 64 41 07, fax:+47 77 64 41 00 Roy Dragseth, Team Leader, High Performance Computing Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- David Beer | Software Engineer Adaptive Computing -- The Computer Center, University of Troms?, N-9037 TROMS? Norway. 
phone:+47 77 64 41 07, fax:+47 77 64 41 00 Roy Dragseth, Team Leader, High Performance Computing Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no -- The Computer Center, University of Troms?, N-9037 TROMS? Norway. phone:+47 77 64 41 07, fax:+47 77 64 41 00 Roy Dragseth, Team Leader, High Performance Computing Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120502/0b72bfea/attachment.html From gabe at msi.umn.edu Wed May 2 06:01:00 2012 From: gabe at msi.umn.edu (Gabe Turner) Date: Wed, 2 May 2012 07:01:00 -0500 Subject: [torqueusers] Starting intel mpi tasks in torque - pbsssh wrapper In-Reply-To: <007DECE986B47F4EABF823C1FBB19C6201050094DC9A@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C6201050094DC9A@exvic-mbx04.nexus.csiro.au> Message-ID: <20120502120100.GA32261@blackice.msi.umn.edu> On Wed, May 02, 2012 at 02:12:45PM +1000, Gareth.Williams at csiro.au wrote: > Hi All, > > I just figured this out after debugging some problems with left over processes from intel mpi and you may find it useful > > Recent versions of intel mpi use a variable I_MPI_MPD_RSH to specify the > program (ssh) used to launch the mpd processes (maybe old versions have > this too). The program needs to return some info on stdout. I had to > modify an existing pbsdsh wrapper to add the pbsdsh -o option to make it > all work. > Have you done much in the way of scalability testing using this wrapper? I wrote such a wrapper a few years ago when we were running, I think, Torque 2.3. I found I could not get pbsdsh to scale well beyond about 256 nodes. Beyond that, pbsdsh connections would begin timing out. Perhaps this will be more feasible with Torque 4 using tcp throughout. Did you consider crafting a wrapper that would ssh and spawn the mpd using pbs_track? 
--
Gabe Turner                                   gabe at msi.umn.edu
HPC Systems Administrator,
University of Minnesota
Supercomputing Institute                      http://www.msi.umn.edu
From roy.dragseth at cc.uit.no Wed May 2 06:09:31 2012
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Wed, 02 May 2012 14:09:31 +0200
Subject: [torqueusers] Configure pbs_server to listen to all interfaces (torque 4)?
In-Reply-To: <1475541.MNZxbqLmuT@newton.cc.uit.no>
References: <1367310.7tLNPAkoDo@lux> <5100794.TS57uori0o@lux> <1475541.MNZxbqLmuT@newton.cc.uit.no>
Message-ID: <2653322.xsCZMkWFhs@newton.cc.uit.no>

On Wednesday, May 02, 2012 11:35:17 Roy Dragseth wrote:
>
> [marve at hpc1 ~]$ echo uname -a | qsub -lnodes=1,walltime=100
> qsub: submit error (Bad UID for job execution MSG=ruserok failed validating marve/marve from hpc1.cc.uit.no)
>

Ok, I was barking up the wrong tree here. The error message came from
pbs_server not understanding that hpc1.local and hpc1.cc.uit.no are the same
host.

Setting server_name to hpc1.local and putting hpc1.cc.uit.no into
/etc/hosts.equiv seems to do the trick.

So to summarize, here is what seems to be needed to get Torque 4 going on a
multihomed cluster frontend where hostname resolves to the public interface.

System config:
Rocks (aka CentOS6)
eth1, public network. FQDN = hpc1.cc.uit.no
eth0, private network. Name = hpc1.local

pbs_server side setup:
Server hostname is FQDN, hpc1.cc.uit.no
Start server with
  pbs_server -H hpc1.local

Do normal config stuff +
  qmgr -c "set server server_name = hpc1.local"
(this seems to be needed for pbs_mom to know where to send back status etc.)

Add hpc1.cc.uit.no to /etc/hosts.equiv

pbs_mom side setup:
set $pbsserver to hpc1.local in mom_priv/config

That's about it.

IMO, the hosts.equiv stuff is rather ugly, so any hints on how to avoid it
are greatly appreciated.

(hopefully this will be today's last reply to my own post...)

r.

--

The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00 Roy Dragseth, Team Leader, High Performance Computing Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no From djohnson at osc.edu Wed May 2 06:15:35 2012 From: djohnson at osc.edu (Doug Johnson) Date: Wed, 02 May 2012 08:15:35 -0400 Subject: [torqueusers] Starting intel mpi tasks in torque - pbsssh wrapper In-Reply-To: <007DECE986B47F4EABF823C1FBB19C6201050094DC9A@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C6201050094DC9A@exvic-mbx04.nexus.csiro.au> Message-ID: Hi Gareth, For Intel MPI, you also have the option of using OSC mpiexec. Just use '-comm pmi', which is the same startup mechanism as MPICH2. When using newer versions of the Intel MPI library with OSC mpiexec, just make sure to have I_MPI_PMI_EXTENSIONS=on. A caveat for using OSC mpiexec is that the startup programs included with Intel MPI add extra environment variables that influences runtime behavior of the MPI program. These may be useful. We have a similar script as yours below that we use with Platform/HP MPI. See attached file. It copes with the fact that some packages use IP addresses on the rsh command line. Perhaps the correct solution is to create a proper pbsrsh that accepts the same command line flags as rsh, and is more robust. This would also allow MPI libraries such as Intel's and Platform's to have better Torque integration, with no Torque library dependencies, and without relying on external programs such as OSC mpiexec. Doug -------------- next part -------------- A non-text attachment was scrubbed... 
Name: pbsrsh
Type: application/octet-stream
Size: 2279 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120502/8b4d93b2/attachment.obj
-------------- next part --------------

At Wed, 2 May 2012 14:12:45 +1000, wrote:
>
> [1 ]
> [1.1 ]
>
> [1.2 ]
> Hi All,
>
> I just figured this out after debugging some problems with left over processes from intel mpi and you
> may find it useful
>
> Recent versions of intel mpi use a variable I_MPI_MPD_RSH to specify the program (ssh) used to launch
> the mpd processes (maybe old versions have this too). The program needs to return some info on stdout.
> I had to modify an existing pbsdsh wrapper to add the pbsdsh -o option to make it all work.
>
> The wrapper is:
>
> ~/sysadmin/ascutils/common> cat pbsssh
> #!/bin/bash
> # $Id: pbsssh 2236 2012-05-02 03:16:17Z wil240 $
> # $HeadURL: svn+ssh://stream/cs/home/svn/sysadmin/ascutils/common/pbsssh $
> usage="usage: $0 "
> #swallow -x -n and -q (for intel mpi)
> while getopts "xqn" opt
> do
>   :
> done
> shift $((OPTIND-1))
> if [ $# -lt 2 ]
> then
>   echo $usage
>   exit
> fi
> node=$1
> shift
> exec pbsdsh -o -h $node "$@"
>
> And the before and after behavior is illustrated here (note that the mpi tasks now run in a cpuset
> under the control of torque):
>
> wil240 at n001:~> mpirun -n 2 sh -c 'echo -n `hostname` | cat - /proc/$$/cpuset'
> n002/
> n001/torque/1092920.burnet-srv.idpx.hpsc.csiro.au
> wil240 at n001:~> export I_MPI_MPD_RSH=pbsssh
> wil240 at n001:~> mpirun -n 2 sh -c 'echo -n `hostname` | cat - /proc/$$/cpuset'
> n001/torque/1092920.burnet-srv.idpx.hpsc.csiro.au
> n002/torque/1092920.burnet-srv.idpx.hpsc.csiro.au
>
> We'll be updating out intel-mpi environment module to set I_MPI_MPD_RSH=pbsssh
>
> feedback is welcome.
> > Gareth > > > [2 ] > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From Gareth.Williams at csiro.au Wed May 2 06:54:30 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Wed, 2 May 2012 22:54:30 +1000 Subject: [torqueusers] Starting intel mpi tasks in torque - pbsssh wrapper In-Reply-To: <20120502120100.GA32261@blackice.msi.umn.edu> References: <007DECE986B47F4EABF823C1FBB19C6201050094DC9A@exvic-mbx04.nexus.csiro.au> <20120502120100.GA32261@blackice.msi.umn.edu> Message-ID: <007DECE986B47F4EABF823C1FBB19C62010503312AAE@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Gabe Turner [mailto:gabe at msi.umn.edu] > Sent: Wednesday, 2 May 2012 10:01 PM > To: torqueusers at supercluster.org > Subject: Re: [torqueusers] Starting intel mpi tasks in torque - pbsssh > wrapper > > On Wed, May 02, 2012 at 02:12:45PM +1000, Gareth.Williams at csiro.au > wrote: > > Hi All, > > > > I just figured this out after debugging some problems with left over > processes from intel mpi and you may find it useful > > > > Recent versions of intel mpi use a variable I_MPI_MPD_RSH to specify > the > > program (ssh) used to launch the mpd processes (maybe old versions > have > > this too). The program needs to return some info on stdout. I had > to > > modify an existing pbsdsh wrapper to add the pbsdsh -o option to make > it > > all work. > > > > Have you done much in the way of scalability testing using this > wrapper? I > wrote such a wrapper a few years ago when we were running, I think, > Torque > 2.3. I found I could not get pbsdsh to scale well beyond about 256 > nodes. > Beyond that, pbsdsh connections would begin timing out. Perhaps this > will > be more feasible with Torque 4 using tcp throughout. > > Did you consider crafting a wrapper that would ssh and spawn the mpd > using > pbs_track? 
> > > -- > Gabe Turner Hi Gabe, I haven't had to worry about scalability to 256 nodes yet. Not sure if I should be pleased about that or not. I was not aware of pbs_track! However, a quick check indicates that it will not place the spawned process in a torque managed cpuset - otherwise it seemed a great idea. The following captured info shows only the first process being in a cpuset: mpirun -n 2 sh -c 'pbs_track -j $PBS_JOBID -- cat /proc/self/cpuset' /torque/1093068.burnet-srv.idpx.hpsc.csiro.au / Maybe pbs_track could be enhanced. Or pbsdsh could be enhanced. It would be nice if the 4 series scalability changes improve pbsdsh for this use case too. Maybe someone can comment. Regards, Gareth From akohlmey at cmm.chem.upenn.edu Wed May 2 08:22:01 2012 From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer) Date: Wed, 2 May 2012 10:22:01 -0400 Subject: [torqueusers] python in qsub In-Reply-To: References: Message-ID: christina, have you checked whether the problem isn't with R instead of python? although the PYTHONPATH environment variable would be the primary suspect, but it could even by LD_LIBRARY_PATH. python is a bit strange in reporting certain kinds of error. axel. On Tue, May 1, 2012 at 1:49 PM, Christina Salls wrote: > Hi all, > > ? ? ?My configuration consists of Torque 2.5.9 running on RHEL 6 on a single > cluster. ?I have a user that is able to interactively run python and R, but > when his job is submitted to Torque, it fails with this message: > > Traceback (most recent call last): > > ? File "../progs/Kriging_Batch.py", line 110, in > > ??? import rpy2.robjects as robjects > > ImportError: No module named rpy2.robjects > > > I asked him to include a printenv in his script and compared it to the > environment variables that are inherent in his login. ?I could not tell from > the comparison what might be causing this problem. 
> > > His script looks like this: > > > [root at zeus batch]# more runkrig.sh > > #!/bin/sh > > #submit with qsub runkrig.sh > > #PBS -N GaugesJob > > #PBS -l select=1:ncpus=1 > > #PBS -q hydrology > > #PBS -M tim.hunter at noaa.gov > > source /opt/intel/composer_xe_2011_sp1.7.256/bin/compilervars.sh intel64 > > > > cd /zeus/d2/users/hunter/lbrm/kriging/batch > > > > echo 'Begin Kriging of Met Data with python and R' > > date > > python ../progs/Kriging_Batch.py $1 $2 $3 > > echo 'Kriging operations complete!' > > date > > > The output file looks like this: > > > [root at zeus batch]# more GaugesJob.o786 > > Begin Kriging of Met Data with python and R > > Fri Apr 27 13:44:42 CDT 2012 > > Kriging operations complete! > > Fri Apr 27 13:44:42 CDT 2012 > > > > The Kriging_Batch.py script works just fine interactively. ?If I run the > import command interactively, it also works. > > > eg. > > python -c "import rpy2.robjects as robjects" > > > I am sure there is a simple explanation, and if any of you have any clues to > lead me in the right direction, I would greatly appreciate it. > > > Even in the torque environment, the other python imports are working > properly. ?It only seems to be choking on the rpy2 import. 
> > > This is the one portion of the script > > > #-------------------------------------------------------------------------------------- > > import os > > from os.path import normpath > > import sys > > import shutil > > import csv > > > import rpy2.objects as robjects > > > # > > r = robjects.r > > path = os.getcwd() > > print "curdir = %s" %path > > > # > > oldPath = os.environ['PATH'].split(os.pathsep) > > newPath = os.environ['PATH'].split(os.pathsep) > > newPath = os.pathsep.join(newPath[len(oldPath):] + newPath[:len(oldPath)]) > > > # > > # Load the R functions > > # > > print "load the R functions" > > r.load('/zeus/d2/users/hunter/lbrm/kriging/progs/do_krig_batch.RData') > > > # > > # Force stdout into an unbuffered mode > > # > > sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) > > > > Any advice, clues, hints, moral support, etc.... welcomed and appreciated. > > > Thanks, > > > ? ? ? ?Christina > > > > > -- > Christina A. Salls > GLERL Computer Group > help.glerl at noaa.gov > Help Desk x2127 > Christina.Salls at noaa.gov > Voice Mail 734-741-2446 > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Dr. Axel Kohlmeyer? ? akohlmey at gmail.com http://sites.google.com/site/akohlmey/ Institute for Computational Molecular Science Temple University, Philadelphia PA, USA. From jjc at iastate.edu Wed May 2 09:28:16 2012 From: jjc at iastate.edu (Coyle, James J [ITACD]) Date: Wed, 2 May 2012 15:28:16 +0000 Subject: [torqueusers] python in qsub In-Reply-To: References: Message-ID: <242421BFAF465844BE24EB90BB97E221090905F3@ITSDAG1D.its.iastate.edu> Christina, I am going to assume that by interactively, you mean just by running the script or the command rather than using qsub -I If it is the latter, the comments below will not apply. 
Assuming that your site's policies allow you to run as the user in question: (Mine do as long as I am acting on a request for help from the user, though I always ask if it OK first.) Try ssh'ing into the compute node that had the problem as the user in question, then run interactively there, you should see the same problem. As root, you can find the node that ran the job by issuing tracejob -n 7 786 if the job was run with the last 7 days. The node on which the job launched would be the first node in the nodelist. Also as root on the head node, you can issue su - tim.hunter to become the user on the head node, then ssh into the node in question as the user, and try the script interactively. As Alex said, it could be a ldconfig or LD_LIBRARY_PATH issue, or you could have variables set in the /etc/profile.d scripts or /etc/login.csh of the user is a csh user. (Is the home directory shared so that any rc files are run the same on all nodes.) As a workaround, you could see if qsub -V works since that will pass the environment. If it does, then some env variables are being set by the user or by a script that runs only on the head node. Also, is /bin/sh the user's normal login shell? >-----Original Message----- >From: torqueusers-bounces at supercluster.org [mailto:torqueusers- >bounces at supercluster.org] On Behalf Of Axel Kohlmeyer >Sent: Wednesday, May 02, 2012 9:22 AM >To: Torque Users Mailing List >Subject: Re: [torqueusers] python in qsub > >christina, > >have you checked whether the problem isn't with R instead of python? >although the PYTHONPATH environment variable would be the primary >suspect, but it could even by LD_LIBRARY_PATH. python is a bit >strange in reporting certain kinds of error. > >axel. > >On Tue, May 1, 2012 at 1:49 PM, Christina Salls > wrote: >> Hi all, >> >> ? ? ?My configuration consists of Torque 2.5.9 running on RHEL 6 >on a >> single cluster. 
?I have a user that is able to interactively run >> python and R, but when his job is submitted to Torque, it fails >with this message: >> >> Traceback (most recent call last): >> >> ? File "../progs/Kriging_Batch.py", line 110, in >> >> ??? import rpy2.robjects as robjects >> >> ImportError: No module named rpy2.robjects >> >> >> I asked him to include a printenv in his script and compared it to >the >> environment variables that are inherent in his login. ?I could not >> tell from the comparison what might be causing this problem. >> >> >> His script looks like this: >> >> >> [root at zeus batch]# more runkrig.sh >> >> #!/bin/sh >> >> #submit with qsub runkrig.sh >> >> #PBS -N GaugesJob >> >> #PBS -l select=1:ncpus=1 >> >> #PBS -q hydrology >> >> #PBS -M tim.hunter at noaa.gov >> >> source /opt/intel/composer_xe_2011_sp1.7.256/bin/compilervars.sh >> intel64 >> >> >> >> cd /zeus/d2/users/hunter/lbrm/kriging/batch >> >> >> >> echo 'Begin Kriging of Met Data with python and R' >> >> date >> >> python ../progs/Kriging_Batch.py $1 $2 $3 >> >> echo 'Kriging operations complete!' >> >> date >> >> >> The output file looks like this: >> >> >> [root at zeus batch]# more GaugesJob.o786 >> >> Begin Kriging of Met Data with python and R >> >> Fri Apr 27 13:44:42 CDT 2012 >> >> Kriging operations complete! >> >> Fri Apr 27 13:44:42 CDT 2012 >> >> >> >> The Kriging_Batch.py script works just fine interactively. ?If I >run >> the import command interactively, it also works. >> >> >> eg. >> >> python -c "import rpy2.robjects as robjects" >> >> >> I am sure there is a simple explanation, and if any of you have >any >> clues to lead me in the right direction, I would greatly >appreciate it. >> >> >> Even in the torque environment, the other python imports are >working >> properly. ?It only seems to be choking on the rpy2 import. 
>> >> >> This is the one portion of the script >> >> >> #----------------------------------------------------------------- >---- >> ----------------- >> >> import os >> >> from os.path import normpath >> >> import sys >> >> import shutil >> >> import csv >> >> >> import rpy2.objects as robjects >> >> >> # >> >> r = robjects.r >> >> path = os.getcwd() >> >> print "curdir = %s" %path >> >> >> # >> >> oldPath = os.environ['PATH'].split(os.pathsep) >> >> newPath = os.environ['PATH'].split(os.pathsep) >> >> newPath = os.pathsep.join(newPath[len(oldPath):] + >> newPath[:len(oldPath)]) >> >> >> # >> >> # Load the R functions >> >> # >> >> print "load the R functions" >> >> >r.load('/zeus/d2/users/hunter/lbrm/kriging/progs/do_krig_batch.RData >') >> >> >> # >> >> # Force stdout into an unbuffered mode >> >> # >> >> sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) >> >> >> >> Any advice, clues, hints, moral support, etc.... welcomed and >appreciated. >> >> >> Thanks, >> >> >> ? ? ? ?Christina >> >> >> >> >> -- >> Christina A. Salls >> GLERL Computer Group >> help.glerl at noaa.gov >> Help Desk x2127 >> Christina.Salls at noaa.gov >> Voice Mail 734-741-2446 >> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > > >-- >Dr. Axel Kohlmeyer? ? akohlmey at gmail.com >http://sites.google.com/site/akohlmey/ > >Institute for Computational Molecular Science Temple University, >Philadelphia PA, USA. >_______________________________________________ >torqueusers mailing list >torqueusers at supercluster.org >http://www.supercluster.org/mailman/listinfo/torqueusers From dbeer at adaptivecomputing.com Wed May 2 09:51:20 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Wed, 2 May 2012 09:51:20 -0600 Subject: [torqueusers] Configure pbs_server to listen to all interfaces (torque 4)? 
In-Reply-To: <2653322.xsCZMkWFhs@newton.cc.uit.no> References: <1367310.7tLNPAkoDo@lux> <5100794.TS57uori0o@lux> <1475541.MNZxbqLmuT@newton.cc.uit.no> <2653322.xsCZMkWFhs@newton.cc.uit.no> Message-ID:

On Wed, May 2, 2012 at 6:09 AM, Roy Dragseth wrote: > On Wednesday, May 02, 2012 11:35:17 Roy Dragseth wrote: > > > > [marve at hpc1 ~]$ echo uname -a | qsub -lnodes=1,walltime=100 > qsub: submit error (Bad UID for job execution MSG=ruserok failed validating > marve/marve from hpc1.cc.uit.no) > > > > Ok, I was barking up the wrong tree here. The error message came from > pbs_server not understanding that hpc1.local and hpc1.cc.uit.no is the > same > host. > > Setting server_name to hpc1.local and putting hpc1.cc.uit.no into > /etc/hosts.equiv seems to do the trick. > > So to summarize, here is what seems to be needed to get Torque 4 going on a > multihomed cluster frontend where hostname resolves to the public > interface. > > System config: > Rocks (aka CentOS6) > eth1, public network. FQDN = hpc1.cc.uit.no > eth0, private network. Name = hpc1.local > > pbs_server side setup: > Server hostname is FQDN, hpc1.cc.uit.no > Start server with > pbs_server -H hpc1.local > > Do normal config stuff + > qmgr -c "set server server_name = hpc1.local" > (this seems to be needed for pbs_mom to know where to send back status > etc.) > > Add hpc1.cc.uit.no to /etc/hosts.equiv > > pbs_mom side setup > set $pbsserver to hpc1.local in mom_priv/config > > That's about it. > > IMO, the hosts.equiv stuff is rather ugly, so any hints on how to avoid > that is > greatly appreciated. >

I'm glad that you got that issue resolved. We completely agree with not liking hosts.equiv, but for now it's a necessary evil. We are hoping to add the option of using ssh keys in the near future and are looking at other ways to not use ruserok() and its companion functions.

David

> > (hopefully this will be today's last reply to my own post...) > > r.
> > -- > > The Computer Center, University of Tromsø, N-9037 TROMSØ Norway. > phone:+47 77 64 41 07, fax:+47 77 64 41 00 > Roy Dragseth, Team Leader, High Performance Computing > Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers >

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120502/dfd2e1c3/attachment.html

From gus at ldeo.columbia.edu Wed May 2 16:37:22 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 02 May 2012 18:37:22 -0400 Subject: [torqueusers] python in qsub In-Reply-To: <242421BFAF465844BE24EB90BB97E221090905F3@ITSDAG1D.its.iastate.edu> References: <242421BFAF465844BE24EB90BB97E221090905F3@ITSDAG1D.its.iastate.edu> Message-ID: <4FA1B722.2090606@ldeo.columbia.edu>

Hi Christina

This may not fix the problem, but this line on the job script surprised me:

#PBS -l select=1:ncpus=1

Why not

#PBS -l nodes=1:ppn=1

[assuming this is a serial/non-parallel job]?

As Axel and James suggested, there may be an issue with the user environment. It may be useful to do 'env' at the head node command prompt, and also inside the job script, to see if they match. If something important is missing, a simple fix is to add all needed environment variables to the user's .profile/.tcshrc file, but there are other ways [like environment modules]. Assuming the home directory is shared on the nodes, these files should be sourced when the job is launched. Or to use qsub -V as James suggested.
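
A minimal sketch of that comparison, assuming the output of 'env' was saved once at the head node prompt and once from inside the job script. The function names are made up for illustration, and values that span several lines are not handled:

```python
def parse_env(text):
    """Turn `env` output (one NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:  # skip anything that is not NAME=value
            env[name] = value
    return env


def diff_env(login_text, job_text):
    """Return (missing, changed): names absent from, or different in, the job."""
    login, job = parse_env(login_text), parse_env(job_text)
    missing = sorted(set(login) - set(job))
    changed = sorted(k for k in set(login) & set(job) if login[k] != job[k])
    return missing, changed
```

Feeding it the contents of the two saved dumps gives the variables that exist only in the login session and the ones whose values differ inside the job. PYTHONPATH or LD_LIBRARY_PATH turning up in either list would be the first thing to look at.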
Gus Correa On 05/02/2012 11:28 AM, Coyle, James J [ITACD] wrote: > Christina, > > I am going to assume that by interactively, you mean > just by running the script or the command rather than > using qsub -I > If it is the latter, the comments below will not apply. > > > Assuming that your site's policies allow you to run as the user in question: (Mine do as long as I am acting on a > request for help from the user, though I always ask if > it OK first.) > > Try ssh'ing into the compute node that had the problem > as the user in question, then run interactively there, you should see the same problem. > > As root, you can find the node that ran the job by issuing > tracejob -n 7 786 > if the job was run with the last 7 days. The node > on which the job launched would be the first node > in the nodelist. > > Also as root on the head node, you can issue > su - tim.hunter > > to become the user on the head node, then ssh into the node in question as the user, and try the script interactively. > > > As Alex said, it could be a ldconfig or LD_LIBRARY_PATH > issue, or you could have variables set in the /etc/profile.d > scripts or /etc/login.csh of the user is a csh user. > (Is the home directory shared so that any rc files are > run the same on all nodes.) > > As a workaround, you could see if qsub -V works since that > will pass the environment. If it does, then some env variables are being set by the user or by a script that > runs only on the head node. > > > > > > Also, is /bin/sh the user's normal login shell? > >> -----Original Message----- >> From: torqueusers-bounces at supercluster.org [mailto:torqueusers- >> bounces at supercluster.org] On Behalf Of Axel Kohlmeyer >> Sent: Wednesday, May 02, 2012 9:22 AM >> To: Torque Users Mailing List >> Subject: Re: [torqueusers] python in qsub >> >> christina, >> >> have you checked whether the problem isn't with R instead of python? 
>> although the PYTHONPATH environment variable would be the primary >> suspect, but it could even by LD_LIBRARY_PATH. python is a bit >> strange in reporting certain kinds of error. >> >> axel. >> >> On Tue, May 1, 2012 at 1:49 PM, Christina Salls >> wrote: >>> Hi all, >>> >>> My configuration consists of Torque 2.5.9 running on RHEL 6 >> on a >>> single cluster. I have a user that is able to interactively run >>> python and R, but when his job is submitted to Torque, it fails >> with this message: >>> >>> Traceback (most recent call last): >>> >>> File "../progs/Kriging_Batch.py", line 110, in >>> >>> import rpy2.robjects as robjects >>> >>> ImportError: No module named rpy2.robjects >>> >>> >>> I asked him to include a printenv in his script and compared it to >> the >>> environment variables that are inherent in his login. I could not >>> tell from the comparison what might be causing this problem. >>> >>> >>> His script looks like this: >>> >>> >>> [root at zeus batch]# more runkrig.sh >>> >>> #!/bin/sh >>> >>> #submit with qsub runkrig.sh >>> >>> #PBS -N GaugesJob >>> >>> #PBS -l select=1:ncpus=1 >>> >>> #PBS -q hydrology >>> >>> #PBS -M tim.hunter at noaa.gov >>> >>> source /opt/intel/composer_xe_2011_sp1.7.256/bin/compilervars.sh >>> intel64 >>> >>> >>> >>> cd /zeus/d2/users/hunter/lbrm/kriging/batch >>> >>> >>> >>> echo 'Begin Kriging of Met Data with python and R' >>> >>> date >>> >>> python ../progs/Kriging_Batch.py $1 $2 $3 >>> >>> echo 'Kriging operations complete!' >>> >>> date >>> >>> >>> The output file looks like this: >>> >>> >>> [root at zeus batch]# more GaugesJob.o786 >>> >>> Begin Kriging of Met Data with python and R >>> >>> Fri Apr 27 13:44:42 CDT 2012 >>> >>> Kriging operations complete! >>> >>> Fri Apr 27 13:44:42 CDT 2012 >>> >>> >>> >>> The Kriging_Batch.py script works just fine interactively. If I >> run >>> the import command interactively, it also works. >>> >>> >>> eg. 
>>> >>> python -c "import rpy2.robjects as robjects" >>> >>> >>> I am sure there is a simple explanation, and if any of you have >> any >>> clues to lead me in the right direction, I would greatly >> appreciate it. >>> >>> >>> Even in the torque environment, the other python imports are >> working >>> properly. It only seems to be choking on the rpy2 import. >>> >>> >>> This is the one portion of the script >>> >>> >>> #----------------------------------------------------------------- >> ---- >>> ----------------- >>> >>> import os >>> >>> from os.path import normpath >>> >>> import sys >>> >>> import shutil >>> >>> import csv >>> >>> >>> import rpy2.objects as robjects >>> >>> >>> # >>> >>> r = robjects.r >>> >>> path = os.getcwd() >>> >>> print "curdir = %s" %path >>> >>> >>> # >>> >>> oldPath = os.environ['PATH'].split(os.pathsep) >>> >>> newPath = os.environ['PATH'].split(os.pathsep) >>> >>> newPath = os.pathsep.join(newPath[len(oldPath):] + >>> newPath[:len(oldPath)]) >>> >>> >>> # >>> >>> # Load the R functions >>> >>> # >>> >>> print "load the R functions" >>> >>> >> r.load('/zeus/d2/users/hunter/lbrm/kriging/progs/do_krig_batch.RData >> ') >>> >>> >>> # >>> >>> # Force stdout into an unbuffered mode >>> >>> # >>> >>> sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) >>> >>> >>> >>> Any advice, clues, hints, moral support, etc.... welcomed and >> appreciated. >>> >>> >>> Thanks, >>> >>> >>> Christina >>> >>> >>> >>> >>> -- >>> Christina A. Salls >>> GLERL Computer Group >>> help.glerl at noaa.gov >>> Help Desk x2127 >>> Christina.Salls at noaa.gov >>> Voice Mail 734-741-2446 >>> >>> >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >> >> >> >> -- >> Dr. 
Axel Kohlmeyer akohlmey at gmail.com >> http://sites.google.com/site/akohlmey/ >> >> Institute for Computational Molecular Science Temple University, >> Philadelphia PA, USA. >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From joel.schaerer at gmail.com Wed May 2 02:32:05 2012 From: joel.schaerer at gmail.com (Joel Schaerer) Date: Wed, 02 May 2012 10:32:05 +0200 Subject: [torqueusers] Reserving memory for a job Message-ID: <4FA0F105.1080807@gmail.com> Hello all, I have the following problem: some of my jobs tend to use a large amount of RAM. I don't want my nodes to start swapping, so the batch manager should avoid running too many of these jobs on one node, independently of the number of cores available. As an example, if "zalasta" has 16GB of RAM, only two of the following jobs should run concurrently, even if the machine has 8 cores: echo "sleep 120" | qsub -N test -lnodes=zalasta,mem=16gb Unfortunately, torque will run 8 jobs concurrently, even if there is not enough memory. So, is there a way to treat available RAM as a resource with PBS/Torque? Or should I approach the problem in another way? Thanks, joel PS: I have a qsub version 4.0.0, if that matters. From s.breedveld at erasmusmc.nl Wed May 2 07:17:38 2012 From: s.breedveld at erasmusmc.nl (Sebastiaan Breedveld) Date: Wed, 02 May 2012 15:17:38 +0200 Subject: [torqueusers] Simple Torque+Maui setup: jobs stay queued, no resources Message-ID: <4FA133F2.80303@erasmusmc.nl> Dear list, This may be a Maui issue, but the Maui list seems dead :( I am trying to setup a very basic Torque+Maui system. I am running a Torque cluster for a year now, and wanted to improve the scheduling with Maui. 
To this end, I installed a fresh test-system, with server and node on a single computer.

Torque version: 2.4.16
Maui version: 3.3.1
uname: Linux testing 3.2.0-20-generic #33-Ubuntu SMP Tue Mar 27 16:42:26 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

I was able to run (simple) jobs with the Torque scheduler. When I replaced the scheduler with Maui, jobs stay queued. Jobs are submitted by:

$ qsub -q batch test-script.sh

where test-script.sh is nothing more than a 'sleep 1m' script.

Checking the job:

# checkjob -v 55
checking job 55 (RM job '55.testing.azr.nl')
State: Idle EState: Deferred
Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT
WallTime: 00:00:00 of 6:00:00
SubmitTime: Thu Apr 5 13:21:33
(Time Queued Total: 00:00:32 Eligible: 00:00:01)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 15G
Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1 MEM: 2000M SWAP: 15G
NodeAccess: SHARED
NodeCount: 1
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
job is deferred. Reason: NoResources (cannot create reservation for job '55' (intital reservation attempt) )
Holds: Defer (hold reason: NoResources)
PE: 16.03 StartPriority: 1
cannot select job 55 for partition DEFAULT (job hold active)

shows that there are no resources available.
The node is free, and unloaded: # checknode testing checking node testing.azr.nl State: Idle (in current state for 2:23:54) Configured Resources: PROCS: 2 MEM: 984M SWAP: 1996M DISK: 1M Utilized Resources: SWAP: 149M Dedicated Resources: [NONE] Opsys: linux Arch: [NONE] Speed: 1.00 Load: 0.050 Network: [DEFAULT] Features: [NONE] Attributes: [Batch] Classes: [batch 2:2] Total Time: 16:11:49 Up: 16:11:49 (100.00%) Active: 00:01:00 (0.10%) Reservations: NOTE: no reservations on node When the job is added, maui.log shows this: 04/05 13:21:34 MPBSJobLoad(55,55.testing.azr.nl,J,TaskList,0) 04/05 13:21:34 MReqCreate(55,SrcRQ,DstRQ,DoCreate) 04/05 13:21:34 INFO: processing node request line '1' 04/05 13:21:34 MJobSetCreds(55,sebastiaan,sebastiaan,) 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) 04/05 13:21:34 INFO: job '55' loaded: 1 sebastiaan sebastiaan 21600 Idle 0 1333624893 [NONE] [NONE] [NONE] >= 0 >= 0 [1][ppn=1] 1333624894 04/05 13:21:34 INFO: 12 PBS jobs detected on RM TESTING 04/05 13:21:34 INFO: jobs detected: 12 04/05 13:21:34 MStatClearUsage(node,Active) 04/05 13:21:34 MClusterUpdateNodeState() 04/05 13:21:34 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg) 04/05 13:21:34 INFO: job '40' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '41' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '42' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '44' Priority: 22 04/05 13:21:34 INFO: 
Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '45' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '47' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '48' Priority: 16 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '49' Priority: 12 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '52' Priority: 8 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '53' Priority: 1 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '54' Priority: 60 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '55' Priority: 1 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 MStatClearUsage([NONE],Active) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11] 04/05 13:21:34 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg) 04/05 13:21:34 INFO: job '40' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 
22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '41' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '42' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '44' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '45' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '47' Priority: 22 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '48' Priority: 16 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '49' Priority: 12 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '52' Priority: 8 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '53' Priority: 1 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '54' Priority: 60 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 INFO: job '55' Priority: 1 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0) 04/05 13:21:34 MStatClearUsage([NONE],Idle) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 
13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 MResDestroy(NULL) 04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11] 04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE) 04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1 04/05 13:21:34 MQueueScheduleRJobs(Q) 04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) 04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1 04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) 04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 1/1 04/05 13:21:34 MQueueScheduleIJobs(Q,DEFAULT) 04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in partition DEFAULT (1 Needed) 04/05 13:21:34 MJobPReserve(55,DEFAULT,ResCount,ResCountRej) 04/05 13:21:34 MJobReserve(55,Priority) 04/05 13:21:34 ALERT: job 55 cannot run in any partition 04/05 13:21:34 ALERT: cannot create new reservation for job 55 (shape[1] 1) 04/05 13:21:34 ALERT: cannot create new reservation for job 55 04/05 13:21:34 MJobSetHold(55,16,1:00:00,NoResources,cannot create reservation for job '55' (intital reservation attempt) ) 04/05 13:21:34 ALERT: job '55' cannot run (deferring job for 3600 seconds) 04/05 13:21:34 WARNING: cannot reserve priority job '55' Active Jobs------ ------------------ 04/05 13:21:34 INFO: resources available after scheduling: N: 1 P: 2 04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) 04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 0/1 [EState: 1] 04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE) 04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 [EState: 1] 04/05 13:21:34 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) 04/05 13:21:34 INFO: total 
jobs selected in partition ALL: 0/1 [EState: 1] 04/05 13:21:34 MSchedUpdateStats() 04/05 13:21:34 INFO: iteration: 288 scheduling time: 0.002 seconds 04/05 13:21:34 MResUpdateStats() 04/05 13:21:34 INFO: current util[288]: 0/1 (0.00%) PH: 0.00% active jobs: 0 of 2 (completed: 1) 04/05 13:21:34 MQueueCheckStatus() 04/05 13:21:34 MNodeCheckStatus() 04/05 13:21:34 MUClearChild(PID) 04/05 13:21:34 INFO: scheduling complete. sleeping 30 seconds I think the relevant line is: 04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in partition DEFAULT (1 Needed) but I have no idea how to make a feasible task for the job. I have tried queueing with -l nodes=1:ppn=1 -l walltime=2:00:00, etc. but none seem to have had effect. Torque config. I have tried setting different attributes to the queue properties, hoping that it would have some effect: # qmgr -c "p s" # # Create queues and set their attributes. # # # Create and define queue batch # create queue batch set queue batch queue_type = Execution set queue batch Priority = 20 set queue batch max_running = 8 set queue batch resources_max.ncpus = 8 set queue batch resources_max.nodect = 10 set queue batch resources_max.nodes = 2 set queue batch resources_min.ncpus = 0 set queue batch resources_default.mem = 2000mb set queue batch resources_default.ncpus = 1 set queue batch resources_default.neednodes = 1:ppn=1 set queue batch resources_default.nodect = 1 set queue batch resources_default.nodes = 1 set queue batch resources_default.pvmem = 16000mb set queue batch resources_default.walltime = 06:00:00 set queue batch enabled = True set queue batch started = True # # Set server attributes. 
# set server scheduling = True set server acl_hosts = testing.azr.nl set server log_events = 511 set server mail_from = adm set server resources_available.nodect = 10 set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 6 set server next_job_number = 56 Maui configuration, untouched: # maui.cfg 3.3.1 SERVERHOST testing # primary admin must be first in list ADMIN1 root # Resource Manager Definition RMCFG[TESTING] TYPE=PBS # Allocation Manager Definition AMCFG[bank] TYPE=NONE # full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html # use the 'schedctl -l' command to display current configuration RMPOLLINTERVAL 00:00:30 SERVERPORT 42559 SERVERMODE NORMAL # Admin: http://supercluster.org/mauidocs/a.esecurity.html LOGFILE maui.log LOGFILEMAXSIZE 10000000 LOGLEVEL 3 # Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html QUEUETIMEWEIGHT 1 # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html #FSPOLICY PSDEDICATED #FSDEPTH 7 #FSINTERVAL 86400 #FSDECAY 0.80 # Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html # NONE SPECIFIED # Backfill: http://supercluster.org/mauidocs/8.2backfill.html BACKFILLPOLICY FIRSTFIT RESERVATIONPOLICY CURRENTHIGHEST # Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html NODEALLOCATIONPOLICY MINRESOURCE # QOS: http://supercluster.org/mauidocs/7.3qos.html # QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE # Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html # SRSTARTTIME[test] 8:00:00 # SRENDTIME[test] 17:00:00 # SRDAYS[test] MON TUE WED THU FRI # SRTASKCOUNT[test] 20 # SRMAXTIME[test] 0:30:00 # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html # USERCFG[DEFAULT] FSTARGET=25.0 # USERCFG[john] PRIORITY=100 FSTARGET=10.0- # GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi # CLASSCFG[batch] 
FLAGS=PREEMPTEE # CLASSCFG[interactive] FLAGS=PREEMPTOR Any ideas? Thanks in advance, Sebastiaan -- Sebastiaan Breedveld, MSc. Ph.D. student Erasmus MC - Daniel den Hoed Cancer Center Department of Radiation Oncology Groene Hilledijk 301 3075 EA Rotterdam The Netherlands Phone: +31 10 7042693 Room: Gs-20 From Dan.Kulinski at GlobalGeophysical.com Wed May 2 15:45:16 2012 From: Dan.Kulinski at GlobalGeophysical.com (Dan Kulinski) Date: Wed, 2 May 2012 21:45:16 +0000 Subject: [torqueusers] Job not receiving class in maui Message-ID: We have a few queues defined (30 to be exact) and we have strange behavior occurring on the newer queues we have created. When we submit the job and it is sent to maui there is no class defined on these newer queues. The reason we have so many is due to software that interfaces with the queues not being able to use long queue names when querying for status. So I have two queue configurations. They are exactly the same. When I submit to one, I get the proper node selection based on the resources_default.neednodes. On the other I do not. Here are the configurations: Working queue: # # Create queues and set their attributes. # # # Create and define queue largeb_multiple # create queue largeb_multiple set queue largeb_multiple queue_type = Execution set queue largeb_multiple resources_default.neednodes = 8P32M set queue largeb_multiple enabled = True set queue largeb_multiple started = True Non-working queue: # # Create queues and set their attributes. 
#
#
# Create and define queue lgB_multi
#
create queue lgB_multi
set queue lgB_multi queue_type = Execution
set queue lgB_multi resources_default.neednodes = 8P32M
set queue lgB_multi enabled = True
set queue lgB_multi started = True

When I send a job to the queue this is what I see (via the diagnose command):

Name   State    Par  Proc  QOS  WCLimit   R  Min  User  Group  Account  QueuedTime  Network  Opsys   Arch    Mem  Disk  Procs  Class                Features
32570  Running  DEF  8     DEF  00:05:00  1  8    dank  wheel  -        00:00:03    [NONE]   [NONE]  [NONE]  >=0  >=0   NC0    [largeb_multiple:1]  [8P32M]
32571  Running  DEF  2     DEF  00:05:00  1  2    dank  wheel  -        00:00:01    [NONE]   [NONE]  [NONE]  >=0  >=0   NC0    [NONE]               [NONE]

As you can see, the job that is running correctly has the correct class. The other job shows up with NONE and features of NONE. Has anyone else run into this? Any ideas?

Thanks,
Dan Kulinski
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120502/04ca2d71/attachment.html

From sm4082 at nyu.edu Wed May 2 19:34:00 2012 From: sm4082 at nyu.edu (Sreedhar M) Date: Wed, 2 May 2012 21:34:00 -0400 Subject: [torqueusers] Reserving memory for a job In-Reply-To: <4FA0F105.1080807@gmail.com> References: <4FA0F105.1080807@gmail.com> Message-ID:

Hi Joel,

I don't know whether this is doable with torque but it is with Moab.

NODEAVAILABILITYPOLICY DEDICATED:PROC DEDICATED:MEM

This statement in the moab.cfg file makes sure that jobs are fit on the same node by both memory as well as cores. I give some default memory for each job, in case the user doesn't declare any, through a qsub wrapper.

If you are not using Moab I don't know whether there is a simple solution through torque. Before I did this through Moab, what I did was to change ppn number based on the memory request and memory present on the node using qsub wrapper.
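
A rough sketch of that kind of ppn adjustment, under the assumption that the wrapper simply requests enough cores to cover the job's share of the node's memory. The function name and its parameters are hypothetical, not part of any Torque or Moab interface:

```python
def ppn_for_memory(mem_request_mb, node_mem_mb, node_cores):
    """Cores to request so that cores in use track memory in use on a node."""
    # ceil(mem_request / (node_mem / node_cores)), done in integer arithmetic
    ppn = (mem_request_mb * node_cores + node_mem_mb - 1) // node_mem_mb
    # never ask for fewer than one core or more than the node has
    return max(1, min(ppn, node_cores))
```

With a 16 GB, 8-core node, an 8 GB request maps to ppn=4, so at most two such jobs fit on the node at once.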
The only problem with this is that if a user requests way more than what he/she needs then we pretty much end up with idle cores, thereby wasting resources. Even though it's a similar situation with Moab, I believe it handles the stuff better. Moreover, with Moab you can notify users when their memory usage reaches a certain percentage of what they declared.

Sreedhar Manchu
HPC Support Specialist
ITS-Esystems/Research Services
New York University, NY - 100012

On May 2, 2012, at 4:32 AM, Joel Schaerer wrote: > Hello all, > > I have the following problem: some of my jobs tend to use a large amount > of RAM. I don't want my nodes to start swapping, so the batch manager > should avoid running too many of these jobs on one node, independently > of the number of cores available. As an example, if "zalasta" has 16GB > of RAM, only two of the following jobs should run concurrently, even if > the machine has 8 cores: > > echo "sleep 120" | qsub -N test -lnodes=zalasta,mem=16gb > > Unfortunately, torque will run 8 jobs concurrently, even if there is not > enough memory. > > So, is there a way to treat available RAM as a resource with PBS/Torque? > Or should I approach the problem in another way? > > Thanks, > > joel > > PS: I have a qsub version 4.0.0, if that matters. > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers

From sm4082 at nyu.edu Wed May 2 20:21:45 2012 From: sm4082 at nyu.edu (Sreedhar M) Date: Wed, 2 May 2012 22:21:45 -0400 Subject: [torqueusers] Reserving memory for a job In-Reply-To: <4FA0F105.1080807@gmail.com> References: <4FA0F105.1080807@gmail.com> Message-ID: <28655B5E-36FC-4703-A336-955667E848CE@nyu.edu>

Try this.

qmgr -c 'set queue serial resources_default.pmem = 896mb'

where serial is the queue name and I gave 896mb for each core. Total number of cores is 8 and total memory on the node is 7.8GB.
In this case, if you do #PBS -l nodes=1:ppn=1,mem=3584mb and submit three jobs it looks like only two jobs are running on the node. But I'm not sure whether scheduler is doing it or just torque. I think it's worth trying. For your case, you might want to give 15*1024/8 mb for each core. Sreedhar Manchu HPC Support Specialist ITS-Esystems/Research Services New York University, NY - 100012 On May 2, 2012, at 4:32 AM, Joel Schaerer wrote: > Hello all, > > I have the following problem: some of my jobs tend to use a large amount > of RAM. I don't want my nodes to start swapping, so the batch manager > should avoid running too many of these jobs on one node, independently > of the number of cores available. As an example, if "zalasta" has 16GB > of RAM, only two of the following jobs should run concurrently, even if > the machine has 8 cores: > > echo "sleep 120" | qsub -N test -lnodes=zalasta,mem=16gb > > Unfortunately, torque will run 8 jobs concurrently, even if there is not > enough memory. > > So, is there a way to treat available RAM as a resource with PBS/Torque? > Or should I approach the problem in another way? > > Thanks, > > joel > > PS: I have a qsub version 4.0.0, if that matters. 
> _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From Gareth.Williams at csiro.au Wed May 2 22:17:21 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Thu, 3 May 2012 14:17:21 +1000 Subject: [torqueusers] Starting intel mpi tasks in torque - pbsssh wrapper In-Reply-To: References: <007DECE986B47F4EABF823C1FBB19C6201050094DC9A@exvic-mbx04.nexus.csiro.au> Message-ID: <007DECE986B47F4EABF823C1FBB19C62010503312ABB@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Doug Johnson [mailto:djohnson at osc.edu] > Sent: Wednesday, 2 May 2012 10:16 PM > To: Torque Users Mailing List > Subject: Re: [torqueusers] Starting intel mpi tasks in torque - pbsssh > wrapper > > Hi Gareth, > > For Intel MPI, you also have the option of using OSC mpiexec. Just use > '-comm pmi', which is the same startup mechanism as MPICH2. When using > newer versions of the Intel MPI library with OSC mpiexec, just make > sure to have I_MPI_PMI_EXTENSIONS=on. > > A caveat for using OSC mpiexec is that the startup programs included > with Intel MPI add extra environment variables that influences runtime > behavior of the MPI program. These may be useful. > > We have a similar script as yours below that we use with Platform/HP > MPI. See attached file. It copes with the fact that some packages use > IP addresses on the rsh command line. > > Perhaps the correct solution is to create a proper pbsrsh that accepts > the same command line flags as rsh, and is more robust. This would > also allow MPI libraries such as Intel's and Platform's to have better > Torque integration, with no Torque library dependencies, and without > relying on external programs such as OSC mpiexec. > > Doug Thanks Doug, I hadn't realized that OSC mpiexec could work with Intel MPI. We mostly use OpenMPI and don't need the OSC mpiexec. 
Your pbsrsh script will be better tested than ours so I think we will move to using it and then on to OSC mpiexec if we have to. Thanks for sharing - it is good to have options and I'm sure others will appreciate the info (and code!) too. I'll try and get around to asking Intel to incorporate such a solution or information in their product - but will have to dig out my Intel customer credentials first... Gareth From samuel at unimelb.edu.au Thu May 3 02:22:07 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Thu, 03 May 2012 18:22:07 +1000 Subject: [torqueusers] Starting intel mpi tasks in torque - pbsssh wrapper In-Reply-To: <007DECE986B47F4EABF823C1FBB19C62010503312ABB@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C6201050094DC9A@exvic-mbx04.nexus.csiro.au> <007DECE986B47F4EABF823C1FBB19C62010503312ABB@exvic-mbx04.nexus.csiro.au> Message-ID: <4FA2402F.7060708@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 03/05/12 14:17, Gareth.Williams at csiro.au wrote: > I'll try and get around to asking Intel to incorporate such a > solution or information in their product I've raised this a number of times with Intel over the years, to no avail (always said they'd look at it, but I don't think anything has come of it). I wouldn't hold your breath.. 
- --
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk+iQC8ACgkQO2KABBYQAh9AUwCfZSEoWTBDnTmsm5DIHKCjipuC
K7QAn3CEHXgykkqxt725GMBmWDdtvWGk
=Emuo
-----END PGP SIGNATURE-----
From roy.dragseth at cc.uit.no Thu May 3 05:51:32 2012
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Thu, 03 May 2012 13:51:32 +0200
Subject: [torqueusers] Simple Torque+Maui setup: jobs stay queued, no resources
In-Reply-To: <4FA133F2.80303@erasmusmc.nl>
References: <4FA133F2.80303@erasmusmc.nl>
Message-ID: <2445104.utCSrFlz9W@newton.cc.uit.no>

It seems like you are trying to submit a job which requires 15 GB pvmem on a node that has less than 3 GB total virtual memory. Maui also considers specified memory parameters when scheduling jobs.

r.

On Wednesday, May 02, 2012 15:17:38 Sebastiaan Breedveld wrote:
> Dear list,
>
> This may be a Maui issue, but the Maui list seems dead :(
>
>
> I am trying to set up a very basic Torque+Maui system. I have been running a
> Torque cluster for a year now, and wanted to improve the scheduling with
> Maui. To this end, I installed a fresh test-system, with server and node
> on a single computer.
>
> Torque version: 2.4.16
> Maui version: 3.3.1
> uname: Linux testing 3.2.0-20-generic #33-Ubuntu SMP Tue Mar 27 16:42:26
> UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
>
>
>
> I was able to run (simple) jobs with the Torque scheduler. When I
> replaced the scheduler with Maui, jobs stay queued. Jobs are submitted by:
>
> $ qsub -q batch test-script.sh
>
> where test-script.sh is nothing more than a 'sleep 1m' script.
Checking > the job: > > # checkjob -v 55 > checking job 55 (RM job '55.testing.azr.nl') > > State: Idle EState: Deferred > Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT > WallTime: 00:00:00 of 6:00:00 > SubmitTime: Thu Apr 5 13:21:33 > (Time Queued Total: 00:00:32 Eligible: 00:00:01) > > Total Tasks: 1 > > Req[0] TaskCount: 1 Partition: ALL > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 15G > Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] > Exec: '' ExecSize: 0 ImageSize: 0 > Dedicated Resources Per Task: PROCS: 1 MEM: 2000M SWAP: 15G > NodeAccess: SHARED > NodeCount: 1 > > > IWD: [NONE] Executable: [NONE] > Bypass: 0 StartCount: 0 > PartitionMask: [ALL] > Flags: RESTARTABLE > > job is deferred. Reason: NoResources (cannot create reservation for > job '55' (intital reservation attempt) > ) > Holds: Defer (hold reason: NoResources) > PE: 16.03 StartPriority: 1 > cannot select job 55 for partition DEFAULT (job hold active) > > > > show that there are no resources available. 
The node is free, and unloaded: > > # checknode testing > > > checking node testing.azr.nl > > State: Idle (in current state for 2:23:54) > Configured Resources: PROCS: 2 MEM: 984M SWAP: 1996M DISK: 1M > Utilized Resources: SWAP: 149M > Dedicated Resources: [NONE] > Opsys: linux Arch: [NONE] > Speed: 1.00 Load: 0.050 > Network: [DEFAULT] > Features: [NONE] > Attributes: [Batch] > Classes: [batch 2:2] > > Total Time: 16:11:49 Up: 16:11:49 (100.00%) Active: 00:01:00 (0.10%) > > Reservations: > NOTE: no reservations on node > > > > When the job is added, maui.log shows this: > 04/05 13:21:34 MPBSJobLoad(55,55.testing.azr.nl,J,TaskList,0) > 04/05 13:21:34 MReqCreate(55,SrcRQ,DstRQ,DoCreate) > 04/05 13:21:34 INFO: processing node request line '1' > 04/05 13:21:34 MJobSetCreds(55,sebastiaan,sebastiaan,) > 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) > (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) > 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) > (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) > 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) > (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) > 04/05 13:21:34 INFO: job '55' loaded: 1 sebastiaan sebastiaan > 21600 Idle 0 1333624893 [NONE] [NONE] [NONE] >= 0 >= 0 > [1][ppn=1] 1333624894 > 04/05 13:21:34 INFO: 12 PBS jobs detected on RM TESTING > 04/05 13:21:34 INFO: jobs detected: 12 > 04/05 13:21:34 MStatClearUsage(node,Active) > 04/05 13:21:34 MClusterUpdateNodeState() > 04/05 13:21:34 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg) > 04/05 13:21:34 INFO: job '40' Priority: 22 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '41' Priority: 22 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '42' Priority: 22 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 
22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '44' Priority: 22 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '45' Priority: 22 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '47' Priority: 22 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '48' Priority: 16 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '49' Priority: 12 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '52' Priority: 8 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '53' Priority: 1 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '54' Priority: 60 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '55' Priority: 1 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 MStatClearUsage([NONE],Active) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 INFO: total 
jobs selected (ALL): 1/12 [EState: 11] > 04/05 13:21:34 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg) > 04/05 13:21:34 INFO: job '40' Priority: 22 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '41' Priority: 22 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '42' Priority: 22 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '44' Priority: 22 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '45' Priority: 22 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '47' Priority: 22 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '48' Priority: 16 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '49' Priority: 12 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '52' Priority: 8 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '53' Priority: 1 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '54' Priority: 60 > 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 INFO: job '55' Priority: 1 > 04/05 13:21:34 INFO: Cred: 
0(00.0) FS: 0(00.0) Attr: > 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > 0(00.0) > 04/05 13:21:34 MStatClearUsage([NONE],Idle) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 MResDestroy(NULL) > 04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11] > 04/05 13:21:34 > MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE) > 04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1 > 04/05 13:21:34 MQueueScheduleRJobs(Q) > 04/05 13:21:34 > MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) > 04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1 > 04/05 13:21:34 > MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) > 04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 1/1 > 04/05 13:21:34 MQueueScheduleIJobs(Q,DEFAULT) > 04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in > partition DEFAULT (1 Needed) > 04/05 13:21:34 MJobPReserve(55,DEFAULT,ResCount,ResCountRej) > 04/05 13:21:34 MJobReserve(55,Priority) > 04/05 13:21:34 ALERT: job 55 cannot run in any partition > 04/05 13:21:34 ALERT: cannot create new reservation for job 55 > (shape[1] 1) > 04/05 13:21:34 ALERT: cannot create new reservation for job 55 > 04/05 13:21:34 MJobSetHold(55,16,1:00:00,NoResources,cannot create > reservation for job '55' (intital reservation attempt) > ) > 04/05 13:21:34 ALERT: job '55' cannot run (deferring job for 3600 > seconds) > 04/05 13:21:34 WARNING: cannot reserve priority job '55' > Active Jobs------ > ------------------ > 04/05 13:21:34 INFO: resources available after scheduling: N: 1 P: 2 > 04/05 13:21:34 > 
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) > 04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 0/1 > [EState: 1] > 04/05 13:21:34 > MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE) > 04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 > [EState: 1] > 04/05 13:21:34 > MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) > 04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 > [EState: 1] > 04/05 13:21:34 MSchedUpdateStats() > 04/05 13:21:34 INFO: iteration: 288 scheduling time: 0.002 seconds > 04/05 13:21:34 MResUpdateStats() > 04/05 13:21:34 INFO: current util[288]: 0/1 (0.00%) PH: 0.00% > active jobs: 0 of 2 (completed: 1) > 04/05 13:21:34 MQueueCheckStatus() > 04/05 13:21:34 MNodeCheckStatus() > 04/05 13:21:34 MUClearChild(PID) > 04/05 13:21:34 INFO: scheduling complete. sleeping 30 seconds > > > I think the relevant line is: > 04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in > partition DEFAULT (1 Needed) > > but I have no idea how to make a feasible task for the job. I have tried > queueing with -l nodes=1:ppn=1 -l walltime=2:00:00, etc. but none seem > to have had effect. > > > > Torque config. I have tried setting different attributes to the queue > properties, hoping that it would have some effect: > # qmgr -c "p s" > # > # Create queues and set their attributes. 
> # > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch Priority = 20 > set queue batch max_running = 8 > set queue batch resources_max.ncpus = 8 > set queue batch resources_max.nodect = 10 > set queue batch resources_max.nodes = 2 > set queue batch resources_min.ncpus = 0 > set queue batch resources_default.mem = 2000mb > set queue batch resources_default.ncpus = 1 > set queue batch resources_default.neednodes = 1:ppn=1 > set queue batch resources_default.nodect = 1 > set queue batch resources_default.nodes = 1 > set queue batch resources_default.pvmem = 16000mb > set queue batch resources_default.walltime = 06:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Set server attributes. > # > set server scheduling = True > set server acl_hosts = testing.azr.nl > set server log_events = 511 > set server mail_from = adm > set server resources_available.nodect = 10 > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server next_job_number = 56 > > > Maui configuration, untouched: > # maui.cfg 3.3.1 > > SERVERHOST testing > # primary admin must be first in list > ADMIN1 root > > # Resource Manager Definition > > RMCFG[TESTING] TYPE=PBS > > # Allocation Manager Definition > > AMCFG[bank] TYPE=NONE > > # full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html > # use the 'schedctl -l' command to display current configuration > > RMPOLLINTERVAL 00:00:30 > > SERVERPORT 42559 > SERVERMODE NORMAL > > # Admin: http://supercluster.org/mauidocs/a.esecurity.html > > > LOGFILE maui.log > LOGFILEMAXSIZE 10000000 > LOGLEVEL 3 > > # Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html > > QUEUETIMEWEIGHT 1 > > # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html > > #FSPOLICY PSDEDICATED > #FSDEPTH 7 > #FSINTERVAL 86400 > #FSDECAY 0.80 > > # Throttling Policies: > 
http://supercluster.org/mauidocs/6.2throttlingpolicies.html
>
> # NONE SPECIFIED
>
> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html
>
> BACKFILLPOLICY FIRSTFIT
> RESERVATIONPOLICY CURRENTHIGHEST
>
> # Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html
>
> NODEALLOCATIONPOLICY MINRESOURCE
>
> # QOS: http://supercluster.org/mauidocs/7.3qos.html
>
> # QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE
>
> # Standing Reservations:
> http://supercluster.org/mauidocs/7.1.3standingreservations.html
>
> # SRSTARTTIME[test] 8:00:00
> # SRENDTIME[test] 17:00:00
> # SRDAYS[test] MON TUE WED THU FRI
> # SRTASKCOUNT[test] 20
> # SRMAXTIME[test] 0:30:00
>
> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
>
> # USERCFG[DEFAULT] FSTARGET=25.0
> # USERCFG[john] PRIORITY=100 FSTARGET=10.0-
> # GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
> # CLASSCFG[batch] FLAGS=PREEMPTEE
> # CLASSCFG[interactive] FLAGS=PREEMPTOR
>
>
>
> Any ideas?
>
> Thanks in advance,
> Sebastiaan
--

 The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
 phone:+47 77 64 41 07, fax:+47 77 64 41 00
 Roy Dragseth, Team Leader, High Performance Computing
 Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
From chris.evert at geokinetics.com Thu May 3 07:14:17 2012
From: chris.evert at geokinetics.com (Chris Evert)
Date: Thu, 3 May 2012 08:14:17 -0500
Subject: [torqueusers] Job not receiving class in maui
In-Reply-To:
References:
Message-ID: <4FA284A9.4020609@geokinetics.com>

Dan,

I had a similar problem with maui 3.3. There were not enough slots in a table.
My fix was to redefine some constants and recompile:

--- include/msched-common.h+ 2010-02-19 14:14:22.000000000 -0600
+++ include/msched-common.h 2011-09-22 12:13:05.000000000 -0500
@@ -475,8 +475,8 @@
 unsigned long InitialWorkload;
 } mrclass_t;
-#define MAX_MCLASS 16
-#define MMAX_CLASS 16
+#define MAX_MCLASS 32
+#define MMAX_CLASS 32
 #define MAX_MGRES 4

(Sorry to post maui stuff in torqueusers)

Regards,
Chris

On 05/02/2012 04:45 PM, Dan Kulinski wrote:
> We have a few queues defined (30 to be exact) and we have strange
> behavior occurring on the newer queues we have created. When we submit
> the job and it is sent to maui there is no class defined on these newer
> queues. The reason we have so many is due to software that interfaces
> with the queues not being able to use long queue names when querying for
> status.
>
> So I have two queue configurations. They are exactly the same. When I
> submit to one, I get the proper node selection based on the
> resources_default.neednodes. On the other I do not. Here are the
> configurations:
>
> Working queue:
>
> #
>
> # Create queues and set their attributes.
>
> #
>
> #
>
> # Create and define queue largeb_multiple
>
> #
>
> create queue largeb_multiple
>
> set queue largeb_multiple queue_type = Execution
>
> set queue largeb_multiple resources_default.neednodes = 8P32M
>
> set queue largeb_multiple enabled = True
>
> set queue largeb_multiple started = True
>
> Non-working queue:
>
> #
>
> # Create queues and set their attributes.
> > # > > # > > # Create and define queue lgB_multi > > # > > create queue lgB_multi > > set queue lgB_multi queue_type = Execution > > set queue lgB_multi resources_default.neednodes = 8P32M > > set queue lgB_multi enabled = True > > set queue lgB_multi started = True > > When I send a job to the queue this is what I see (via the diagnose > command): > > Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime > Network Opsys Arch Mem Disk Procs Class Features > > 32570 Running DEF 8 DEF 00:05:00 1 8 dank wheel - 00:00:03 [NONE] [NONE] > [NONE] >=0 >=0 NC0 [largeb_multiple:1] [8P32M] > > 32571 Running DEF 2 DEF 00:05:00 1 2 dank wheel - 00:00:01 [NONE] [NONE] > [NONE] >=0 >=0 NC0 [NONE] [NONE] > > As you can see, the job that is running correctly has the correct class. > The other job shows up with NONE and features of NONE. Has anyone else > ran into this? Any ideas? > > Thanks, > > Dan Kulinski > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From christina.salls at noaa.gov Thu May 3 05:49:07 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Thu, 3 May 2012 07:49:07 -0400 Subject: [torqueusers] python in qsub In-Reply-To: <4FA1B722.2090606@ldeo.columbia.edu> References: <242421BFAF465844BE24EB90BB97E221090905F3@ITSDAG1D.its.iastate.edu> <4FA1B722.2090606@ldeo.columbia.edu> Message-ID: On Wed, May 2, 2012 at 6:37 PM, Gus Correa wrote: > Hi Christina > > This may not fix the problem, but this line on > the job script surprised me: > > #PBS -l select=1:ncpus=1 > > Why not > > #PBS -l nodes=1:ppn=1 > > [assuming this is a serial/non-parallel job]? > > As Axel and James suggested, there may be an issue with the > user environment. > It may be useful to do 'env' at the head node command prompt, > and also inside the job script, to see if they match. 
> If something important is missing, > a simple fix is to add all needed > environment variables to the user's > .profile/.tcshrc file, but there are other ways > [like environment modules]. > Assuming the home directory is shared on the nodes, > these files should be sourced when the job is launched. > Or to use qsub -V as James suggested. > Hi Gus, Thanks for the response. We did try the qsub -V to no avail. The first thing that I did was to have him include a printenv in the script and also run it outside of the script. If the answer was in the environment variables it was not obvious to me. The major difference was the mpt module that was loaded in the user environment. I am including the output from those commands below. Since python is an interactive program, I am wondering if the same R and rpy2 rpms need to be installed on the compute node? I am thinking of getting mom running on the head node. If he can submit jobs to the head node through torque, that may point to differences between the head node and compute nodes. At any rate, these are the printenv outputs. 
Interactive MKLROOT=/opt/intel/composer_xe_2011_sp1.7.256/mkl MANPATH=/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/opt/sgi/mpt/mpt-2.04/man:/usr/man:/usr/share/catman:/usr/share/man:/usr/catman: HOSTNAME=wings.glerl.noaa.gov SELINUX_ROLE_REQUESTED= INTEL_LICENSE_FILE=/opt/intel/composer_xe_2011_sp1.7.256/licenses:/opt/intel/licenses:/zeus/d2/users/hunter/intel/licenses TERM=xterm SHELL=/bin/bash HISTSIZE=1000 SSH_CLIENT=192.94.173.186 3021 22 LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/opt/sgi/mpt/mpt-2.04/lib SELINUX_USE_CURRENT_RANGE= FPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include:/opt/sgi/mpt/mpt-2.04/include QTDIR=/usr/lib64/qt-3.3 QTINC=/usr/lib64/qt-3.3/include DAT_OVERRIDE=/etc/dat.conf SSH_TTY=/dev/pts/27 USER=hunter LD_LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/opt/sgi/mpt/mpt-2.04/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/opt/intel/composer_xe_2011_sp1.7.256/debugger/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/lib/intel64:/opt/sgi/sgimc/lib 
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36: CPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include:/opt/sgi/mpt/mpt-2.04/include NLSPATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/debugger/intel64/locale/%l_%t/%N MAIL=/var/spool/mail/hunter PATH=/opt/intel/composer_xe_2011_sp1.7.256/bin/intel64:/opt/sgi/mpt/mpt-2.04/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/sgi/sgimc/bin:/opt/sgi/sbin:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/bin/intel64:/zeus/d2/users/hunter/bin I_MPI_RDMA_CREATE_CONN_QUAL=0 PWD=/zeus/d2/users/hunter/lbrm/kriging/batch 
_LMFILES_=/usr/share/Modules/modulefiles/mpt/2.04 LANG=en_US.UTF-8 MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles LOADEDMODULES=mpt/2.04 MGR_HOME=/opt/sgi/sgimc SELINUX_LEVEL_REQUESTED= SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass HISTCONTROL=ignoredups SHLVL=1 HOME=/zeus/d2/users/hunter MPI_ROOT=/opt/sgi/mpt/mpt-2.04 LOGNAME=hunter QTLIB=/usr/lib64/qt-3.3/lib CVS_RSH=ssh SSH_CONNECTION=192.94.173.186 3021 192.94.173.9 22 MODULESHOME=/usr/share/Modules LESSOPEN=|/usr/bin/lesspipe.sh %s INCLUDE=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include G_BROKEN_FILENAMES=1 module=() { eval `/usr/bin/modulecmd bash $*` } _=/usr/bin/printenv OLDPWD=/zeus/d2/users/hunter Within torque MKLROOT=/opt/intel/composer_xe_2011_sp1.7.256/mkl MANPATH=/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/usr/share/man/overrides:/usr/share/man/en:/usr/share/man:/usr/local/share/man:/usr/local/man: HOSTNAME=n020 PBS_VERSION=TORQUE-2.5.9 INTEL_LICENSE_FILE=/opt/intel/composer_xe_2011_sp1.7.256/licenses:/opt/intel/licenses:/zeus/d2/users/hunter/intel/licenses SHELL=/bin/bash HISTSIZE=1000 PBS_JOBNAME=krigtest LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64 FPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include PBS_ENVIRONMENT=PBS_BATCH OLDPWD=/zeus/d2/users/hunter QTDIR=/usr/lib64/qt-3.3 QTINC=/usr/lib64/qt-3.3/include PBS_O_WORKDIR=/zeus/d2/users/hunter/lbrm/kriging/batch DAT_OVERRIDE=/etc/dat.conf USER=hunter PBS_TASKNUM=1 
LD_LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/opt/intel/composer_xe_2011_sp1.7.256/debugger/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/lib/intel64 PBS_O_HOME=/zeus/d2/users/hunter CPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include PBS_MOMPORT=15003 PBS_GPUFILE=/var/spool/torque/aux//785.admin.default.domaingpu PBS_O_QUEUE=hydrology NLSPATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/debugger/intel64/locale/%l_%t/%N PATH=/opt/intel/composer_xe_2011_sp1.7.256/bin/intel64:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/sgi/sgimc/bin:/zeus/d2/users/hunter/bin:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/bin/intel64 PBS_O_LOGNAME=hunter MAIL=/var/spool/mail/hunter PBS_O_LANG=en_US.UTF-8 PBS_JOBCOOKIE=153B7A805713EAF7B67869983204CFBD I_MPI_RDMA_CREATE_CONN_QUAL=0 PWD=/zeus/d2/users/hunter/lbrm/kriging/batch LANG=en_US.UTF-8 PBS_NODENUM=0 MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles PBS_NUM_NODES=1 LOADEDMODULES= PBS_O_SHELL=/bin/bash MGR_HOME=/opt/sgi/sgimc PBS_SERVER=admin.default.domain PBS_JOBID=785.admin.default.domain ENVIRONMENT=BATCH HISTCONTROL=ignoredups HOME=/zeus/d2/users/hunter SHLVL=2 PBS_O_HOST=admin.default.domain PBS_VNODENUM=0 LOGNAME=hunter CVS_RSH=ssh QTLIB=/usr/lib64/qt-3.3/lib PBS_QUEUE=hydrology MODULESHOME=/usr/share/Modules PBS_O_MAIL=/var/spool/mail/hunter LESSOPEN=|/usr/bin/lesspipe.sh %s PBS_NP=1 PBS_NUM_PPN=1 INCLUDE=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include PBS_NODEFILE=/var/spool/torque/aux//785.admin.default.domain G_BROKEN_FILENAMES=1 
PBS_O_PATH=/opt/intel/composer_xe_2011_sp1.7.256/bin/intel64:/opt/sgi/mpt/mpt-2.04/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/sgi/sgimc/bin:/opt/sgi/sbin:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/bin/intel64:/zeus/d2/users/hunter/bin module=() { eval `/usr/bin/modulecmd bash $*` } _=/usr/bin/printenv > Gus Correa > > > On 05/02/2012 11:28 AM, Coyle, James J [ITACD] wrote: > > Christina, > > > > I am going to assume that by interactively, you mean > > just by running the script or the command rather than > > using qsub -I > > If it is the latter, the comments below will not apply. > > > > > > Assuming that your site's policies allow you to run as the user in > question: (Mine do as long as I am acting on a > > request for help from the user, though I always ask if > > it OK first.) > > > > Try ssh'ing into the compute node that had the problem > > as the user in question, then run interactively there, you should see > the same problem. > > > > As root, you can find the node that ran the job by issuing > > tracejob -n 7 786 > > if the job was run with the last 7 days. The node > > on which the job launched would be the first node > > in the nodelist. > > > > Also as root on the head node, you can issue > > su - tim.hunter > > > > to become the user on the head node, then ssh into the node in question > as the user, and try the script interactively. > > > > > > As Alex said, it could be a ldconfig or LD_LIBRARY_PATH > > issue, or you could have variables set in the /etc/profile.d > > scripts or /etc/login.csh of the user is a csh user. > > (Is the home directory shared so that any rc files are > > run the same on all nodes.) > > > > As a workaround, you could see if qsub -V works since that > > will pass the environment. If it does, then some env variables are > being set by the user or by a script that > > runs only on the head node. > > > > > > > > > > > > Also, is /bin/sh the user's normal login shell? 
> > > >> -----Original Message----- > >> From: torqueusers-bounces at supercluster.org [mailto:torqueusers- > >> bounces at supercluster.org] On Behalf Of Axel Kohlmeyer > >> Sent: Wednesday, May 02, 2012 9:22 AM > >> To: Torque Users Mailing List > >> Subject: Re: [torqueusers] python in qsub > >> > >> christina, > >> > >> have you checked whether the problem isn't with R instead of python? > >> although the PYTHONPATH environment variable would be the primary > >> suspect, but it could even by LD_LIBRARY_PATH. python is a bit > >> strange in reporting certain kinds of error. > >> > >> axel. > >> > >> On Tue, May 1, 2012 at 1:49 PM, Christina Salls > >> wrote: > >>> Hi all, > >>> > >>> My configuration consists of Torque 2.5.9 running on RHEL 6 > >> on a > >>> single cluster. I have a user that is able to interactively run > >>> python and R, but when his job is submitted to Torque, it fails > >> with this message: > >>> > >>> Traceback (most recent call last): > >>> > >>> File "../progs/Kriging_Batch.py", line 110, in > >>> > >>> import rpy2.robjects as robjects > >>> > >>> ImportError: No module named rpy2.robjects > >>> > >>> > >>> I asked him to include a printenv in his script and compared it to > >> the > >>> environment variables that are inherent in his login. I could not > >>> tell from the comparison what might be causing this problem. 
> >>> > >>> > >>> His script looks like this: > >>> > >>> > >>> [root at zeus batch]# more runkrig.sh > >>> > >>> #!/bin/sh > >>> > >>> #submit with qsub runkrig.sh > >>> > >>> #PBS -N GaugesJob > >>> > >>> #PBS -l select=1:ncpus=1 > >>> > >>> #PBS -q hydrology > >>> > >>> #PBS -M tim.hunter at noaa.gov > >>> > >>> source /opt/intel/composer_xe_2011_sp1.7.256/bin/compilervars.sh > >>> intel64 > >>> > >>> > >>> > >>> cd /zeus/d2/users/hunter/lbrm/kriging/batch > >>> > >>> > >>> > >>> echo 'Begin Kriging of Met Data with python and R' > >>> > >>> date > >>> > >>> python ../progs/Kriging_Batch.py $1 $2 $3 > >>> > >>> echo 'Kriging operations complete!' > >>> > >>> date > >>> > >>> > >>> The output file looks like this: > >>> > >>> > >>> [root at zeus batch]# more GaugesJob.o786 > >>> > >>> Begin Kriging of Met Data with python and R > >>> > >>> Fri Apr 27 13:44:42 CDT 2012 > >>> > >>> Kriging operations complete! > >>> > >>> Fri Apr 27 13:44:42 CDT 2012 > >>> > >>> > >>> > >>> The Kriging_Batch.py script works just fine interactively. If I > >> run > >>> the import command interactively, it also works. > >>> > >>> > >>> eg. > >>> > >>> python -c "import rpy2.robjects as robjects" > >>> > >>> > >>> I am sure there is a simple explanation, and if any of you have > >> any > >>> clues to lead me in the right direction, I would greatly > >> appreciate it. > >>> > >>> > >>> Even in the torque environment, the other python imports are > >> working > >>> properly. It only seems to be choking on the rpy2 import. 
> >>> > >>> > >>> This is the one portion of the script > >>> > >>> > >>> #----------------------------------------------------------------- > >> ---- > >>> ----------------- > >>> > >>> import os > >>> > >>> from os.path import normpath > >>> > >>> import sys > >>> > >>> import shutil > >>> > >>> import csv > >>> > >>> > >>> import rpy2.objects as robjects > >>> > >>> > >>> # > >>> > >>> r = robjects.r > >>> > >>> path = os.getcwd() > >>> > >>> print "curdir = %s" %path > >>> > >>> > >>> # > >>> > >>> oldPath = os.environ['PATH'].split(os.pathsep) > >>> > >>> newPath = os.environ['PATH'].split(os.pathsep) > >>> > >>> newPath = os.pathsep.join(newPath[len(oldPath):] + > >>> newPath[:len(oldPath)]) > >>> > >>> > >>> # > >>> > >>> # Load the R functions > >>> > >>> # > >>> > >>> print "load the R functions" > >>> > >>> > >> r.load('/zeus/d2/users/hunter/lbrm/kriging/progs/do_krig_batch.RData > >> ') > >>> > >>> > >>> # > >>> > >>> # Force stdout into an unbuffered mode > >>> > >>> # > >>> > >>> sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) > >>> > >>> > >>> > >>> Any advice, clues, hints, moral support, etc.... welcomed and > >> appreciated. > >>> > >>> > >>> Thanks, > >>> > >>> > >>> Christina > >>> > >>> > >>> > >>> > >>> -- > >>> Christina A. Salls > >>> GLERL Computer Group > >>> help.glerl at noaa.gov > >>> Help Desk x2127 > >>> Christina.Salls at noaa.gov > >>> Voice Mail 734-741-2446 > >>> > >>> > >>> > >>> _______________________________________________ > >>> torqueusers mailing list > >>> torqueusers at supercluster.org > >>> http://www.supercluster.org/mailman/listinfo/torqueusers > >>> > >> > >> > >> > >> -- > >> Dr. Axel Kohlmeyer akohlmey at gmail.com > >> http://sites.google.com/site/akohlmey/ > >> > >> Institute for Computational Molecular Science Temple University, > >> Philadelphia PA, USA. 
--
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120503/4d269838/attachment-0001.html

From richard_bach at merck.com Thu May 3 07:49:53 2012
From: richard_bach at merck.com (Bach, Richard A)
Date: Thu, 3 May 2012 09:49:53 -0400
Subject: [torqueusers] python in qsub
In-Reply-To: <4FA1B722.2090606@ldeo.columbia.edu>
References: <242421BFAF465844BE24EB90BB97E221090905F3@ITSDAG1D.its.iastate.edu> <4FA1B722.2090606@ldeo.columbia.edu>
Message-ID:

This looks like a line used for PBSPro and not torque. Is this a torque environment?

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gus Correa
Sent: Wednesday, May 02, 2012 6:37 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] python in qsub

Hi Christina

This may not fix the problem, but this line on the job script surprised me:

#PBS -l select=1:ncpus=1

Why not

#PBS -l nodes=1:ppn=1

[assuming this is a serial/non-parallel job]?

As Axel and James suggested, there may be an issue with the user environment. It may be useful to do 'env' at the head node command prompt, and also inside the job script, to see if they match.
If something important is missing, a simple fix is to add all needed environment variables to the user's .profile/.tcshrc file, but there are other ways [like environment modules]. Assuming the home directory is shared on the nodes, these files should be sourced when the job is launched. Or to use qsub -V as James suggested. Gus Correa On 05/02/2012 11:28 AM, Coyle, James J [ITACD] wrote: > Christina, > > I am going to assume that by interactively, you mean > just by running the script or the command rather than > using qsub -I > If it is the latter, the comments below will not apply. > > > Assuming that your site's policies allow you to run as the user in question: (Mine do as long as I am acting on a > request for help from the user, though I always ask if > it OK first.) > > Try ssh'ing into the compute node that had the problem > as the user in question, then run interactively there, you should see the same problem. > > As root, you can find the node that ran the job by issuing > tracejob -n 7 786 > if the job was run with the last 7 days. The node > on which the job launched would be the first node > in the nodelist. > > Also as root on the head node, you can issue > su - tim.hunter > > to become the user on the head node, then ssh into the node in question as the user, and try the script interactively. > > > As Alex said, it could be a ldconfig or LD_LIBRARY_PATH > issue, or you could have variables set in the /etc/profile.d > scripts or /etc/login.csh of the user is a csh user. > (Is the home directory shared so that any rc files are > run the same on all nodes.) > > As a workaround, you could see if qsub -V works since that > will pass the environment. If it does, then some env variables are being set by the user or by a script that > runs only on the head node. > > > > > > Also, is /bin/sh the user's normal login shell? 
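
The comparison suggested above (env at the login prompt versus env inside the job) can be scripted rather than eyeballed. The following is only a minimal Python 3 sketch and was not posted in the thread: parse_env, diff_env and missing_entries are hypothetical helper names, the sample values are shortened for illustration, and it assumes both printenv outputs were saved as text.

```python
def parse_env(text):
    """Turn `printenv` output into a dict of NAME -> value."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        # Continuation lines of exported shell functions (e.g. a lone "}")
        # have no valid NAME before "=" and are skipped.
        if sep and name.isidentifier():
            env[name] = value
    return env


def diff_env(login_env, job_env, ignore_prefixes=("PBS_",)):
    """Return (only_in_login, only_in_job, changed) as sorted name lists.

    Names starting with `ignore_prefixes` are expected to exist only inside
    a batch job, so they are left out of the second list.
    """
    only_login = sorted(set(login_env) - set(job_env))
    only_job = sorted(
        name for name in set(job_env) - set(login_env)
        if not name.startswith(ignore_prefixes)
    )
    changed = sorted(
        name for name in set(login_env) & set(job_env)
        if login_env[name] != job_env[name]
    )
    return only_login, only_job, changed


def missing_entries(login_value, job_value, sep=":"):
    """List path-style entries present at login but absent inside the job."""
    job_entries = set(job_value.split(sep))
    return [e for e in login_value.split(sep) if e and e not in job_entries]


if __name__ == "__main__":
    # Shortened, illustrative values; real input would be the saved outputs.
    login = parse_env("LOADEDMODULES=mpt/2.04\nPATH=/usr/local/bin:/bin:/usr/bin\nTERM=xterm\n")
    job = parse_env("LOADEDMODULES=\nPATH=/bin:/usr/bin\nPBS_JOBID=785.admin.default.domain\n")
    only_login, only_job, changed = diff_env(login, job)
    print("only at login:", only_login)
    print("only in job:  ", only_job)
    print("changed:      ", changed)
    for name in ("PATH", "LD_LIBRARY_PATH", "PYTHONPATH"):
        if name in login and name in job:
            print(name, "entries missing in job:", missing_entries(login[name], job[name]))
```

Listing the missing entries of PATH-like variables separately is a deliberate choice: two long colon-separated values are hard to compare by eye, and a directory dropped from the job's search path is easy to overlook.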

From WJEdsall at dow.com Thu May 3 11:18:57 2012
From: WJEdsall at dow.com (Edsall, William (WJ))
Date: Thu, 3 May 2012 17:18:57 +0000
Subject: [torqueusers] strategies for bad nodes
In-Reply-To:
References:
Message-ID:

Hi list,

Thanks to Rick and Axel for responding here. You led me in the right direction!
I just wanted to give my own response to catalog the specifics of this solution, at least the way we did it. At first I assumed the scheduler would be managing the downed nodes, but it appears that pbs_mom controls the offline/online status of the node.

Here are the mom config settings we used:

$node_check_script /your_path_to/node_health.sh
$node_check_interval 1,jobstart,jobend
$down_on_error 1

These are paired with a simple node_health.sh script. If the script returns a line such as "ERROR some message", the node will be offlined and a message will be set. Note that while the Torque documentation suggests that the message will display in pbsnodes -ln, this was not the case for us. The message does, however, show up in the status line of each node's individual pbsnodes information.

Here is a simple script I'm testing to check for full scratch, full root, infiniband, and ypbind:

#!/bin/bash
# Percentage of /scr in use (strip the trailing % sign)
SCRSPC=`df -h /scr/ | tail -n1 | awk '{print $5}' | sed 's/%//' 2> /dev/null`
if [ "$SCRSPC" -gt "50" ];then
  echo "ERROR /scr/ over 50%"
fi
# Percentage of / in use
ROOTSPC=`df -h / | tail -n1 | awk '{print $5}' | sed 's/%//' 2> /dev/null`
if [ "$ROOTSPC" -gt "75" ];then
  echo "ERROR / over 75%"
fi
# ypbind must be bound to our NIS server
YPCHK=`ypwhich 2> /dev/null`
if [ "$YPCHK" != "tsunami.local" ];then
  echo "ERROR ypbind"
fi
# Only check Infiniband on nodes that have ibstat installed
IBSTAT=`which ibstat 2> /dev/null`
if [ -e "${IBSTAT}" ];then
  IBCHK=`ibstat | grep LinkUp | wc -l`
  IBEXISTS=`ibstat`
  if [ -n "${IBEXISTS}" -a "${IBCHK}" -lt "1" ];then
    echo "ERROR Infiniband link down (check ibstat)"
  fi
fi
exit 0

If anyone has suggestions for more useful 'checks', please comment. These are pretty custom to our environment, but I'd like to hear of more. I tried the program from warewulf, but it wasn't working well for me.
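
The disk-space part of such a check could also be written in Python if the bash version becomes unwieldy. This is a minimal sketch, assuming Python 3 is available on the compute nodes; percent_used and check_filesystems are hypothetical names, and the limits simply copy the 50% and 75% values from the script above. The only contract with pbs_mom assumed here is the one already described: output starting with ERROR marks the node. Note that shutil.disk_usage reports used/total, which can differ slightly from the Use% column of df because of reserved blocks.

```python
#!/usr/bin/env python3
"""Sketch of a disk-space health check; limits and paths are illustrative."""
import shutil


def percent_used(path):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def check_filesystems(limits, probe=percent_used):
    """Return one 'ERROR ...' line per filesystem above its limit.

    `limits` maps a mount point to the maximum allowed percentage.
    `probe` is injectable so the logic can be tried without real mounts.
    """
    errors = []
    for path, limit in sorted(limits.items()):
        try:
            used = probe(path)
        except OSError as exc:
            # A missing or unreadable mount point is reported as a failure.
            errors.append("ERROR cannot stat %s (%s)" % (path, exc))
            continue
        if used > limit:
            errors.append("ERROR %s over %d%%" % (path, limit))
    return errors


if __name__ == "__main__":
    # Hypothetical limits mirroring the shell script above.
    for line in check_filesystems({"/scr": 50, "/": 75}):
        print(line)
```

Keeping the probe as a parameter means the threshold logic can be exercised on a workstation with made-up numbers before the script is pointed at by $node_check_script.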

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Rick McKay
Sent: Tuesday, April 17, 2012 10:15 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] strategies for bad nodes

Hello William,

Michael Jennings at Lawrence Berkeley just did a great presentation about the Node Health Check subproject of Warewulf that you might want to look into, too. It's an excellent expansion of what's in the Adaptive TORQUE docs. It's well-documented, implemented almost entirely in bash, and easy to extend for your specific needs.

http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check

Regards,
Rick

On Tue, Apr 17, 2012 at 7:02 AM, Edsall, William (WJ) wrote:

Hello list,

I'm looking for ideas on how to prevent jobs from going to 'bad' nodes. There are a small handful of items which define a bad node for us, such as ypbind not bound, maybe /scr is full, etc. We need to be able to customize this list. What might be built into torque to achieve this? It would be ideal if the node was not only passed by for a job but even offlined with a comment.

Thanks,
William

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120503/635ecfa3/attachment-0001.html

From smdogra at gmail.com Thu May 3 11:21:25 2012
From: smdogra at gmail.com (Dr. Sunil M. Dogra)
Date: Thu, 3 May 2012 14:21:25 -0300
Subject: [torqueusers] Unsubscribe
Message-ID:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dr. Sunil Manohar Dogra
FAPESP Postdoctoral Research Fellow,
Instituto de Fisica Teorica - UNESP
Rua Dr. Bento Teobaldo Ferraz 271, Bl. II, Barra Funda,
01140-070, Sao Paulo, Brazil.
Cell Phone: +55 11 5121 5364
Tel. No.: +55 11 3393-7861
email:sdogra at cern.ch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120503/0d5e21e8/attachment.html

From gus at ldeo.columbia.edu Thu May 3 11:25:46 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Thu, 03 May 2012 13:25:46 -0400
Subject: [torqueusers] python in qsub
In-Reply-To:
References: <242421BFAF465844BE24EB90BB97E221090905F3@ITSDAG1D.its.iastate.edu> <4FA1B722.2090606@ldeo.columbia.edu>
Message-ID: <4FA2BF9A.9080707@ldeo.columbia.edu>

Hi Christina

I don't know if I am on target with my answer. If I missed the point, forgive me please.

1) Two alternative ways to make software available on all nodes [R, R Python modules, Matlab, MPI, you name them] are:

A) Install on every node [say, via RPM, apt-get, whatever]
B) Install on a directory that is shared by all nodes [typically a directory on the head node or on a storage node, NFS mounted on all nodes].

If you choose A), you need to install everywhere. B) is less scalable, but is OK for clusters that are not too big [like yours]. I use B) for everything that needs installation from source code, including MPI, in small clusters.

A) is harder to maintain 'by hand' but it is feasible. Alternatively, you may want to elect a node to be your 'master' node, install stuff there, and then propagate that node image to the other nodes. How you propagate may depend very much on the software you use to manage the cluster. For instance, Rocks Clusters have their own protocol to do these things, xCat has a different way, etc.

**

Interesting that you seem to have some SGI software in there. If I remember right, mpt is the SGI stuff for their MPI version. Is this an SGI cluster [or was it in the past]?

**

I agree with you. I can't see any smoking gun in the environment variables.
But if you didn't install the R and rpy2 RPMs on the nodes, that is most likely the problem. In any case, I strongly suggest that you keep the nodes homogeneous. The only exception is the head node at most. What you install in one, install in all of them, regardless of how you do it. Clusters with heterogeneous software are hard to manage. I hope this helps, Gus Correa On 05/03/2012 07:49 AM, Christina Salls wrote: > > > On Wed, May 2, 2012 at 6:37 PM, Gus Correa > wrote: > > Hi Christina > > This may not fix the problem, but this line on > the job script surprised me: > > #PBS -l select=1:ncpus=1 > > Why not > > #PBS -l nodes=1:ppn=1 > > [assuming this is a serial/non-parallel job]? > > As Axel and James suggested, there may be an issue with the > user environment. > It may be useful to do 'env' at the head node command prompt, > and also inside the job script, to see if they match. > If something important is missing, > a simple fix is to add all needed > environment variables to the user's > .profile/.tcshrc file, but there are other ways > [like environment modules]. > Assuming the home directory is shared on the nodes, > these files should be sourced when the job is launched. > Or to use qsub -V as James suggested. > > > Hi Gus, > > Thanks for the response. We did try the qsub -V to no avail. The > first thing that I did was to have him include a printenv in the script > and also run it outside of the script. If the answer was in the > environment variables it was not obvious to me. The major difference > was the mpt module that was loaded in the user environment. I am > including the output from those commands below. Since python is an > interactive program, I am wondering if the same R and rpy2 rpms need to > be installed on the compute node? I am thinking of getting mom running > on the head node. If he can submit jobs to the head node through > torque, that may point to differences between the head node and compute > nodes. 
At any rate, these are the printenv outputs. > > Interactive > MKLROOT=/opt/intel/composer_xe_2011_sp1.7.256/mkl > MANPATH=/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/opt/sgi/mpt/mpt-2.04/man:/usr/man:/usr/share/catman:/usr/share/man:/usr/catman: > HOSTNAME=wings.glerl.noaa.gov > SELINUX_ROLE_REQUESTED= > INTEL_LICENSE_FILE=/opt/intel/composer_xe_2011_sp1.7.256/licenses:/opt/intel/licenses:/zeus/d2/users/hunter/intel/licenses > TERM=xterm > SHELL=/bin/bash > HISTSIZE=1000 > SSH_CLIENT=192.94.173.186 3021 22 > LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/opt/sgi/mpt/mpt-2.04/lib > SELINUX_USE_CURRENT_RANGE= > FPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include:/opt/sgi/mpt/mpt-2.04/include > QTDIR=/usr/lib64/qt-3.3 > QTINC=/usr/lib64/qt-3.3/include > DAT_OVERRIDE=/etc/dat.conf > SSH_TTY=/dev/pts/27 > USER=hunter > LD_LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/opt/sgi/mpt/mpt-2.04/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/opt/intel/composer_xe_2011_sp1.7.256/debugger/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/lib/intel64:/opt/sgi/sgimc/lib > 
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36: > CPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include:/opt/sgi/mpt/mpt-2.04/include > NLSPATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/debugger/intel64/locale/%l_%t/%N > MAIL=/var/spool/mail/hunter > PATH=/opt/intel/composer_xe_2011_sp1.7.256/bin/intel64:/opt/sgi/mpt/mpt-2.04/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/sgi/sgimc/bin:/opt/sgi/sbin:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/bin/intel64:/zeus/d2/users/hunter/bin > I_MPI_RDMA_CREATE_CONN_QUAL=0 > PWD=/zeus/d2/users/hunter/lbrm/kriging/batch >
_LMFILES_=/usr/share/Modules/modulefiles/mpt/2.04 > LANG=en_US.UTF-8 > MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles > LOADEDMODULES=mpt/2.04 > MGR_HOME=/opt/sgi/sgimc > SELINUX_LEVEL_REQUESTED= > SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass > HISTCONTROL=ignoredups > SHLVL=1 > HOME=/zeus/d2/users/hunter > MPI_ROOT=/opt/sgi/mpt/mpt-2.04 > LOGNAME=hunter > QTLIB=/usr/lib64/qt-3.3/lib > CVS_RSH=ssh > SSH_CONNECTION=192.94.173.186 3021 192.94.173.9 22 > MODULESHOME=/usr/share/Modules > LESSOPEN=|/usr/bin/lesspipe.sh %s > INCLUDE=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include > G_BROKEN_FILENAMES=1 > module=() { eval `/usr/bin/modulecmd bash $*` > } > _=/usr/bin/printenv > OLDPWD=/zeus/d2/users/hunter > Within torque > > MKLROOT=/opt/intel/composer_xe_2011_sp1.7.256/mkl > MANPATH=/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/usr/share/man/overrides:/usr/share/man/en:/usr/share/man:/usr/local/share/man:/usr/local/man: > HOSTNAME=n020 > PBS_VERSION=TORQUE-2.5.9 > INTEL_LICENSE_FILE=/opt/intel/composer_xe_2011_sp1.7.256/licenses:/opt/intel/licenses:/zeus/d2/users/hunter/intel/licenses > SHELL=/bin/bash > HISTSIZE=1000 > PBS_JOBNAME=krigtest > LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64 > FPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include > PBS_ENVIRONMENT=PBS_BATCH > OLDPWD=/zeus/d2/users/hunter > QTDIR=/usr/lib64/qt-3.3 > QTINC=/usr/lib64/qt-3.3/include > PBS_O_WORKDIR=/zeus/d2/users/hunter/lbrm/kriging/batch > DAT_OVERRIDE=/etc/dat.conf > USER=hunter > PBS_TASKNUM=1 > 
LD_LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/opt/intel/composer_xe_2011_sp1.7.256/debugger/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/lib/intel64 > PBS_O_HOME=/zeus/d2/users/hunter > CPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include > PBS_MOMPORT=15003 > PBS_GPUFILE=/var/spool/torque/aux//785.admin.default.domaingpu > PBS_O_QUEUE=hydrology > NLSPATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/debugger/intel64/locale/%l_%t/%N > PATH=/opt/intel/composer_xe_2011_sp1.7.256/bin/intel64:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/sgi/sgimc/bin:/zeus/d2/users/hunter/bin:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/bin/intel64 > PBS_O_LOGNAME=hunter > MAIL=/var/spool/mail/hunter > PBS_O_LANG=en_US.UTF-8 > PBS_JOBCOOKIE=153B7A805713EAF7B67869983204CFBD > I_MPI_RDMA_CREATE_CONN_QUAL=0 > PWD=/zeus/d2/users/hunter/lbrm/kriging/batch > LANG=en_US.UTF-8 > PBS_NODENUM=0 > MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles > PBS_NUM_NODES=1 > LOADEDMODULES= > PBS_O_SHELL=/bin/bash > MGR_HOME=/opt/sgi/sgimc > PBS_SERVER=admin.default.domain > PBS_JOBID=785.admin.default.domain > ENVIRONMENT=BATCH > HISTCONTROL=ignoredups > HOME=/zeus/d2/users/hunter > SHLVL=2 > PBS_O_HOST=admin.default.domain > PBS_VNODENUM=0 > LOGNAME=hunter > CVS_RSH=ssh > QTLIB=/usr/lib64/qt-3.3/lib > PBS_QUEUE=hydrology > MODULESHOME=/usr/share/Modules > PBS_O_MAIL=/var/spool/mail/hunter > LESSOPEN=|/usr/bin/lesspipe.sh %s > PBS_NP=1 > PBS_NUM_PPN=1 > INCLUDE=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include > PBS_NODEFILE=/var/spool/torque/aux//785.admin.default.domain > 
G_BROKEN_FILENAMES=1 > PBS_O_PATH=/opt/intel/composer_xe_2011_sp1.7.256/bin/intel64:/opt/sgi/mpt/mpt-2.04/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/sgi/sgimc/bin:/opt/sgi/sbin:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/bin/intel64:/zeus/d2/users/hunter/bin > module=() { eval `/usr/bin/modulecmd bash $*` > } > _=/usr/bin/printenv > > > Gus Correa > > > On 05/02/2012 11:28 AM, Coyle, James J [ITACD] wrote: > > Christina, > > > > I am going to assume that by interactively, you mean > > just by running the script or the command rather than > > using qsub -I > > If it is the latter, the comments below will not apply. > > > > > > Assuming that your site's policies allow you to run as the > user in question: (Mine do as long as I am acting on a > > request for help from the user, though I always ask if > > it OK first.) > > > > Try ssh'ing into the compute node that had the problem > > as the user in question, then run interactively there, you should > see the same problem. > > > > As root, you can find the node that ran the job by issuing > > tracejob -n 7 786 > > if the job was run with the last 7 days. The node > > on which the job launched would be the first node > > in the nodelist. > > > > Also as root on the head node, you can issue > > su - tim.hunter > > > > to become the user on the head node, then ssh into the node in > question as the user, and try the script interactively. > > > > > > As Alex said, it could be a ldconfig or LD_LIBRARY_PATH > > issue, or you could have variables set in the /etc/profile.d > > scripts or /etc/login.csh of the user is a csh user. > > (Is the home directory shared so that any rc files are > > run the same on all nodes.) > > > > As a workaround, you could see if qsub -V works since that > > will pass the environment. If it does, then some env variables > are being set by the user or by a script that > > runs only on the head node. 
> >> _______________________________________________
> >> torqueusers mailing list
> >> torqueusers at supercluster.org
> >> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> --
> Christina A. Salls
> GLERL Computer Group
> help.glerl at noaa.gov
> Help Desk x2127
> Christina.Salls at noaa.gov
> Voice Mail 734-741-2446
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From gus at ldeo.columbia.edu Thu May 3 11:35:38 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Thu, 03 May 2012 13:35:38 -0400
Subject: [torqueusers] python in qsub
In-Reply-To:
References: <242421BFAF465844BE24EB90BB97E221090905F3@ITSDAG1D.its.iastate.edu> <4FA1B722.2090606@ldeo.columbia.edu>
Message-ID: <4FA2C1EA.20904@ldeo.columbia.edu>

Hi Christina

As Richard suggests, your user seems to have copied this script over
from another cluster with another resource manager / queue system,
maybe PBSPro, not Torque.
It happens all the time.

Gus Correa

On 05/03/2012 09:49 AM, Bach, Richard A wrote:
> This looks like a line used for PBSPro and not Torque. Is this a Torque environment?
>
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gus Correa
> Sent: Wednesday, May 02, 2012 6:37 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] python in qsub
>
> Hi Christina
>
> This may not fix the problem, but this line on
> the job script surprised me:
>
> #PBS -l select=1:ncpus=1
>
> Why not
>
> #PBS -l nodes=1:ppn=1
>
> [assuming this is a serial/non-parallel job]?
>
> As Axel and James suggested, there may be an issue with the
> user environment.
> It may be useful to do 'env' at the head node command prompt,
> and also inside the job script, to see if they match.
> If something important is missing,
> a simple fix is to add all needed
> environment variables to the user's
> .profile/.tcshrc file, but there are other ways
> [like environment modules].
> Assuming the home directory is shared on the nodes,
> these files should be sourced when the job is launched.
> Or to use qsub -V as James suggested.
>
> Gus Correa
>
>
> On 05/02/2012 11:28 AM, Coyle, James J [ITACD] wrote:
>> Christina,
>>
>> I am going to assume that by interactively, you mean
>> just by running the script or the command rather than
>> using qsub -I
>> If it is the latter, the comments below will not apply.
>>
>>
>> Assuming that your site's policies allow you to run as the user
>> in question: (Mine do as long as I am acting on a
>> request for help from the user, though I always ask if
>> it is OK first.)
>>
>> Try ssh'ing into the compute node that had the problem
>> as the user in question, then run interactively there; you should
>> see the same problem.
>>
>> As root, you can find the node that ran the job by issuing
>> tracejob -n 7 786
>> if the job was run within the last 7 days. The node
>> on which the job launched would be the first node
>> in the nodelist.
>>
>> Also as root on the head node, you can issue
>> su - tim.hunter
>>
>> to become the user on the head node, then ssh into the node in
>> question as the user, and try the script interactively.
>>
>>
>> As Axel said, it could be a ldconfig or LD_LIBRARY_PATH
>> issue, or you could have variables set in the /etc/profile.d
>> scripts or /etc/login.csh if the user is a csh user.
>> (Is the home directory shared so that any rc files are
>> run the same on all nodes?)
>>
>> As a workaround, you could see if qsub -V works since that
>> will pass the environment. If it does, then some env variables are
>> being set by the user or by a script that
>> runs only on the head node.
>>
>>
>> Also, is /bin/sh the user's normal login shell?
>>
>>> -----Original Message-----
>>> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Axel Kohlmeyer
>>> Sent: Wednesday, May 02, 2012 9:22 AM
>>> To: Torque Users Mailing List
>>> Subject: Re: [torqueusers] python in qsub
>>>
>>> christina,
>>>
>>> have you checked whether the problem isn't with R instead of python?
>>> the PYTHONPATH environment variable would be the primary
>>> suspect, but it could even be LD_LIBRARY_PATH. python is a bit
>>> strange in reporting certain kinds of errors.
>>>
>>> axel.
>>>
>>> On Tue, May 1, 2012 at 1:49 PM, Christina Salls wrote:
>>>> Hi all,
>>>>
>>>> My configuration consists of Torque 2.5.9 running on RHEL 6 on a
>>>> single cluster. I have a user that is able to interactively run
>>>> python and R, but when his job is submitted to Torque, it fails
>>>> with this message:
>>>>
>>>> Traceback (most recent call last):
>>>>   File "../progs/Kriging_Batch.py", line 110, in <module>
>>>>     import rpy2.robjects as robjects
>>>> ImportError: No module named rpy2.robjects
>>>>
>>>> I asked him to include a printenv in his script and compared it to the
>>>> environment variables that are inherent in his login. I could not
>>>> tell from the comparison what might be causing this problem.
>>>>
>>>> His script looks like this:
>>>>
>>>> [root at zeus batch]# more runkrig.sh
>>>> #!/bin/sh
>>>> #submit with qsub runkrig.sh
>>>> #PBS -N GaugesJob
>>>> #PBS -l select=1:ncpus=1
>>>> #PBS -q hydrology
>>>> #PBS -M tim.hunter at noaa.gov
>>>> source /opt/intel/composer_xe_2011_sp1.7.256/bin/compilervars.sh intel64
>>>>
>>>> cd /zeus/d2/users/hunter/lbrm/kriging/batch
>>>>
>>>> echo 'Begin Kriging of Met Data with python and R'
>>>> date
>>>> python ../progs/Kriging_Batch.py $1 $2 $3
>>>> echo 'Kriging operations complete!'
>>>> date
>>>>
>>>> The output file looks like this:
>>>>
>>>> [root at zeus batch]# more GaugesJob.o786
>>>> Begin Kriging of Met Data with python and R
>>>> Fri Apr 27 13:44:42 CDT 2012
>>>> Kriging operations complete!
>>>> Fri Apr 27 13:44:42 CDT 2012
>>>>
>>>> The Kriging_Batch.py script works just fine interactively. If I run
>>>> the import command interactively, it also works.
>>>>
>>>> e.g.
>>>>
>>>> python -c "import rpy2.robjects as robjects"
>>>>
>>>> I am sure there is a simple explanation, and if any of you have any
>>>> clues to lead me in the right direction, I would greatly appreciate it.
>>>>
>>>> Even in the torque environment, the other python imports are working
>>>> properly. It only seems to be choking on the rpy2 import.
>>>>
>>>> This is the one portion of the script:
>>>>
>>>> #--------------------------------------------------------------------------------------
>>>> import os
>>>> from os.path import normpath
>>>> import sys
>>>> import shutil
>>>> import csv
>>>>
>>>> import rpy2.robjects as robjects
>>>>
>>>> #
>>>> r = robjects.r
>>>> path = os.getcwd()
>>>> print "curdir = %s" %path
>>>>
>>>> #
>>>> oldPath = os.environ['PATH'].split(os.pathsep)
>>>> newPath = os.environ['PATH'].split(os.pathsep)
>>>> newPath = os.pathsep.join(newPath[len(oldPath):] + newPath[:len(oldPath)])
>>>>
>>>> #
>>>> # Load the R functions
>>>> #
>>>> print "load the R functions"
>>>> r.load('/zeus/d2/users/hunter/lbrm/kriging/progs/do_krig_batch.RData')
>>>>
>>>> #
>>>> # Force stdout into an unbuffered mode
>>>> #
>>>> sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
>>>>
>>>> Any advice, clues, hints, moral support, etc.... welcomed and appreciated.
>>>>
>>>> Thanks,
>>>>
>>>> Christina
>>>>
>>>> --
>>>> Christina A. Salls
>>>> GLERL Computer Group
>>>> help.glerl at noaa.gov
>>>> Help Desk x2127
>>>> Christina.Salls at noaa.gov
>>>> Voice Mail 734-741-2446
>>>>
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>
>>>
>>> --
>>> Dr. Axel Kohlmeyer akohlmey at gmail.com
>>> http://sites.google.com/site/akohlmey/
>>>
>>> Institute for Computational Molecular Science Temple University,
>>> Philadelphia PA, USA.
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> Notice: This e-mail message, together with any attachments, contains
> information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
> New Jersey, USA 08889), and/or its affiliates Direct contact information
> for affiliates is available at
> http://www.merck.com/contact/contacts.html) that may be confidential,
> proprietary copyrighted and/or legally privileged. It is intended solely
> for the use of the individual or entity named on this message. If you are
> not the intended recipient, and have received this message in error,
> please notify us immediately by reply e-mail and then delete it from
> your system.
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From christina.salls at noaa.gov Thu May 3 11:52:19 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Thu, 3 May 2012 13:52:19 -0400
Subject: [torqueusers] python in qsub
In-Reply-To: <4FA2BF9A.9080707@ldeo.columbia.edu>
References: <242421BFAF465844BE24EB90BB97E221090905F3@ITSDAG1D.its.iastate.edu> <4FA1B722.2090606@ldeo.columbia.edu> <4FA2BF9A.9080707@ldeo.columbia.edu>
Message-ID:

On Thu, May 3, 2012 at 1:25 PM, Gus Correa wrote:

> Hi Christina
>
> I don't know if I am on target with my answer.
> If I missed the point, forgive me please.
>
> 1) Two alternative ways to make software available on
> all nodes [R, R Python modules, Matlab, MPI, you name them]
> are:
>
> A) Install on every node [say, via RPM, apt-get, whatever]
> B) Install on a directory that is shared by all nodes [typically
> a directory on the head node or on a storage node,
> NFS mounted on all nodes].
> If you choose A), you need to install everywhere. B) is less
> scalable, but is OK for clusters that are not too big [like yours].
>
> I use B) for everything that needs installation from source code,
> including MPI, in small clusters.
>

You are exactly on target, and I thank you for your response. I have been
thinking that B) is the answer. I just haven't made the change yet because,
unfortunately, when I got Torque working, I used the interactive script to
install and configure mom on the compute nodes, and I will have to repeat
that procedure every time I provision the nodes. My bad. The head node
already has directories mounted on the compute nodes, /intel and /data1.
R lives in /usr/lib64/R/bin. I have to figure out how to either create a
link in /data1 and/or create an alias in .bashrc or .tcshrc to point to the
head node (point to the link)? I am still unclear about that.

> A) is harder to maintain 'by hand' but it is feasible.
> Alternatively, you may want to elect a node to
> be your 'master' node, install stuff there, and then propagate
> that node image to the other nodes.
> How you propagate may depend very much on the software you use
> to manage the cluster. For instance, Rocks Clusters have their
> own protocol to do these things, xCat has a different way, etc.
>
> **
>
> Interesting that you seem to have some SGI software in there.
> If I remember right mpt is the SGI stuff for their MPI version.
> Is this an SGI cluster [or was it in the past]?
>

This is an SGI cluster. (I am an ex-SGI employee, actually.)

> **
>
> I agree with you.
> I can't see any smoking gun in the environment variables.
> But if you didn't install the R and rpy2 RPMs on the nodes,
> that is most likely the problem.
>
> In any case, I strongly suggest that you keep the
> nodes homogeneous. The only exception is the head node at most.
> What you install in one, install in all of them, regardless of
> how you do it.
> Clusters with heterogeneous software are hard to manage.
>
>
> I hope this helps,
> Gus Correa
>
> On 05/03/2012 07:49 AM, Christina Salls wrote:
> >
> > On Wed, May 2, 2012 at 6:37 PM, Gus Correa wrote:
> >
> > Hi Christina
> >
> > This may not fix the problem, but this line on
> > the job script surprised me:
> >
> > #PBS -l select=1:ncpus=1
> >
> > Why not
> >
> > #PBS -l nodes=1:ppn=1
> >
> > [assuming this is a serial/non-parallel job]?
> >
> > As Axel and James suggested, there may be an issue with the
> > user environment.
> > It may be useful to do 'env' at the head node command prompt,
> > and also inside the job script, to see if they match.
> > If something important is missing,
> > a simple fix is to add all needed
> > environment variables to the user's
> > .profile/.tcshrc file, but there are other ways
> > [like environment modules].
> > Assuming the home directory is shared on the nodes,
> > these files should be sourced when the job is launched.
> > Or to use qsub -V as James suggested.
> >
> >
> > Hi Gus,
> >
> > Thanks for the response. We did try the qsub -V to no avail. The
> > first thing that I did was to have him include a printenv in the script
> > and also run it outside of the script. If the answer was in the
> > environment variables it was not obvious to me. The major difference
> > was the mpt module that was loaded in the user environment. I am
> > including the output from those commands below. Since python is an
> > interactive program, I am wondering if the same R and rpy2 rpms need to
> > be installed on the compute node? I am thinking of getting mom running
> > on the head node.
> > If he can submit jobs to the head node through
> > torque, that may point to differences between the head node and compute
> > nodes. At any rate, these are the printenv outputs.
> >
> > Interactive
> > MKLROOT=/opt/intel/composer_xe_2011_sp1.7.256/mkl
> > MANPATH=/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/opt/sgi/mpt/mpt-2.04/man:/usr/man:/usr/share/catman:/usr/share/man:/usr/catman:
> > HOSTNAME=wings.glerl.noaa.gov
> > SELINUX_ROLE_REQUESTED=
> > INTEL_LICENSE_FILE=/opt/intel/composer_xe_2011_sp1.7.256/licenses:/opt/intel/licenses:/zeus/d2/users/hunter/intel/licenses
> > TERM=xterm
> > SHELL=/bin/bash
> > HISTSIZE=1000
> > SSH_CLIENT=192.94.173.186 3021 22
> > LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/opt/sgi/mpt/mpt-2.04/lib
> > SELINUX_USE_CURRENT_RANGE=
> > FPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include:/opt/sgi/mpt/mpt-2.04/include
> > QTDIR=/usr/lib64/qt-3.3
> > QTINC=/usr/lib64/qt-3.3/include
> > DAT_OVERRIDE=/etc/dat.conf
> > SSH_TTY=/dev/pts/27
> > USER=hunter
> > LD_LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/opt/sgi/mpt/mpt-2.04/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/opt/intel/composer_xe_2011_sp1.7.256/debugger/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/lib/intel64:/opt/sgi/sgimc/lib
> > LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
> > CPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include:/opt/sgi/mpt/mpt-2.04/include
> > NLSPATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/debugger/intel64/locale/%l_%t/%N
> > MAIL=/var/spool/mail/hunter
> > PATH=/opt/intel/composer_xe_2011_sp1.7.256/bin/intel64:/opt/sgi/mpt/mpt-2.04/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/sgi/sgimc/bin:/opt/sgi/sbin:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/bin/intel64:/zeus/d2/users/hunter/bin
> > I_MPI_RDMA_CREATE_CONN_QUAL=0
> > PWD=/zeus/d2/users/hunter/lbrm/kriging/batch
> > _LMFILES_=/usr/share/Modules/modulefiles/mpt/2.04
> > LANG=en_US.UTF-8
> > MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
> > LOADEDMODULES=mpt/2.04
> > MGR_HOME=/opt/sgi/sgimc
> > SELINUX_LEVEL_REQUESTED=
> > SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
> > HISTCONTROL=ignoredups
> > SHLVL=1
> > HOME=/zeus/d2/users/hunter
> > MPI_ROOT=/opt/sgi/mpt/mpt-2.04
> > LOGNAME=hunter
> > QTLIB=/usr/lib64/qt-3.3/lib
> > CVS_RSH=ssh
> > SSH_CONNECTION=192.94.173.186 3021 192.94.173.9 22
> > MODULESHOME=/usr/share/Modules
> > LESSOPEN=|/usr/bin/lesspipe.sh %s
> > INCLUDE=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include
> > G_BROKEN_FILENAMES=1
> > module=() { eval `/usr/bin/modulecmd bash $*`
> > }
> > _=/usr/bin/printenv
> > OLDPWD=/zeus/d2/users/hunter
> >
> > Within torque
> >
> > MKLROOT=/opt/intel/composer_xe_2011_sp1.7.256/mkl
> > MANPATH=/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/usr/share/man/overrides:/usr/share/man/en:/usr/share/man:/usr/local/share/man:/usr/local/man:
> > HOSTNAME=n020
> > PBS_VERSION=TORQUE-2.5.9
> > INTEL_LICENSE_FILE=/opt/intel/composer_xe_2011_sp1.7.256/licenses:/opt/intel/licenses:/zeus/d2/users/hunter/intel/licenses
> > SHELL=/bin/bash
> > HISTSIZE=1000
> > PBS_JOBNAME=krigtest
> > LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64
> > FPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include
> > PBS_ENVIRONMENT=PBS_BATCH
> > OLDPWD=/zeus/d2/users/hunter
> > QTDIR=/usr/lib64/qt-3.3
> > QTINC=/usr/lib64/qt-3.3/include
> > PBS_O_WORKDIR=/zeus/d2/users/hunter/lbrm/kriging/batch
> > DAT_OVERRIDE=/etc/dat.conf
> > USER=hunter
> > PBS_TASKNUM=1
> > LD_LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/opt/intel/composer_xe_2011_sp1.7.256/debugger/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/lib/intel64
> > PBS_O_HOME=/zeus/d2/users/hunter
> > CPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include
> > PBS_MOMPORT=15003
> > PBS_GPUFILE=/var/spool/torque/aux//785.admin.default.domaingpu
> > PBS_O_QUEUE=hydrology
> > NLSPATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/debugger/intel64/locale/%l_%t/%N
> > PATH=/opt/intel/composer_xe_2011_sp1.7.256/bin/intel64:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/sgi/sgimc/bin:/zeus/d2/users/hunter/bin:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/bin/intel64
> > PBS_O_LOGNAME=hunter
> > MAIL=/var/spool/mail/hunter
> > PBS_O_LANG=en_US.UTF-8
> > PBS_JOBCOOKIE=153B7A805713EAF7B67869983204CFBD
> > I_MPI_RDMA_CREATE_CONN_QUAL=0
> > PWD=/zeus/d2/users/hunter/lbrm/kriging/batch
> > LANG=en_US.UTF-8
> > PBS_NODENUM=0
> > MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
> > PBS_NUM_NODES=1
> > LOADEDMODULES=
> > PBS_O_SHELL=/bin/bash
> > MGR_HOME=/opt/sgi/sgimc
> > PBS_SERVER=admin.default.domain
> > PBS_JOBID=785.admin.default.domain
> > ENVIRONMENT=BATCH
> > HISTCONTROL=ignoredups
> > HOME=/zeus/d2/users/hunter
> > SHLVL=2
> > PBS_O_HOST=admin.default.domain
> > PBS_VNODENUM=0
> > LOGNAME=hunter
> > CVS_RSH=ssh
> > QTLIB=/usr/lib64/qt-3.3/lib
> > PBS_QUEUE=hydrology
> > MODULESHOME=/usr/share/Modules
> > PBS_O_MAIL=/var/spool/mail/hunter
> > LESSOPEN=|/usr/bin/lesspipe.sh %s
> > PBS_NP=1
> > PBS_NUM_PPN=1
> > INCLUDE=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include
> > PBS_NODEFILE=/var/spool/torque/aux//785.admin.default.domain
> > G_BROKEN_FILENAMES=1
> > PBS_O_PATH=/opt/intel/composer_xe_2011_sp1.7.256/bin/intel64:/opt/sgi/mpt/mpt-2.04/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/sgi/sgimc/bin:/opt/sgi/sbin:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/bin/intel64:/zeus/d2/users/hunter/bin
> > module=() { eval `/usr/bin/modulecmd bash $*`
> > }
> > _=/usr/bin/printenv
> >
> >
> > Gus Correa
> >
> >
> > On 05/02/2012 11:28 AM, Coyle, James J [ITACD] wrote:
> > > Christina,
> > >
> > > I am going to assume that by interactively, you mean
> > > just by running the script or the command rather than
> > > using qsub -I
> > > If it is the latter, the comments below will not apply.
> > >
> > >
> > > Assuming that your site's policies allow you to run as the
> > > user in question: (Mine do as long as I am acting on a
> > > request for help from the user, though I always ask if
> > > it is OK first.)
> > >
> > > Try ssh'ing into the compute node that had the problem
> > > as the user in question, then run interactively there; you should
> > > see the same problem.
> > >
> > > As root, you can find the node that ran the job by issuing
> > > tracejob -n 7 786
> > > if the job was run within the last 7 days. The node
> > > on which the job launched would be the first node
> > > in the nodelist.
> > >
> > > Also as root on the head node, you can issue
> > > su - tim.hunter
> > >
> > > to become the user on the head node, then ssh into the node in
> > > question as the user, and try the script interactively.
> > >
> > >
> > > As Axel said, it could be a ldconfig or LD_LIBRARY_PATH
> > > issue, or you could have variables set in the /etc/profile.d
> > > scripts or /etc/login.csh if the user is a csh user.
> > > (Is the home directory shared so that any rc files are
> > > run the same on all nodes?)
> > > > > > As a workaround, you could see if qsub -V works since that > > > will pass the environment. If it does, then some env variables > > are being set by the user or by a script that > > > runs only on the head node. > > > > > > > > > > > > > > > > > > Also, is /bin/sh the user's normal login shell? > > > > > >> -----Original Message----- > > >> From: torqueusers-bounces at supercluster.org > > [mailto:torqueusers- > > > > >> bounces at supercluster.org ] On > > Behalf Of Axel Kohlmeyer > > >> Sent: Wednesday, May 02, 2012 9:22 AM > > >> To: Torque Users Mailing List > > >> Subject: Re: [torqueusers] python in qsub > > >> > > >> christina, > > >> > > >> have you checked whether the problem isn't with R instead of > python? > > >> although the PYTHONPATH environment variable would be the primary > > >> suspect, but it could even by LD_LIBRARY_PATH. python is a bit > > >> strange in reporting certain kinds of error. > > >> > > >> axel. > > >> > > >> On Tue, May 1, 2012 at 1:49 PM, Christina Salls > > >> > > wrote: > > >>> Hi all, > > >>> > > >>> My configuration consists of Torque 2.5.9 running on RHEL > 6 > > >> on a > > >>> single cluster. I have a user that is able to interactively run > > >>> python and R, but when his job is submitted to Torque, it fails > > >> with this message: > > >>> > > >>> Traceback (most recent call last): > > >>> > > >>> File "../progs/Kriging_Batch.py", line 110, in > > >>> > > >>> import rpy2.robjects as robjects > > >>> > > >>> ImportError: No module named rpy2.robjects > > >>> > > >>> > > >>> I asked him to include a printenv in his script and compared it > to > > >> the > > >>> environment variables that are inherent in his login. I could > not > > >>> tell from the comparison what might be causing this problem. 
> > >>> > > >>> > > >>> His script looks like this: > > >>> > > >>> > > >>> [root at zeus batch]# more runkrig.sh > > >>> > > >>> #!/bin/sh > > >>> > > >>> #submit with qsub runkrig.sh > > >>> > > >>> #PBS -N GaugesJob > > >>> > > >>> #PBS -l select=1:ncpus=1 > > >>> > > >>> #PBS -q hydrology > > >>> > > >>> #PBS -M tim.hunter at noaa.gov > > >>> > > >>> source /opt/intel/composer_xe_2011_sp1.7.256/bin/compilervars.sh > > >>> intel64 > > >>> > > >>> > > >>> > > >>> cd /zeus/d2/users/hunter/lbrm/kriging/batch > > >>> > > >>> > > >>> > > >>> echo 'Begin Kriging of Met Data with python and R' > > >>> > > >>> date > > >>> > > >>> python ../progs/Kriging_Batch.py $1 $2 $3 > > >>> > > >>> echo 'Kriging operations complete!' > > >>> > > >>> date > > >>> > > >>> > > >>> The output file looks like this: > > >>> > > >>> > > >>> [root at zeus batch]# more GaugesJob.o786 > > >>> > > >>> Begin Kriging of Met Data with python and R > > >>> > > >>> Fri Apr 27 13:44:42 CDT 2012 > > >>> > > >>> Kriging operations complete! > > >>> > > >>> Fri Apr 27 13:44:42 CDT 2012 > > >>> > > >>> > > >>> > > >>> The Kriging_Batch.py script works just fine interactively. If I > > >> run > > >>> the import command interactively, it also works. > > >>> > > >>> > > >>> eg. > > >>> > > >>> python -c "import rpy2.robjects as robjects" > > >>> > > >>> > > >>> I am sure there is a simple explanation, and if any of you have > > >> any > > >>> clues to lead me in the right direction, I would greatly > > >> appreciate it. > > >>> > > >>> > > >>> Even in the torque environment, the other python imports are > > >> working > > >>> properly. It only seems to be choking on the rpy2 import. 
> > >>> > > >>> > > >>> This is the one portion of the script > > >>> > > >>> > > >>> > #----------------------------------------------------------------- > > >> ---- > > >>> ----------------- > > >>> > > >>> import os > > >>> > > >>> from os.path import normpath > > >>> > > >>> import sys > > >>> > > >>> import shutil > > >>> > > >>> import csv > > >>> > > >>> > > >>> import rpy2.objects as robjects > > >>> > > >>> > > >>> # > > >>> > > >>> r = robjects.r > > >>> > > >>> path = os.getcwd() > > >>> > > >>> print "curdir = %s" %path > > >>> > > >>> > > >>> # > > >>> > > >>> oldPath = os.environ['PATH'].split(os.pathsep) > > >>> > > >>> newPath = os.environ['PATH'].split(os.pathsep) > > >>> > > >>> newPath = os.pathsep.join(newPath[len(oldPath):] + > > >>> newPath[:len(oldPath)]) > > >>> > > >>> > > >>> # > > >>> > > >>> # Load the R functions > > >>> > > >>> # > > >>> > > >>> print "load the R functions" > > >>> > > >>> > > >> > r.load('/zeus/d2/users/hunter/lbrm/kriging/progs/do_krig_batch.RData > > >> ') > > >>> > > >>> > > >>> # > > >>> > > >>> # Force stdout into an unbuffered mode > > >>> > > >>> # > > >>> > > >>> sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) > > >>> > > >>> > > >>> > > >>> Any advice, clues, hints, moral support, etc.... welcomed and > > >> appreciated. > > >>> > > >>> > > >>> Thanks, > > >>> > > >>> > > >>> Christina > > >>> > > >>> > > >>> > > >>> > > >>> -- > > >>> Christina A. Salls > > >>> GLERL Computer Group > > >>> help.glerl at noaa.gov > > >>> Help Desk x2127 > > >>> Christina.Salls at noaa.gov > > >>> Voice Mail 734-741-2446 > > >>> > > >>> > > >>> > > >>> _______________________________________________ > > >>> torqueusers mailing list > > >>> torqueusers at supercluster.org torqueusers at supercluster.org> > > >>> http://www.supercluster.org/mailman/listinfo/torqueusers > > >>> > > >> > > >> > > >> > > >> -- > > >> Dr. 
Axel Kohlmeyer akohlmey at gmail.com
> > >> http://sites.google.com/site/akohlmey/
> > >>
> > >> Institute for Computational Molecular Science Temple University,
> > >> Philadelphia PA, USA.
> > >> _______________________________________________
> > >> torqueusers mailing list
> > >> torqueusers at supercluster.org
> > >> http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > --
> > Christina A. Salls
> > GLERL Computer Group
> > help.glerl at noaa.gov
> > Help Desk x2127
> > Christina.Salls at noaa.gov
> > Voice Mail 734-741-2446
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120503/f02b2640/attachment-0001.html

From christina.salls at noaa.gov Thu May 3 11:54:56 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Thu, 3 May 2012 13:54:56 -0400
Subject: [torqueusers] python in qsub
In-Reply-To: <4FA2C1EA.20904@ldeo.columbia.edu>
References: <242421BFAF465844BE24EB90BB97E221090905F3@ITSDAG1D.its.iastate.edu> <4FA1B722.2090606@ldeo.columbia.edu> <4FA2C1EA.20904@ldeo.columbia.edu>
Message-ID:

On Thu, May 3, 2012 at 1:35 PM, Gus Correa wrote:

> Hi Christina
>
> As Richard suggests,
> your user seems to have copied this script over from another
> cluster with another resource manager / queue system,
> maybe PBSPro, not Torque.
> It happens all the time.
>

I am not sure where the syntax came from, but it is easy enough to change.
I will try the suggested syntax for the -l line, but unfortunately I don't
expect it to resolve this particular issue.

> Gus Correa
>
>
> On 05/03/2012 09:49 AM, Bach, Richard A wrote:
> > This looks like a line used for PBSPro and not Torque. Is this a Torque
> > environment?
> >
> >
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Gus Correa
> > Sent: Wednesday, May 02, 2012 6:37 PM
> > To: Torque Users Mailing List
> > Subject: Re: [torqueusers] python in qsub
> >
> > Hi Christina
> >
> > This may not fix the problem, but this line on
> > the job script surprised me:
> >
> > #PBS -l select=1:ncpus=1
> >
> > Why not
> >
> > #PBS -l nodes=1:ppn=1
> >
> > [assuming this is a serial/non-parallel job]?
> >
> > As Axel and James suggested, there may be an issue with the
> > user environment.
> > It may be useful to do 'env' at the head node command prompt,
> > and also inside the job script, to see if they match.
> > If something important is missing, > > a simple fix is to add all needed > > environment variables to the user's > > .profile/.tcshrc file, but there are other ways > > [like environment modules]. > > Assuming the home directory is shared on the nodes, > > these files should be sourced when the job is launched. > > Or to use qsub -V as James suggested. > > > > Gus Correa > > > > > > On 05/02/2012 11:28 AM, Coyle, James J [ITACD] wrote: > >> Christina, > >> > >> I am going to assume that by interactively, you mean > >> just by running the script or the command rather than > >> using qsub -I > >> If it is the latter, the comments below will not apply. > >> > >> > >> Assuming that your site's policies allow you to run as the user in > question: (Mine do as long as I am acting on a > >> request for help from the user, though I always ask if > >> it OK first.) > >> > >> Try ssh'ing into the compute node that had the problem > >> as the user in question, then run interactively there, you should see > the same problem. > >> > >> As root, you can find the node that ran the job by issuing > >> tracejob -n 7 786 > >> if the job was run with the last 7 days. The node > >> on which the job launched would be the first node > >> in the nodelist. > >> > >> Also as root on the head node, you can issue > >> su - tim.hunter > >> > >> to become the user on the head node, then ssh into the node in question > as the user, and try the script interactively. > >> > >> > >> As Alex said, it could be a ldconfig or LD_LIBRARY_PATH > >> issue, or you could have variables set in the /etc/profile.d > >> scripts or /etc/login.csh of the user is a csh user. > >> (Is the home directory shared so that any rc files are > >> run the same on all nodes.) > >> > >> As a workaround, you could see if qsub -V works since that > >> will pass the environment. If it does, then some env variables are > being set by the user or by a script that > >> runs only on the head node. 
> >> > >> > >> > >> > >> > >> Also, is /bin/sh the user's normal login shell? > >> > >>> -----Original Message----- > >>> From: torqueusers-bounces at supercluster.org [mailto:torqueusers- > >>> bounces at supercluster.org] On Behalf Of Axel Kohlmeyer > >>> Sent: Wednesday, May 02, 2012 9:22 AM > >>> To: Torque Users Mailing List > >>> Subject: Re: [torqueusers] python in qsub > >>> > >>> christina, > >>> > >>> have you checked whether the problem isn't with R instead of python? > >>> although the PYTHONPATH environment variable would be the primary > >>> suspect, but it could even by LD_LIBRARY_PATH. python is a bit > >>> strange in reporting certain kinds of error. > >>> > >>> axel. > >>> > >>> On Tue, May 1, 2012 at 1:49 PM, Christina Salls > >>> wrote: > >>>> Hi all, > >>>> > >>>> My configuration consists of Torque 2.5.9 running on RHEL 6 > >>> on a > >>>> single cluster. I have a user that is able to interactively run > >>>> python and R, but when his job is submitted to Torque, it fails > >>> with this message: > >>>> > >>>> Traceback (most recent call last): > >>>> > >>>> File "../progs/Kriging_Batch.py", line 110, in > >>>> > >>>> import rpy2.robjects as robjects > >>>> > >>>> ImportError: No module named rpy2.robjects > >>>> > >>>> > >>>> I asked him to include a printenv in his script and compared it to > >>> the > >>>> environment variables that are inherent in his login. I could not > >>>> tell from the comparison what might be causing this problem. 
> >>>> > >>>> > >>>> His script looks like this: > >>>> > >>>> > >>>> [root at zeus batch]# more runkrig.sh > >>>> > >>>> #!/bin/sh > >>>> > >>>> #submit with qsub runkrig.sh > >>>> > >>>> #PBS -N GaugesJob > >>>> > >>>> #PBS -l select=1:ncpus=1 > >>>> > >>>> #PBS -q hydrology > >>>> > >>>> #PBS -M tim.hunter at noaa.gov > >>>> > >>>> source /opt/intel/composer_xe_2011_sp1.7.256/bin/compilervars.sh > >>>> intel64 > >>>> > >>>> > >>>> > >>>> cd /zeus/d2/users/hunter/lbrm/kriging/batch > >>>> > >>>> > >>>> > >>>> echo 'Begin Kriging of Met Data with python and R' > >>>> > >>>> date > >>>> > >>>> python ../progs/Kriging_Batch.py $1 $2 $3 > >>>> > >>>> echo 'Kriging operations complete!' > >>>> > >>>> date > >>>> > >>>> > >>>> The output file looks like this: > >>>> > >>>> > >>>> [root at zeus batch]# more GaugesJob.o786 > >>>> > >>>> Begin Kriging of Met Data with python and R > >>>> > >>>> Fri Apr 27 13:44:42 CDT 2012 > >>>> > >>>> Kriging operations complete! > >>>> > >>>> Fri Apr 27 13:44:42 CDT 2012 > >>>> > >>>> > >>>> > >>>> The Kriging_Batch.py script works just fine interactively. If I > >>> run > >>>> the import command interactively, it also works. > >>>> > >>>> > >>>> eg. > >>>> > >>>> python -c "import rpy2.robjects as robjects" > >>>> > >>>> > >>>> I am sure there is a simple explanation, and if any of you have > >>> any > >>>> clues to lead me in the right direction, I would greatly > >>> appreciate it. > >>>> > >>>> > >>>> Even in the torque environment, the other python imports are > >>> working > >>>> properly. It only seems to be choking on the rpy2 import. 
> >>>> > >>>> > >>>> This is the one portion of the script > >>>> > >>>> > >>>> #----------------------------------------------------------------- > >>> ---- > >>>> ----------------- > >>>> > >>>> import os > >>>> > >>>> from os.path import normpath > >>>> > >>>> import sys > >>>> > >>>> import shutil > >>>> > >>>> import csv > >>>> > >>>> > >>>> import rpy2.objects as robjects > >>>> > >>>> > >>>> # > >>>> > >>>> r = robjects.r > >>>> > >>>> path = os.getcwd() > >>>> > >>>> print "curdir = %s" %path > >>>> > >>>> > >>>> # > >>>> > >>>> oldPath = os.environ['PATH'].split(os.pathsep) > >>>> > >>>> newPath = os.environ['PATH'].split(os.pathsep) > >>>> > >>>> newPath = os.pathsep.join(newPath[len(oldPath):] + > >>>> newPath[:len(oldPath)]) > >>>> > >>>> > >>>> # > >>>> > >>>> # Load the R functions > >>>> > >>>> # > >>>> > >>>> print "load the R functions" > >>>> > >>>> > >>> r.load('/zeus/d2/users/hunter/lbrm/kriging/progs/do_krig_batch.RData > >>> ') > >>>> > >>>> > >>>> # > >>>> > >>>> # Force stdout into an unbuffered mode > >>>> > >>>> # > >>>> > >>>> sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) > >>>> > >>>> > >>>> > >>>> Any advice, clues, hints, moral support, etc.... welcomed and > >>> appreciated. > >>>> > >>>> > >>>> Thanks, > >>>> > >>>> > >>>> Christina > >>>> > >>>> > >>>> > >>>> > >>>> -- > >>>> Christina A. Salls > >>>> GLERL Computer Group > >>>> help.glerl at noaa.gov > >>>> Help Desk x2127 > >>>> Christina.Salls at noaa.gov > >>>> Voice Mail 734-741-2446 > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> torqueusers mailing list > >>>> torqueusers at supercluster.org > >>>> http://www.supercluster.org/mailman/listinfo/torqueusers > >>>> > >>> > >>> > >>> > >>> -- > >>> Dr. Axel Kohlmeyer akohlmey at gmail.com > >>> http://sites.google.com/site/akohlmey/ > >>> > >>> Institute for Computational Molecular Science Temple University, > >>> Philadelphia PA, USA. 
> >>> _______________________________________________ > >>> torqueusers mailing list > >>> torqueusers at supercluster.org > >>> http://www.supercluster.org/mailman/listinfo/torqueusers > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > Notice: This e-mail message, together with any attachments, contains > > information of Merck& Co., Inc. (One Merck Drive, Whitehouse Station, > > New Jersey, USA 08889), and/or its affiliates Direct contact information > > for affiliates is available at > > http://www.merck.com/contact/contacts.html) that may be confidential, > > proprietary copyrighted and/or legally privileged. It is intended solely > > for the use of the individual or entity named on this message. If you are > > not the intended recipient, and have received this message in error, > > please notify us immediately by reply e-mail and then delete it from > > your system. > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120503/254af691/attachment-0001.html

From daniel at dep.fem.unicamp.br Thu May 3 12:09:49 2012
From: daniel at dep.fem.unicamp.br (Daniel Lopes de Carvalho)
Date: Thu, 03 May 2012 15:09:49 -0300
Subject: [torqueusers] qsub email notifications.
Message-ID: <4FA2C9ED.60407@dep.fem.unicamp.br>

Hello,

I have TORQUE/MAUI installed on a cluster and it works normally. However, when I submit a job with the command

qsub -m abe -M user at domain...

I don't get the e-mail notifications. All machines in the cluster can send e-mail, and I couldn't find any sendmail attempts in the log files.

Can anyone help me?

Thank you
Daniel

--
Daniel Lopes de Carvalho
daniel at dep.fem.unicamp.br
http://www.unisim.cepetro.unicamp.br
19 3521-1221

From gus at ldeo.columbia.edu Thu May 3 13:03:07 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Thu, 03 May 2012 15:03:07 -0400
Subject: [torqueusers] python in qsub
In-Reply-To:
References: <242421BFAF465844BE24EB90BB97E221090905F3@ITSDAG1D.its.iastate.edu> <4FA1B722.2090606@ldeo.columbia.edu> <4FA2BF9A.9080707@ldeo.columbia.edu>
Message-ID: <4FA2D66B.2010402@ldeo.columbia.edu>

Hi Christina

See comments below, please.

On 05/03/2012 01:52 PM, Christina Salls wrote:
>
>
> On Thu, May 3, 2012 at 1:25 PM, Gus Correa wrote:
>
> Hi Christina
>
> I don't know if I am on target with my answer.
> If I missed the point, forgive me please.
>
> 1) Two alternative ways to make software available on
> all nodes [R, R Python modules, Matlab, MPI, you name them]
> are:
>
> A) Install on every node [say, via RPM, apt-get, whatever]
> B) Install in a directory that is shared by all nodes [typically
> a directory on the head node or on a storage node,
> NFS mounted on all nodes].
> If you choose A), you need to install everywhere. B) is less
> scalable, but is OK for clusters that are not too big [like yours].
>
> I use B) for everything that needs installation from source code,
> including MPI, in small clusters.
>
> You are exactly on target, and I thank you for your response. I have
> been thinking that B) is the answer. I just haven't made the change yet
> because unfortunately, when I got torque working, I used the interactive
> script to install and configure mom on the compute nodes and will have
> to repeat that procedure every time I provision the nodes. My bad.

Not sure if I understand how this relates to the problem with R and rpy2.

> The
> head node does have directories mounted on the compute nodes already,
> /intel and /data1.

Does it share /home also?
Sharing /home makes life easier.

> R lives in /usr/lib64/R/bin. I have to figure out
> how to either create a link in /data1 and/or create an alias in .bash or
> .tcsh to point to the headnode (point to the link)? I am still unclear
> about that.

It will be hard to make such links work, as /usr/lib64 already exists on each node.

This is becoming more of a cluster administration than a Torque discussion, more appropriate for the Beowulf or Rocks Cluster mailing lists, but the Torque list may forgive us a bit.

Besides the soft link spaghetti problem above, the downside of using method B) with RPM-based software is the dependencies amongst the various RPMs. I bet the executables in /usr/lib64/R/bin depend on a bunch of shared libraries from other packages, as ldd may tell you. If so, it may not be viable to solve the problem via links.

You may need to install the RPMs cluster-wide as in method A). That is where some sort of cluster management software helps, to maintain a homogeneous image of all compute nodes. But you can still install the RPMs manually anyway, and keep a record of what you do, with an eye on consistency.
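[A minimal sketch, not taken from the thread, of one way to confirm whether the Python packages are actually present on a given node. The file name check_imports.py and the helper names check_modules and report are hypothetical; only the module name rpy2.robjects comes from the traceback quoted in this thread. The idea is to run the same script once at the head node prompt and once from inside a job script, then compare the two reports.]

```python
from __future__ import print_function  # lets the same file run under Python 2 or 3

import os
import socket
import sys


def check_modules(names):
    """Try to import each module; map its name to None on success or to an error string."""
    results = {}
    for name in names:
        try:
            __import__(name)
            results[name] = None
        except Exception as exc:  # ImportError, or e.g. a missing shared library
            results[name] = "%s: %s" % (type(exc).__name__, exc)
    return results


def report(names):
    """Build a plain-text report describing this host's Python and the import results."""
    lines = [
        "host: %s" % socket.gethostname(),
        "python: %s" % sys.executable,
        "PYTHONPATH: %s" % os.environ.get("PYTHONPATH", "(unset)"),
    ]
    for name, err in sorted(check_modules(names).items()):
        status = "ok" if err is None else "FAILED (%s)" % err
        lines.append("%-20s %s" % (name, status))
    return "\n".join(lines)


if __name__ == "__main__":
    # Defaults to the module from the traceback; other names can be passed as arguments.
    print(report(sys.argv[1:] or ["rpy2.robjects"]))
```

[If the report says "ok" on the head node but "FAILED" from within the job, the package is most likely not installed on, or not visible from, the compute node, which would point back to the choice between methods A) and B).]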
However, for software installed from source or from commercial packages [I guess even R can be installed this way], method B) saves time, keeps the cluster consistent, and enables all nodes to use that software. I do this with compilers, MPI, Matlab, etc, etc, and that is what most people with small clusters [and one-person teams busy with other things] do. I hope this helps, Gus Correa > > A) is harder to maintain 'by hand' but it is feasible. > Alternatively, you may want to elect a node to > be your 'master' node, install stuff there, and then propagate > that node image to the other nodes. > How you propagate may depend very much on the software you use > to manage the cluster. For instance, Rocks Clusters have their > own protocol to do these things, xCat has a different way, etc. > > ** > > Interesting that you seem to have some SGI software in there. > If I remember right mpt is the SGI stuff for their MPI version. > Is this an SGI cluster [or was it in the past]? > > This is an SGI cluster. ( I am an ex-sgi employee actually) > > ** > > I agree with you. > I can't see any smoking gun in the environment variables. > But if you didn't install the R and rpy2 RPMs on the nodes, > that is most likely the problem. > > In any case, I strongly suggest that you keep the > nodes homogeneous. The only exception is the head node at most. > What you install in one, install in all of them, regardless of > how you do it. > Clusters with heterogeneous software are hard to manage. > > > I hope this helps, > Gus Correa > > On 05/03/2012 07:49 AM, Christina Salls wrote: > > > > > > On Wed, May 2, 2012 at 6:37 PM, Gus Correa > > >> wrote: > > > > Hi Christina > > > > This may not fix the problem, but this line on > > the job script surprised me: > > > > #PBS -l select=1:ncpus=1 > > > > Why not > > > > #PBS -l nodes=1:ppn=1 > > > > [assuming this is a serial/non-parallel job]? > > > > As Axel and James suggested, there may be an issue with the > > user environment. 
> > It may be useful to do 'env' at the head node command prompt, > > and also inside the job script, to see if they match. > > If something important is missing, > > a simple fix is to add all needed > > environment variables to the user's > > .profile/.tcshrc file, but there are other ways > > [like environment modules]. > > Assuming the home directory is shared on the nodes, > > these files should be sourced when the job is launched. > > Or to use qsub -V as James suggested. > > > > > > Hi Gus, > > > > Thanks for the response. We did try the qsub -V to no > avail. The > > first thing that I did was to have him include a printenv in the > script > > and also run it outside of the script. If the answer was in the > > environment variables it was not obvious to me. The major difference > > was the mpt module that was loaded in the user environment. I am > > including the output from those commands below. Since python is an > > interactive program, I am wondering if the same R and rpy2 rpms > need to > > be installed on the compute node? I am thinking of getting mom > running > > on the head node. If he can submit jobs to the head node through > > torque, that may point to differences between the head node and > compute > > nodes. At any rate, these are the printenv outputs. 
> > > > Interactive > > MKLROOT=/opt/intel/composer_xe_2011_sp1.7.256/mkl > > > MANPATH=/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/opt/sgi/mpt/mpt-2.04/man:/usr/man:/usr/share/catman:/usr/share/man:/usr/catman: > > HOSTNAME=wings.glerl.noaa.gov > > > SELINUX_ROLE_REQUESTED= > > > INTEL_LICENSE_FILE=/opt/intel/composer_xe_2011_sp1.7.256/licenses:/opt/intel/licenses:/zeus/d2/users/hunter/intel/licenses > > TERM=xterm > > SHELL=/bin/bash > > HISTSIZE=1000 > > SSH_CLIENT=192.94.173.186 3021 22 > > > LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/opt/sgi/mpt/mpt-2.04/lib > > SELINUX_USE_CURRENT_RANGE= > > > FPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include:/opt/sgi/mpt/mpt-2.04/include > > QTDIR=/usr/lib64/qt-3.3 > > QTINC=/usr/lib64/qt-3.3/include > > DAT_OVERRIDE=/etc/dat.conf > > SSH_TTY=/dev/pts/27 > > USER=hunter > > > LD_LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/opt/sgi/mpt/mpt-2.04/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/opt/intel/composer_xe_2011_sp1.7.256/debugger/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/lib/intel64:/opt/sgi/sgimc/lib > > > 
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
> > CPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include:/opt/sgi/mpt/mpt-2.04/include
> > NLSPATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/debugger/intel64/locale/%l_%t/%N
> > MAIL=/var/spool/mail/hunter
> > PATH=/opt/intel/composer_xe_2011_sp1.7.256/bin/intel64:/opt/sgi/mpt/mpt-2.04/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/sgi/sgimc/bin:/opt/sgi/sbin:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/bin/intel64:/zeus/d2/users/hunter/bin
> > I_MPI_RDMA_CREATE_CONN_QUAL=0
> >
PWD=/zeus/d2/users/hunter/lbrm/kriging/batch > > _LMFILES_=/usr/share/Modules/modulefiles/mpt/2.04 > > LANG=en_US.UTF-8 > > MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles > > LOADEDMODULES=mpt/2.04 > > MGR_HOME=/opt/sgi/sgimc > > SELINUX_LEVEL_REQUESTED= > > SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass > > HISTCONTROL=ignoredups > > SHLVL=1 > > HOME=/zeus/d2/users/hunter > > MPI_ROOT=/opt/sgi/mpt/mpt-2.04 > > LOGNAME=hunter > > QTLIB=/usr/lib64/qt-3.3/lib > > CVS_RSH=ssh > > SSH_CONNECTION=192.94.173.186 3021 > 192.94.173.9 22 > > MODULESHOME=/usr/share/Modules > > LESSOPEN=|/usr/bin/lesspipe.sh %s > > INCLUDE=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include > > G_BROKEN_FILENAMES=1 > > module=() { eval `/usr/bin/modulecmd bash $*` > > } > > _=/usr/bin/printenv > > OLDPWD=/zeus/d2/users/hunter > > Within torque > > > > MKLROOT=/opt/intel/composer_xe_2011_sp1.7.256/mkl > > > MANPATH=/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/opt/intel/composer_xe_2011_sp1.7.256/man/en_US:/usr/share/man/overrides:/usr/share/man/en:/usr/share/man:/usr/local/share/man:/usr/local/man: > > HOSTNAME=n020 > > PBS_VERSION=TORQUE-2.5.9 > > > INTEL_LICENSE_FILE=/opt/intel/composer_xe_2011_sp1.7.256/licenses:/opt/intel/licenses:/zeus/d2/users/hunter/intel/licenses > > SHELL=/bin/bash > > HISTSIZE=1000 > > PBS_JOBNAME=krigtest > > > LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64 > > FPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include > > PBS_ENVIRONMENT=PBS_BATCH > > OLDPWD=/zeus/d2/users/hunter > > QTDIR=/usr/lib64/qt-3.3 > > QTINC=/usr/lib64/qt-3.3/include > > PBS_O_WORKDIR=/zeus/d2/users/hunter/lbrm/kriging/batch > > DAT_OVERRIDE=/etc/dat.conf > > USER=hunter > > PBS_TASKNUM=1 > > > 
LD_LIBRARY_PATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/opt/intel/composer_xe_2011_sp1.7.256/debugger/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/lib/intel64 > > PBS_O_HOME=/zeus/d2/users/hunter > > CPATH=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include > > PBS_MOMPORT=15003 > > PBS_GPUFILE=/var/spool/torque/aux//785.admin.default.domaingpu > > PBS_O_QUEUE=hydrology > > > NLSPATH=/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64/locale/%l_%t/%N:/opt/intel/composer_xe_2011_sp1.7.256/debugger/intel64/locale/%l_%t/%N > > > PATH=/opt/intel/composer_xe_2011_sp1.7.256/bin/intel64:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/sgi/sgimc/bin:/zeus/d2/users/hunter/bin:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/bin/intel64 > > PBS_O_LOGNAME=hunter > > MAIL=/var/spool/mail/hunter > > PBS_O_LANG=en_US.UTF-8 > > PBS_JOBCOOKIE=153B7A805713EAF7B67869983204CFBD > > I_MPI_RDMA_CREATE_CONN_QUAL=0 > > PWD=/zeus/d2/users/hunter/lbrm/kriging/batch > > LANG=en_US.UTF-8 > > PBS_NODENUM=0 > > MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles > > PBS_NUM_NODES=1 > > LOADEDMODULES= > > PBS_O_SHELL=/bin/bash > > MGR_HOME=/opt/sgi/sgimc > > PBS_SERVER=admin.default.domain > > PBS_JOBID=785.admin.default.domain > > ENVIRONMENT=BATCH > > HISTCONTROL=ignoredups > > HOME=/zeus/d2/users/hunter > > SHLVL=2 > > PBS_O_HOST=admin.default.domain > > PBS_VNODENUM=0 > > LOGNAME=hunter > > CVS_RSH=ssh > > QTLIB=/usr/lib64/qt-3.3/lib > > PBS_QUEUE=hydrology > > MODULESHOME=/usr/share/Modules > > PBS_O_MAIL=/var/spool/mail/hunter > > LESSOPEN=|/usr/bin/lesspipe.sh %s > > PBS_NP=1 > > PBS_NUM_PPN=1 > > 
INCLUDE=/opt/intel/composer_xe_2011_sp1.7.256/mkl/include > > PBS_NODEFILE=/var/spool/torque/aux//785.admin.default.domain > > G_BROKEN_FILENAMES=1 > > > PBS_O_PATH=/opt/intel/composer_xe_2011_sp1.7.256/bin/intel64:/opt/sgi/mpt/mpt-2.04/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/sgi/sgimc/bin:/opt/sgi/sbin:/opt/intel/composer_xe_2011_sp1.7.256/mpirt/bin/intel64:/zeus/d2/users/hunter/bin > > module=() { eval `/usr/bin/modulecmd bash $*` > > } > > _=/usr/bin/printenv > > > > > > Gus Correa > > > > > > On 05/02/2012 11:28 AM, Coyle, James J [ITACD] wrote: > > > Christina, > > > > > > I am going to assume that by interactively, you mean > > > just by running the script or the command rather than > > > using qsub -I > > > If it is the latter, the comments below will not apply. > > > > > > > > > Assuming that your site's policies allow you to run as the > > user in question: (Mine do as long as I am acting on a > > > request for help from the user, though I always ask if > > > it OK first.) > > > > > > Try ssh'ing into the compute node that had the problem > > > as the user in question, then run interactively there, you should > > see the same problem. > > > > > > As root, you can find the node that ran the job by issuing > > > tracejob -n 7 786 > > > if the job was run with the last 7 days. The node > > > on which the job launched would be the first node > > > in the nodelist. > > > > > > Also as root on the head node, you can issue > > > su - tim.hunter > > > > > > to become the user on the head node, then ssh into the node in > > question as the user, and try the script interactively. > > > > > > > > > As Alex said, it could be a ldconfig or LD_LIBRARY_PATH > > > issue, or you could have variables set in the /etc/profile.d > > > scripts or /etc/login.csh of the user is a csh user. > > > (Is the home directory shared so that any rc files are > > > run the same on all nodes.) 
> > > > > > As a workaround, you could see if qsub -V works since that > > > will pass the environment. If it does, then some env variables > > are being set by the user or by a script that > > > runs only on the head node. > > > > > > > > > > > > > > > > > > Also, is /bin/sh the user's normal login shell? > > > > > >> -----Original Message----- > > >> From: torqueusers-bounces at supercluster.org > > > > [mailto:torqueusers- > > > > > > >> bounces at supercluster.org > >] On > > Behalf Of Axel Kohlmeyer > > >> Sent: Wednesday, May 02, 2012 9:22 AM > > >> To: Torque Users Mailing List > > >> Subject: Re: [torqueusers] python in qsub > > >> > > >> christina, > > >> > > >> have you checked whether the problem isn't with R instead of > python? > > >> although the PYTHONPATH environment variable would be the primary > > >> suspect, but it could even by LD_LIBRARY_PATH. python is a bit > > >> strange in reporting certain kinds of error. > > >> > > >> axel. > > >> > > >> On Tue, May 1, 2012 at 1:49 PM, Christina Salls > > >> > >> > wrote: > > >>> Hi all, > > >>> > > >>> My configuration consists of Torque 2.5.9 running on RHEL 6 > > >> on a > > >>> single cluster. I have a user that is able to interactively run > > >>> python and R, but when his job is submitted to Torque, it fails > > >> with this message: > > >>> > > >>> Traceback (most recent call last): > > >>> > > >>> File "../progs/Kriging_Batch.py", line 110, in > > >>> > > >>> import rpy2.robjects as robjects > > >>> > > >>> ImportError: No module named rpy2.robjects > > >>> > > >>> > > >>> I asked him to include a printenv in his script and compared > it to > > >> the > > >>> environment variables that are inherent in his login. I > could not > > >>> tell from the comparison what might be causing this problem. 
> > >>> > > >>> > > >>> His script looks like this: > > >>> > > >>> > > >>> [root at zeus batch]# more runkrig.sh > > >>> > > >>> #!/bin/sh > > >>> > > >>> #submit with qsub runkrig.sh > > >>> > > >>> #PBS -N GaugesJob > > >>> > > >>> #PBS -l select=1:ncpus=1 > > >>> > > >>> #PBS -q hydrology > > >>> > > >>> #PBS -M tim.hunter at noaa.gov > > > > >>> > > >>> source /opt/intel/composer_xe_2011_sp1.7.256/bin/compilervars.sh > > >>> intel64 > > >>> > > >>> > > >>> > > >>> cd /zeus/d2/users/hunter/lbrm/kriging/batch > > >>> > > >>> > > >>> > > >>> echo 'Begin Kriging of Met Data with python and R' > > >>> > > >>> date > > >>> > > >>> python ../progs/Kriging_Batch.py $1 $2 $3 > > >>> > > >>> echo 'Kriging operations complete!' > > >>> > > >>> date > > >>> > > >>> > > >>> The output file looks like this: > > >>> > > >>> > > >>> [root at zeus batch]# more GaugesJob.o786 > > >>> > > >>> Begin Kriging of Met Data with python and R > > >>> > > >>> Fri Apr 27 13:44:42 CDT 2012 > > >>> > > >>> Kriging operations complete! > > >>> > > >>> Fri Apr 27 13:44:42 CDT 2012 > > >>> > > >>> > > >>> > > >>> The Kriging_Batch.py script works just fine interactively. If I > > >> run > > >>> the import command interactively, it also works. > > >>> > > >>> > > >>> eg. > > >>> > > >>> python -c "import rpy2.robjects as robjects" > > >>> > > >>> > > >>> I am sure there is a simple explanation, and if any of you have > > >> any > > >>> clues to lead me in the right direction, I would greatly > > >> appreciate it. > > >>> > > >>> > > >>> Even in the torque environment, the other python imports are > > >> working > > >>> properly. It only seems to be choking on the rpy2 import. 
> > >>> > > >>> > > >>> This is the one portion of the script > > >>> > > >>> > > >>> > #----------------------------------------------------------------- > > >> ---- > > >>> ----------------- > > >>> > > >>> import os > > >>> > > >>> from os.path import normpath > > >>> > > >>> import sys > > >>> > > >>> import shutil > > >>> > > >>> import csv > > >>> > > >>> > > >>> import rpy2.objects as robjects > > >>> > > >>> > > >>> # > > >>> > > >>> r = robjects.r > > >>> > > >>> path = os.getcwd() > > >>> > > >>> print "curdir = %s" %path > > >>> > > >>> > > >>> # > > >>> > > >>> oldPath = os.environ['PATH'].split(os.pathsep) > > >>> > > >>> newPath = os.environ['PATH'].split(os.pathsep) > > >>> > > >>> newPath = os.pathsep.join(newPath[len(oldPath):] + > > >>> newPath[:len(oldPath)]) > > >>> > > >>> > > >>> # > > >>> > > >>> # Load the R functions > > >>> > > >>> # > > >>> > > >>> print "load the R functions" > > >>> > > >>> > > >> > r.load('/zeus/d2/users/hunter/lbrm/kriging/progs/do_krig_batch.RData > > >> ') > > >>> > > >>> > > >>> # > > >>> > > >>> # Force stdout into an unbuffered mode > > >>> > > >>> # > > >>> > > >>> sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) > > >>> > > >>> > > >>> > > >>> Any advice, clues, hints, moral support, etc.... welcomed and > > >> appreciated. > > >>> > > >>> > > >>> Thanks, > > >>> > > >>> > > >>> Christina > > >>> > > >>> > > >>> > > >>> > > >>> -- > > >>> Christina A. Salls > > >>> GLERL Computer Group > > >>> help.glerl at noaa.gov > > > > >>> Help Desk x2127 > > >>> Christina.Salls at noaa.gov > > > > >>> Voice Mail 734-741-2446 > > > >>> > > >>> > > >>> > > >>> _______________________________________________ > > >>> torqueusers mailing list > > >>> torqueusers at supercluster.org > > > > > >>> http://www.supercluster.org/mailman/listinfo/torqueusers > > >>> > > >> > > >> > > >> > > >> -- > > >> Dr. 
Axel Kohlmeyer akohlmey at gmail.com > > > > >> http://sites.google.com/site/akohlmey/ > > >> > > >> Institute for Computational Molecular Science Temple University, > > >> Philadelphia PA, USA. > > >> _______________________________________________ > > >> torqueusers mailing list > > >> torqueusers at supercluster.org > > > > > >> http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > > > > > -- > > Christina A. Salls > > GLERL Computer Group > > help.glerl at noaa.gov > > > > Help Desk x2127 > > Christina.Salls at noaa.gov > > > > Voice Mail 734-741-2446 > > > > > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > Christina A. 
Salls
> GLERL Computer Group
> help.glerl at noaa.gov
> Help Desk x2127
> Christina.Salls at noaa.gov
> Voice Mail 734-741-2446

From mej at lbl.gov Thu May 3 15:25:49 2012
From: mej at lbl.gov (Michael Jennings)
Date: Thu, 3 May 2012 14:25:49 -0700
Subject: [torqueusers] strategies for bad nodes
In-Reply-To:
References:
Message-ID: <20120503212548.GQ9750@lbl.gov>

On Thursday, 03 May 2012, at 17:18:57 (+0000), Edsall, William (WJ) wrote:

> If anyone has suggestions for more useful 'checks' please
> comment. These are pretty custom to our environment but I'd like to
> hear of more. I tried the program from warewulf, but it wasn't
> working well for me.

If you can give more specific details on what "wasn't working well for me" means, I'd be happy to help. :-)

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615

From mej at lbl.gov Thu May 3 17:07:42 2012
From: mej at lbl.gov (Michael Jennings)
Date: Thu, 3 May 2012 16:07:42 -0700
Subject: [torqueusers] Warewulf NHC Documentation Updated
Message-ID: <20120503230742.GS9750@lbl.gov>

For those of you who attended the TORQUE 4.0 talk at MoabCon, I promised to add content to the NHC documentation for writing your own checks. This is now complete, so I'd really like to get some feedback on both the documentation and on NHC itself.

As always, the wiki page can be found here:
http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check

We also have a Google+ presence, so feel free to add the Warewulf page to your Circles if you're a Google+ user:
https://plus.google.com/101880784520844026200

Feedback is welcome on the TORQUE or Warewulf mailing lists, via G+, or direct e-mail. :-)

Thanks and enjoy!
Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From s.breedveld at erasmusmc.nl Fri May 4 02:30:41 2012 From: s.breedveld at erasmusmc.nl (Sebastiaan Breedveld) Date: Fri, 04 May 2012 10:30:41 +0200 Subject: [torqueusers] Simple Torque+Maui setup: jobs stay queued, no resources In-Reply-To: <2445104.utCSrFlz9W@newton.cc.uit.no> References: <4FA133F2.80303@erasmusmc.nl> <2445104.utCSrFlz9W@newton.cc.uit.no> Message-ID: <4FA393B1.5000108@erasmusmc.nl> Dear Roy, On 05/03/2012 01:51 PM, Roy Dragseth wrote: > It seems like you are trying to submit a job which requires 15 GB pvmem on > nodes that have less than 3 GB total virtual memory. Maui also considers > specified memory parameters when scheduling jobs. Thank you for the hint. I have now set the default resources for memory to 1b in the queue: resources_max.ncpus = 8 resources_max.nodect = 10 resources_max.nodes = 2 resources_min.ncpus = 0 resources_default.mem = 1b resources_default.ncpus = 1 resources_default.neednodes = 1:ppn=1 resources_default.nodect = 1 resources_default.nodes = 1 resources_default.pvmem = 1b and additionally submitted a job setting low requirements: $ qsub -q batch -l pvmem=1,mem=1 test-script.sh Unfortunately, the job still does not run: # checkjob -v 61 checking job 61 (RM job '61.testing.azr.nl') State: Idle EState: Deferred Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT WallTime: 00:00:00 of 6:00:00 SubmitTime: Fri May 4 10:27:03 (Time Queued Total: 00:01:34 Eligible: 00:00:01) Total Tasks: 1 Req[0] TaskCount: 1 Partition: ALL Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 1M Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] Exec: '' ExecSize: 0 ImageSize: 0 Dedicated Resources Per Task: PROCS: 1 MEM: 1M SWAP: 1M NodeAccess: SHARED NodeCount: 1 IWD: [NONE] Executable: [NONE] Bypass: 0 StartCount: 0 PartitionMask: [ALL]u Flags:
RESTARTABLE job is deferred. Reason: NoResources (cannot create reservation for job '61' (intital reservation attempt) ) Holds: Defer (hold reason: NoResources) PE: 1.00 StartPriority: 1 cannot select job 61 for partition DEFAULT (job hold active) Any other hints? Thanks in advance, Sebastiaan > r. > > > On Wednesday, May 02, 2012 15:17:38 Sebastiaan Breedveld wrote: >> Dear list, >> >> This may be a Maui issue, but the Maui list seems dead :( >> >> >> I am trying to setup a very basic Torque+Maui system. I am running a >> Torque cluster for a year now, and wanted to improve the scheduling with >> Maui. To this end, I installed a fresh test-system, with server and node >> on a single computer. >> >> Torque version: 2.4.16 >> Maui version: 3.3.1 >> uname: Linux testing 3.2.0-20-generic #33-Ubuntu SMP Tue Mar 27 16:42:26 >> UTC 2012 x86_64 x86_64 x86_64 GNU/Linux >> >> >> >> I was able to run (simple) jobs with the Torque scheduler. When I >> replaced the scheduler with Maui, jobs stay queued. Jobs are submitted by: >> >> $ qsub -q batch test-script.sh >> >> where test-script.sh is nothing more than a 'sleep 1m' script. Checking >> the job: >> >> # checkjob -v 55 >> checking job 55 (RM job '55.testing.azr.nl') >> >> State: Idle EState: Deferred >> Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT >> WallTime: 00:00:00 of 6:00:00 >> SubmitTime: Thu Apr 5 13:21:33 >> (Time Queued Total: 00:00:32 Eligible: 00:00:01) >> >> Total Tasks: 1 >> >> Req[0] TaskCount: 1 Partition: ALL >> Network: [NONE] Memory>= 0 Disk>= 0 Swap>= 15G >> Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] >> Exec: '' ExecSize: 0 ImageSize: 0 >> Dedicated Resources Per Task: PROCS: 1 MEM: 2000M SWAP: 15G >> NodeAccess: SHARED >> NodeCount: 1 >> >> >> IWD: [NONE] Executable: [NONE] >> Bypass: 0 StartCount: 0 >> PartitionMask: [ALL] >> Flags: RESTARTABLE >> >> job is deferred. 
Reason: NoResources (cannot create reservation for >> job '55' (intital reservation attempt) >> ) >> Holds: Defer (hold reason: NoResources) >> PE: 16.03 StartPriority: 1 >> cannot select job 55 for partition DEFAULT (job hold active) >> >> >> >> show that there are no resources available. The node is free, and unloaded: >> >> # checknode testing >> >> >> checking node testing.azr.nl >> >> State: Idle (in current state for 2:23:54) >> Configured Resources: PROCS: 2 MEM: 984M SWAP: 1996M DISK: 1M >> Utilized Resources: SWAP: 149M >> Dedicated Resources: [NONE] >> Opsys: linux Arch: [NONE] >> Speed: 1.00 Load: 0.050 >> Network: [DEFAULT] >> Features: [NONE] >> Attributes: [Batch] >> Classes: [batch 2:2] >> >> Total Time: 16:11:49 Up: 16:11:49 (100.00%) Active: 00:01:00 (0.10%) >> >> Reservations: >> NOTE: no reservations on node >> >> >> >> When the job is added, maui.log shows this: >> 04/05 13:21:34 MPBSJobLoad(55,55.testing.azr.nl,J,TaskList,0) >> 04/05 13:21:34 MReqCreate(55,SrcRQ,DstRQ,DoCreate) >> 04/05 13:21:34 INFO: processing node request line '1' >> 04/05 13:21:34 MJobSetCreds(55,sebastiaan,sebastiaan,) >> 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) >> 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) >> 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) >> 04/05 13:21:34 INFO: job '55' loaded: 1 sebastiaan sebastiaan >> 21600 Idle 0 1333624893 [NONE] [NONE] [NONE]>= 0>= 0 >> [1][ppn=1] 1333624894 >> 04/05 13:21:34 INFO: 12 PBS jobs detected on RM TESTING >> 04/05 13:21:34 INFO: jobs detected: 12 >> 04/05 13:21:34 MStatClearUsage(node,Active) >> 04/05 13:21:34 MClusterUpdateNodeState() >> 04/05 13:21:34 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg) >> 04/05 13:21:34 INFO: job '40' Priority: 22 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 
22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '41' Priority: 22 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '42' Priority: 22 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '44' Priority: 22 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '45' Priority: 22 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '47' Priority: 22 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '48' Priority: 16 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '49' Priority: 12 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '52' Priority: 8 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '53' Priority: 1 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '54' Priority: 60 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '55' Priority: 1 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 MStatClearUsage([NONE],Active) >> 04/05 13:21:34 MResDestroy(NULL) 
>> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11] >> 04/05 13:21:34 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg) >> 04/05 13:21:34 INFO: job '40' Priority: 22 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '41' Priority: 22 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '42' Priority: 22 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '44' Priority: 22 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '45' Priority: 22 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '47' Priority: 22 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '48' Priority: 16 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '49' Priority: 12 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '52' Priority: 8 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 
0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '53' Priority: 1 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '54' Priority: 60 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 INFO: job '55' Priority: 1 >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: >> 0(00.0) >> 04/05 13:21:34 MStatClearUsage([NONE],Idle) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 MResDestroy(NULL) >> 04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11] >> 04/05 13:21:34 >> MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE) >> 04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1 >> 04/05 13:21:34 MQueueScheduleRJobs(Q) >> 04/05 13:21:34 >> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) >> 04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1 >> 04/05 13:21:34 >> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) >> 04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 1/1 >> 04/05 13:21:34 MQueueScheduleIJobs(Q,DEFAULT) >> 04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in >> partition DEFAULT (1 Needed) >> 04/05 13:21:34 MJobPReserve(55,DEFAULT,ResCount,ResCountRej) >> 04/05 13:21:34 MJobReserve(55,Priority) >> 04/05 13:21:34 ALERT: job 55 cannot run in any partition >> 04/05 13:21:34 ALERT: cannot create new reservation for job 55 >> (shape[1] 1) >> 04/05 13:21:34 ALERT: cannot 
create new reservation for job 55 >> 04/05 13:21:34 MJobSetHold(55,16,1:00:00,NoResources,cannot create >> reservation for job '55' (intital reservation attempt) >> ) >> 04/05 13:21:34 ALERT: job '55' cannot run (deferring job for 3600 >> seconds) >> 04/05 13:21:34 WARNING: cannot reserve priority job '55' >> Active Jobs------ >> ------------------ >> 04/05 13:21:34 INFO: resources available after scheduling: N: 1 P: 2 >> 04/05 13:21:34 >> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) >> 04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 0/1 >> [EState: 1] >> 04/05 13:21:34 >> MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE) >> 04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 >> [EState: 1] >> 04/05 13:21:34 >> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) >> 04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 >> [EState: 1] >> 04/05 13:21:34 MSchedUpdateStats() >> 04/05 13:21:34 INFO: iteration: 288 scheduling time: 0.002 seconds >> 04/05 13:21:34 MResUpdateStats() >> 04/05 13:21:34 INFO: current util[288]: 0/1 (0.00%) PH: 0.00% >> active jobs: 0 of 2 (completed: 1) >> 04/05 13:21:34 MQueueCheckStatus() >> 04/05 13:21:34 MNodeCheckStatus() >> 04/05 13:21:34 MUClearChild(PID) >> 04/05 13:21:34 INFO: scheduling complete. sleeping 30 seconds >> >> >> I think the relevant line is: >> 04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in >> partition DEFAULT (1 Needed) >> >> but I have no idea how to make a feasible task for the job. I have tried >> queueing with -l nodes=1:ppn=1 -l walltime=2:00:00, etc. but none seem >> to have had effect. >> >> >> >> Torque config. I have tried setting different attributes to the queue >> properties, hoping that it would have some effect: >> # qmgr -c "p s" >> # >> # Create queues and set their attributes. 
>> # >> # >> # Create and define queue batch >> # >> create queue batch >> set queue batch queue_type = Execution >> set queue batch Priority = 20 >> set queue batch max_running = 8 >> set queue batch resources_max.ncpus = 8 >> set queue batch resources_max.nodect = 10 >> set queue batch resources_max.nodes = 2 >> set queue batch resources_min.ncpus = 0 >> set queue batch resources_default.mem = 2000mb >> set queue batch resources_default.ncpus = 1 >> set queue batch resources_default.neednodes = 1:ppn=1 >> set queue batch resources_default.nodect = 1 >> set queue batch resources_default.nodes = 1 >> set queue batch resources_default.pvmem = 16000mb >> set queue batch resources_default.walltime = 06:00:00 >> set queue batch enabled = True >> set queue batch started = True >> # >> # Set server attributes. >> # >> set server scheduling = True >> set server acl_hosts = testing.azr.nl >> set server log_events = 511 >> set server mail_from = adm >> set server resources_available.nodect = 10 >> set server scheduler_iteration = 600 >> set server node_check_rate = 150 >> set server tcp_timeout = 6 >> set server next_job_number = 56 >> >> >> Maui configuration, untouched: >> # maui.cfg 3.3.1 >> >> SERVERHOST testing >> # primary admin must be first in list >> ADMIN1 root >> >> # Resource Manager Definition >> >> RMCFG[TESTING] TYPE=PBS >> >> # Allocation Manager Definition >> >> AMCFG[bank] TYPE=NONE >> >> # full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html >> # use the 'schedctl -l' command to display current configuration >> >> RMPOLLINTERVAL 00:00:30 >> >> SERVERPORT 42559 >> SERVERMODE NORMAL >> >> # Admin: http://supercluster.org/mauidocs/a.esecurity.html >> >> >> LOGFILE maui.log >> LOGFILEMAXSIZE 10000000 >> LOGLEVEL 3 >> >> # Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html >> >> QUEUETIMEWEIGHT 1 >> >> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html >> >> #FSPOLICY PSDEDICATED >> #FSDEPTH 7 >> 
#FSINTERVAL 86400 >> #FSDECAY 0.80 >> >> # Throttling Policies: >> http://supercluster.org/mauidocs/6.2throttlingpolicies.html >> >> # NONE SPECIFIED >> >> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html >> >> BACKFILLPOLICY FIRSTFIT >> RESERVATIONPOLICY CURRENTHIGHEST >> >> # Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html >> >> NODEALLOCATIONPOLICY MINRESOURCE >> >> # QOS: http://supercluster.org/mauidocs/7.3qos.html >> >> # QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB >> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE >> >> # Standing Reservations: >> http://supercluster.org/mauidocs/7.1.3standingreservations.html >> >> # SRSTARTTIME[test] 8:00:00 >> # SRENDTIME[test] 17:00:00 >> # SRDAYS[test] MON TUE WED THU FRI >> # SRTASKCOUNT[test] 20 >> # SRMAXTIME[test] 0:30:00 >> >> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html >> >> # USERCFG[DEFAULT] FSTARGET=25.0 >> # USERCFG[john] PRIORITY=100 FSTARGET=10.0- >> # GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi >> # CLASSCFG[batch] FLAGS=PREEMPTEE >> # CLASSCFG[interactive] FLAGS=PREEMPTOR >> >> >> >> Any ideas? >> >> Thanks in advance, >> Sebastiaan From joel.schaerer at gmail.com Fri May 4 07:43:17 2012 From: joel.schaerer at gmail.com (Joel Schaerer) Date: Fri, 04 May 2012 15:43:17 +0200 Subject: [torqueusers] Reserving memory for a job In-Reply-To: <007DECE986B47F4EABF823C1FBB19C62010503312ABD@exvic-mbx04.nexus.csiro.au> References: <4FA0F105.1080807@gmail.com> <007DECE986B47F4EABF823C1FBB19C62010503312ABD@exvic-mbx04.nexus.csiro.au> Message-ID: <4FA3DCF5.6040205@gmail.com> Hi Gareth, thanks for your answer! I just tried with vmem, but I unfortunately have the same problem. It seems that the mem and vmem options are simply hard limits on the amount of memory a job can use, and will kill the job if it goes over (which I don't want). Unfortunately, they don't seem to be connected to any resource management logic.
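The difference between a hard limit and a consumable resource can be illustrated with a minimal sketch. This is plain Python, not TORQUE or Maui code; the function name and the node and job sizes are assumptions chosen only for illustration. It shows the accounting that "memory as a resource" would imply: a node accepts jobs until either its cores or its memory run out, whichever comes first.

```python
def max_concurrent_jobs(node_cores, node_mem_gb, job_cores, job_mem_gb):
    """Number of identical jobs that fit on one node when both cores
    and memory are treated as consumable resources (hypothetical helper)."""
    if job_cores <= 0 or job_mem_gb <= 0:
        raise ValueError("per-job requests must be positive")
    fit_by_cores = node_cores // job_cores        # limit imposed by cores
    fit_by_mem = int(node_mem_gb // job_mem_gb)   # limit imposed by memory
    return min(fit_by_cores, fit_by_mem)


# Hypothetical node with 8 cores and 16 GB of RAM.
print(max_concurrent_jobs(8, 16, 1, 8))  # 2: memory runs out first
print(max_concurrent_jobs(8, 16, 1, 1))  # 8: cores run out first
```

A scheduler that only counts cores would place 8 jobs in both cases, which matches the behaviour described in this thread.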
I still have to try the solution based on queue options suggested by Sreedhar Manchu. joel Gareth.Williams at csiro.au wrote: > Hi Joel, > > I only have time to make a brief answer so I'm sending it direct. Try vmem instead. You will find info on this in old mailing list threads. The scheduler you are using may also matter. > > Gareth > > > >> -----Original Message----- >> From: Joel Schaerer [mailto:joel.schaerer at gmail.com] >> Sent: Wednesday, 2 May 2012 6:32 PM >> To: torqueusers at supercluster.org >> Subject: [torqueusers] Reserving memory for a job >> >> Hello all, >> >> I have the following problem: some of my jobs tend to use a large >> amount >> of RAM. I don't want my nodes to start swapping, so the batch manager >> should avoid running too many of these jobs on one node, independently >> of the number of cores available. As an example, if "zalasta" has 16GB >> of RAM, only two of the following jobs should run concurrently, even if >> the machine has 8 cores: >> >> echo "sleep 120" | qsub -N test -lnodes=zalasta,mem=16gb >> >> Unfortunately, torque will run 8 jobs concurrently, even if there is >> not >> enough memory. >> >> So, is there a way to treat available RAM as a resource with >> PBS/Torque? >> Or should I approach the problem in another way? >> >> Thanks, >> >> joel >> >> PS: I have a qsub version 4.0.0, if that matters. >> > > From jujj603 at gmail.com Thu May 3 20:15:15 2012 From: jujj603 at gmail.com (Ju JiaJia) Date: Fri, 4 May 2012 10:15:15 +0800 Subject: [torqueusers] meaning of output of qstat -Qf Message-ID: Hi all: I have some questions about the meaning of the output of qstat -Qf. I checked the PBS ERS, but found nothing detailed. Here is my output (I wrote my questions as comments at the end of the lines): [Fri May 04 09:53:57]user39 at m1:mpich2-1.4.1p1$ qstat -Qf Queue: batch queue_type = Execution max_user_queuable = 6000 # What does this mean?
total_jobs = 4 state_count = Transit:0 Queued:2 Held:0 Waiting:0 Running:2 Exiting:0 resources_min.nodect = 1 # What does this mean? resources_default.nodes = 80 # Does this mean a new job is assigned 80 nodes by default? resources_default.walltime = 720:00:00 # Does this mean a new job gets a default walltime of 720:00:00? mtime = 1331543014 # What does this mean? resources_assigned.nodect = 20 # What does this mean? max_user_run = 10 # What does this mean? enabled = True started = True Any reply will be appreciated. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120504/14982bc6/attachment-0001.html From jujj603 at gmail.com Thu May 3 21:11:31 2012 From: jujj603 at gmail.com (Ju JiaJia) Date: Fri, 4 May 2012 11:11:31 +0800 Subject: [torqueusers] meaning of output of qstat -Qf In-Reply-To: References: Message-ID: I found it here: http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml#acl_group_enable On Fri, May 4, 2012 at 10:15 AM, Ju JiaJia wrote: > Hi all: > > I have some questions about the meaning of output of qstat -Qf. I checked > pbs ers, but found nothing in detail. Here is my output (I write questions > in comments at the end of lines): > [Fri May 04 09:53:57]user39 at m1:mpich2-1.4.1p1$ qstat -Qf > Queue: batch > queue_type = Execution > max_user_queuable = 6000 # What does this > mean? > total_jobs = 4 > state_count = Transit:0 Queued:2 Held:0 Waiting:0 Running:2 Exiting:0 > resources_min.nodect = 1 # What does > this mean? > resources_default.nodes = 80 # Does it mean > the default nodes assigned to a new job is 80 ? > resources_default.walltime = 720:00:00 # Does it mean the > default walltime assigned to a new job is 720:00:00 ? > mtime = 1331543014 # What does > this mean? > resources_assigned.nodect = 20 # What does this > mean ? > max_user_run = 10 # What does > this mean?
> enabled = True > started = True > > Any reply will be appreciated. > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120504/abd8ea26/attachment-0001.html From jujj603 at gmail.com Thu May 3 23:22:45 2012 From: jujj603 at gmail.com (Ju JiaJia) Date: Fri, 4 May 2012 13:22:45 +0800 Subject: [torqueusers] meaning of output of qstat -Qf In-Reply-To: References: Message-ID: Hi, I checked the documentation, but I still don't know what resources_min.nodect means. Can anyone explain this? And what's the difference between nodect and nodes? Any reply will be appreciated. On Fri, May 4, 2012 at 11:11 AM, Ju JiaJia wrote: > I find it here > http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml#acl_group_enable > > On Fri, May 4, 2012 at 10:15 AM, Ju JiaJia wrote: > >> Hi all: >> >> I have some questions about the meaning of output of qstat -Qf. I checked >> pbs ers, but found nothing in detail. Here is my output (I write questions >> in comments at the end of lines): >> [Fri May 04 09:53:57]user39 at m1:mpich2-1.4.1p1$ qstat -Qf >> Queue: batch >> queue_type = Execution >> max_user_queuable = 6000 # What does this >> mean? >> total_jobs = 4 >> state_count = Transit:0 Queued:2 Held:0 Waiting:0 Running:2 Exiting:0 >> resources_min.nodect = 1 # What >> does this mean? >> resources_default.nodes = 80 # Does it mean >> the default nodes assigned to a new job is 80 ? >> resources_default.walltime = 720:00:00 # Does it mean the >> default walltime assigned to a new job is 720:00:00 ? >> mtime = 1331543014 # What does >> this mean? >> resources_assigned.nodect = 20 # What does this >> mean ? >> max_user_run = 10 # What >> does this mean? >> enabled = True >> started = True >> >> Any reply will be appreciated. >> > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120504/464ad7c9/attachment-0001.html From jujj603 at gmail.com Fri May 4 05:33:01 2012 From: jujj603 at gmail.com (Ju JiaJia) Date: Fri, 4 May 2012 19:33:01 +0800 Subject: [torqueusers] Simple Torque+Maui setup: jobs stay queued, no resources In-Reply-To: <4FA393B1.5000108@erasmusmc.nl> References: <4FA133F2.80303@erasmusmc.nl> <2445104.utCSrFlz9W@newton.cc.uit.no> <4FA393B1.5000108@erasmusmc.nl> Message-ID: How many processes and nodes does your test-script.sh use? Check maui.cfg to see whether you are allowed to run multiple processes. If you are not, your job will just hang without any error message. In maui.cfg, look for something like this: USERCFG[your-user-name] MAXJOB=1 MAXQUEUEDJOB=1 MAXNODE=1 MAXPROC=1 and by default you can only run one process on one node. On Fri, May 4, 2012 at 4:30 PM, Sebastiaan Breedveld < s.breedveld at erasmusmc.nl> wrote: > Dear Roy, > > On 05/03/2012 01:51 PM, Roy Dragseth wrote: > > It seems like you are trying to submit a job which requires 15 GB pvmem > on > > nodes that have less than 3 GB total virtual memory. Maui also considers > > specified memory parameters when scheduling jobs. > Thank you for the hint.
I have now set the default resources for memory > to 1b in the queue: > > resources_max.ncpus = 8 > resources_max.nodect = 10 > resources_max.nodes = 2 > resources_min.ncpus = 0 > resources_default.mem = 1b > resources_default.ncpus = 1 > resources_default.neednodes = 1:ppn=1 > resources_default.nodect = 1 > resources_default.nodes = 1 > resources_default.pvmem = 1b > > and additionally submitted a job setting low requirements: > > $ qsub -q batch -l pvmem=1,mem=1 test-script.sh > > Unfortunately, the job still does not run: > > # checkjob -v 61 > > > checking job 61 (RM job '61.testing.azr.nl') > > State: Idle EState: Deferred > Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT > WallTime: 00:00:00 of 6:00:00 > SubmitTime: Fri May 4 10:27:03 > (Time Queued Total: 00:01:34 Eligible: 00:00:01) > > Total Tasks: 1 > > Req[0] TaskCount: 1 Partition: ALL > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 1M > Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] > Exec: '' ExecSize: 0 ImageSize: 0 > Dedicated Resources Per Task: PROCS: 1 MEM: 1M SWAP: 1M > NodeAccess: SHARED > NodeCount: 1 > > > IWD: [NONE] Executable: [NONE] > Bypass: 0 StartCount: 0 > PartitionMask: [ALL]u > Flags: RESTARTABLE > > job is deferred. Reason: NoResources (cannot create reservation for > job '61' (intital reservation attempt) > ) > Holds: Defer (hold reason: NoResources) > PE: 1.00 StartPriority: 1 > cannot select job 61 for partition DEFAULT (job hold active) > > > Any other hints? > > Thanks in advance, > Sebastiaan > > > > r. > > > > > > On Wednesday, May 02, 2012 15:17:38 Sebastiaan Breedveld wrote: > >> Dear list, > >> > >> This may be a Maui issue, but the Maui list seems dead :( > >> > >> > >> I am trying to setup a very basic Torque+Maui system. I am running a > >> Torque cluster for a year now, and wanted to improve the scheduling with > >> Maui. To this end, I installed a fresh test-system, with server and node > >> on a single computer. 
> >> > >> Torque version: 2.4.16 > >> Maui version: 3.3.1 > >> uname: Linux testing 3.2.0-20-generic #33-Ubuntu SMP Tue Mar 27 16:42:26 > >> UTC 2012 x86_64 x86_64 x86_64 GNU/Linux > >> > >> > >> > >> I was able to run (simple) jobs with the Torque scheduler. When I > >> replaced the scheduler with Maui, jobs stay queued. Jobs are submitted > by: > >> > >> $ qsub -q batch test-script.sh > >> > >> where test-script.sh is nothing more than a 'sleep 1m' script. Checking > >> the job: > >> > >> # checkjob -v 55 > >> checking job 55 (RM job '55.testing.azr.nl') > >> > >> State: Idle EState: Deferred > >> Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT > >> WallTime: 00:00:00 of 6:00:00 > >> SubmitTime: Thu Apr 5 13:21:33 > >> (Time Queued Total: 00:00:32 Eligible: 00:00:01) > >> > >> Total Tasks: 1 > >> > >> Req[0] TaskCount: 1 Partition: ALL > >> Network: [NONE] Memory>= 0 Disk>= 0 Swap>= 15G > >> Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] > >> Exec: '' ExecSize: 0 ImageSize: 0 > >> Dedicated Resources Per Task: PROCS: 1 MEM: 2000M SWAP: 15G > >> NodeAccess: SHARED > >> NodeCount: 1 > >> > >> > >> IWD: [NONE] Executable: [NONE] > >> Bypass: 0 StartCount: 0 > >> PartitionMask: [ALL] > >> Flags: RESTARTABLE > >> > >> job is deferred. Reason: NoResources (cannot create reservation for > >> job '55' (intital reservation attempt) > >> ) > >> Holds: Defer (hold reason: NoResources) > >> PE: 16.03 StartPriority: 1 > >> cannot select job 55 for partition DEFAULT (job hold active) > >> > >> > >> > >> show that there are no resources available. 
The node is free, and > unloaded: > >> > >> # checknode testing > >> > >> > >> checking node testing.azr.nl > >> > >> State: Idle (in current state for 2:23:54) > >> Configured Resources: PROCS: 2 MEM: 984M SWAP: 1996M DISK: 1M > >> Utilized Resources: SWAP: 149M > >> Dedicated Resources: [NONE] > >> Opsys: linux Arch: [NONE] > >> Speed: 1.00 Load: 0.050 > >> Network: [DEFAULT] > >> Features: [NONE] > >> Attributes: [Batch] > >> Classes: [batch 2:2] > >> > >> Total Time: 16:11:49 Up: 16:11:49 (100.00%) Active: 00:01:00 (0.10%) > >> > >> Reservations: > >> NOTE: no reservations on node > >> > >> > >> > >> When the job is added, maui.log shows this: > >> 04/05 13:21:34 MPBSJobLoad(55,55.testing.azr.nl,J,TaskList,0) > >> 04/05 13:21:34 MReqCreate(55,SrcRQ,DstRQ,DoCreate) > >> 04/05 13:21:34 INFO: processing node request line '1' > >> 04/05 13:21:34 MJobSetCreds(55,sebastiaan,sebastiaan,) > >> 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) > >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) > >> 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) > >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) > >> 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) > >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) > >> 04/05 13:21:34 INFO: job '55' loaded: 1 sebastiaan sebastiaan > >> 21600 Idle 0 1333624893 [NONE] [NONE] [NONE]>= 0>= 0 > >> [1][ppn=1] 1333624894 > >> 04/05 13:21:34 INFO: 12 PBS jobs detected on RM TESTING > >> 04/05 13:21:34 INFO: jobs detected: 12 > >> 04/05 13:21:34 MStatClearUsage(node,Active) > >> 04/05 13:21:34 MClusterUpdateNodeState() > >> 04/05 13:21:34 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg) > >> 04/05 13:21:34 INFO: job '40' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '41' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 
0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '42' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '44' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '45' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '47' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '48' Priority: 16 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '49' Priority: 12 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '52' Priority: 8 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '53' Priority: 1 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '54' Priority: 60 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '55' Priority: 1 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 MStatClearUsage([NONE],Active) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 
MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11] > >> 04/05 13:21:34 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg) > >> 04/05 13:21:34 INFO: job '40' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '41' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '42' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '44' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '45' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '47' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '48' Priority: 16 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '49' Priority: 12 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '52' Priority: 8 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: 
0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '53' Priority: 1 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '54' Priority: 60 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '55' Priority: 1 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: > >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 MStatClearUsage([NONE],Idle) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 [EState: 11] > >> 04/05 13:21:34 > >> > MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE) > >> 04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1 > >> 04/05 13:21:34 MQueueScheduleRJobs(Q) > >> 04/05 13:21:34 > >> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) > >> 04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1 > >> 04/05 13:21:34 > >> > MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) > >> 04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 1/1 > >> 04/05 13:21:34 MQueueScheduleIJobs(Q,DEFAULT) > >> 04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in > >> partition DEFAULT (1 Needed) > >> 04/05 13:21:34 MJobPReserve(55,DEFAULT,ResCount,ResCountRej) > >> 04/05 13:21:34 MJobReserve(55,Priority) > >> 04/05 13:21:34 ALERT: job 55 cannot run in any partition > >> 04/05 13:21:34 ALERT: 
cannot create new reservation for job 55 > >> (shape[1] 1) > >> 04/05 13:21:34 ALERT: cannot create new reservation for job 55 > >> 04/05 13:21:34 MJobSetHold(55,16,1:00:00,NoResources,cannot create > >> reservation for job '55' (intital reservation attempt) > >> ) > >> 04/05 13:21:34 ALERT: job '55' cannot run (deferring job for 3600 > >> seconds) > >> 04/05 13:21:34 WARNING: cannot reserve priority job '55' > >> Active Jobs------ > >> ------------------ > >> 04/05 13:21:34 INFO: resources available after scheduling: N: 1 P: > 2 > >> 04/05 13:21:34 > >> > MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) > >> 04/05 13:21:34 INFO: total jobs selected in partition DEFAULT: 0/1 > >> [EState: 1] > >> 04/05 13:21:34 > >> MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE) > >> 04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 > >> [EState: 1] > >> 04/05 13:21:34 > >> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) > >> 04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 > >> [EState: 1] > >> 04/05 13:21:34 MSchedUpdateStats() > >> 04/05 13:21:34 INFO: iteration: 288 scheduling time: 0.002 > seconds > >> 04/05 13:21:34 MResUpdateStats() > >> 04/05 13:21:34 INFO: current util[288]: 0/1 (0.00%) PH: 0.00% > >> active jobs: 0 of 2 (completed: 1) > >> 04/05 13:21:34 MQueueCheckStatus() > >> 04/05 13:21:34 MNodeCheckStatus() > >> 04/05 13:21:34 MUClearChild(PID) > >> 04/05 13:21:34 INFO: scheduling complete. sleeping 30 seconds > >> > >> > >> I think the relevant line is: > >> 04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in > >> partition DEFAULT (1 Needed) > >> > >> but I have no idea how to make a feasible task for the job. I have tried > >> queueing with -l nodes=1:ppn=1 -l walltime=2:00:00, etc. but none seem > >> to have had effect. > >> > >> > >> > >> Torque config. 
I have tried setting different attributes to the queue > >> properties, hoping that it would have some effect: > >> # qmgr -c "p s" > >> # > >> # Create queues and set their attributes. > >> # > >> # > >> # Create and define queue batch > >> # > >> create queue batch > >> set queue batch queue_type = Execution > >> set queue batch Priority = 20 > >> set queue batch max_running = 8 > >> set queue batch resources_max.ncpus = 8 > >> set queue batch resources_max.nodect = 10 > >> set queue batch resources_max.nodes = 2 > >> set queue batch resources_min.ncpus = 0 > >> set queue batch resources_default.mem = 2000mb > >> set queue batch resources_default.ncpus = 1 > >> set queue batch resources_default.neednodes = 1:ppn=1 > >> set queue batch resources_default.nodect = 1 > >> set queue batch resources_default.nodes = 1 > >> set queue batch resources_default.pvmem = 16000mb > >> set queue batch resources_default.walltime = 06:00:00 > >> set queue batch enabled = True > >> set queue batch started = True > >> # > >> # Set server attributes. 
> >> # > >> set server scheduling = True > >> set server acl_hosts = testing.azr.nl > >> set server log_events = 511 > >> set server mail_from = adm > >> set server resources_available.nodect = 10 > >> set server scheduler_iteration = 600 > >> set server node_check_rate = 150 > >> set server tcp_timeout = 6 > >> set server next_job_number = 56 > >> > >> > >> Maui configuration, untouched: > >> # maui.cfg 3.3.1 > >> > >> SERVERHOST testing > >> # primary admin must be first in list > >> ADMIN1 root > >> > >> # Resource Manager Definition > >> > >> RMCFG[TESTING] TYPE=PBS > >> > >> # Allocation Manager Definition > >> > >> AMCFG[bank] TYPE=NONE > >> > >> # full parameter docs at > http://supercluster.org/mauidocs/a.fparameters.html > >> # use the 'schedctl -l' command to display current configuration > >> > >> RMPOLLINTERVAL 00:00:30 > >> > >> SERVERPORT 42559 > >> SERVERMODE NORMAL > >> > >> # Admin: http://supercluster.org/mauidocs/a.esecurity.html > >> > >> > >> LOGFILE maui.log > >> LOGFILEMAXSIZE 10000000 > >> LOGLEVEL 3 > >> > >> # Job Priority: > http://supercluster.org/mauidocs/5.1jobprioritization.html > >> > >> QUEUETIMEWEIGHT 1 > >> > >> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html > >> > >> #FSPOLICY PSDEDICATED > >> #FSDEPTH 7 > >> #FSINTERVAL 86400 > >> #FSDECAY 0.80 > >> > >> # Throttling Policies: > >> http://supercluster.org/mauidocs/6.2throttlingpolicies.html > >> > >> # NONE SPECIFIED > >> > >> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html > >> > >> BACKFILLPOLICY FIRSTFIT > >> RESERVATIONPOLICY CURRENTHIGHEST > >> > >> # Node Allocation: > http://supercluster.org/mauidocs/5.2nodeallocation.html > >> > >> NODEALLOCATIONPOLICY MINRESOURCE > >> > >> # QOS: http://supercluster.org/mauidocs/7.3qos.html > >> > >> # QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB > >> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE > >> > >> # Standing Reservations: > >> 
http://supercluster.org/mauidocs/7.1.3standingreservations.html > >> > >> # SRSTARTTIME[test] 8:00:00 > >> # SRENDTIME[test] 17:00:00 > >> # SRDAYS[test] MON TUE WED THU FRI > >> # SRTASKCOUNT[test] 20 > >> # SRMAXTIME[test] 0:30:00 > >> > >> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html > >> > >> # USERCFG[DEFAULT] FSTARGET=25.0 > >> # USERCFG[john] PRIORITY=100 FSTARGET=10.0- > >> # GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi > >> # CLASSCFG[batch] FLAGS=PREEMPTEE > >> # CLASSCFG[interactive] FLAGS=PREEMPTOR > >> > >> > >> > >> Any ideas? > >> > >> Thanks in advance, > >> Sebastiaan > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From knielson at adaptivecomputing.com Fri May 4 10:26:15 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Fri, 4 May 2012 10:26:15 -0600 Subject: [torqueusers] meaning of output of qstat -Qf In-Reply-To: References: Message-ID: On Thu, May 3, 2012 at 11:22 PM, Ju JiaJia wrote: > Hi > I checked the documentation, but i still don't know what > resources_min.nodect mean, Can anyone explain this ? > And What's the difference between nodect and nodes ? > > Any reply will be appreciated. > > resources_min.nodect sets the minimum number of nodes needed in a job before the job can be eligible for the queue. Nodes are different than procs. A node is a host. So for example qsub -l nodes=2:ppn=4 job.sh requests 2 nodes. The nodect is 2. Total procs or cores for the job is 8 (2 nodes x 4 procs/node). Regards Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120504/9eef5f70/attachment.html From gas5x at yahoo.com Fri May 4 10:33:49 2012 From: gas5x at yahoo.com (Grigory Shamov) Date: Fri, 4 May 2012 09:33:49 -0700 (PDT) Subject: [torqueusers] Starting intel mpi tasks in torque - pbsssh wrapper In-Reply-To: Message-ID: <1336149229.98460.YahooMailClassic@web111302.mail.gq1.yahoo.com> Hi All, This is a very interesting discussion. In the latest versions of Intel MPI (4.0 update 1 to update 3), there is the mpiexec.hydra launcher. According to the documentation, it takes a "Resource Management Kernel" switch with one option (PBS) like this: "mpiexec.hydra -rmk pbs". I haven't tested it very thoroughly, but it seems to spawn processes from under Torque. Does anyone have experience in using it? Intel's documentation is very brief; it doesn't really explain what it is. Is the -rmk pbs a native Intel analog of OSC mpiexec? -- Grigory Shamov HPC Analyst University of Manitoba --- On Wed, 5/2/12, Doug Johnson wrote: > From: Doug Johnson > Subject: Re: [torqueusers] Starting intel mpi tasks in torque - pbsssh wrapper > To: "Torque Users Mailing List" > Date: Wednesday, May 2, 2012, 5:15 AM > Hi Gareth, > > For Intel MPI, you also have the option of using OSC > mpiexec. Just > use '-comm pmi', which is the same startup mechanism as > MPICH2. When > using newer versions of the Intel MPI library with OSC > mpiexec, just > make sure to have I_MPI_PMI_EXTENSIONS=on. > > A caveat for using OSC mpiexec is that the startup programs > included > with Intel MPI add extra environment variables that > influence runtime > behavior of the MPI program. These may be useful. > > We have a similar script as yours below that we use with > Platform/HP > MPI. See attached file. It copes with the fact > that some packages > use IP addresses on the rsh command line. 
> > Perhaps the correct solution is to create a proper pbsrsh > that accepts > the same command line flags as rsh, and is more > robust. This would > also allow MPI libraries such as Intel's and Platform's to > have better > Torque integration, with no Torque library dependencies, and > without > relying on external programs such as OSC mpiexec. > > Doug > > > > > At Wed, 2 May 2012 14:12:45 +1000, > > wrote: > > > > Hi All, > > > > I just figured this out after debugging some problems > with left over processes from intel mpi and you > > may find it useful > > > > Recent versions of intel mpi use a variable > I_MPI_MPD_RSH to specify the program (ssh) used to launch > > the mpd processes (maybe old versions have this > too). The program needs to return some info on > stdout. > > I had to modify an existing pbsdsh wrapper to add > the pbsdsh -o option to make it all work. > > > > The wrapper is: > > > > ~/sysadmin/ascutils/common> cat pbsssh > > > > #!/bin/bash > > > > # $Id: pbsssh 2236 2012-05-02 03:16:17Z wil240 $ > > > > # $HeadURL: > svn+ssh://stream/cs/home/svn/sysadmin/ascutils/common/pbsssh > $ > > > > usage="usage: $0 " > > > > #swallow -x -n and -q (for intel mpi) > > > > while getopts "xqn" opt > > > > do > > > > : > > > > done > > > > shift $((OPTIND-1)) > > > > if [ $# -lt 2 ] > > > > then > > > > echo $usage > > > > exit > > > > fi > > > > node=$1 > > > > shift > > > > exec pbsdsh -o -h $node "$@" > > > > And the before and after behavior is illustrated here > (note that the mpi tasks now run in a cpuset > > under the control of torque): > > > > wil240 at n001:~> mpirun -n 2 sh -c 'echo -n `hostname` > | cat - /proc/$$/cpuset' > > > > n002/ > > > > n001/torque/1092920.burnet-srv.idpx.hpsc.csiro.au > > > > wil240 at n001:~> export I_MPI_MPD_RSH=pbsssh > > > > wil240 at n001:~> mpirun -n 2 sh -c 'echo -n `hostname` > | 
cat - /proc/$$/cpuset' > > > > n001/torque/1092920.burnet-srv.idpx.hpsc.csiro.au > > > > n002/torque/1092920.burnet-srv.idpx.hpsc.csiro.au > > > > We'll be updating our intel-mpi environment module to > set I_MPI_MPD_RSH=pbsssh > > > > feedback is welcome. > > > > Gareth > > From gas5x at yahoo.com Fri May 4 10:42:47 2012 From: gas5x at yahoo.com (Grigory Shamov) Date: Fri, 4 May 2012 09:42:47 -0700 (PDT) Subject: [torqueusers] strategies for bad nodes In-Reply-To: <20120503212548.GQ9750@lbl.gov> Message-ID: <1336149767.21114.YahooMailClassic@web111309.mail.gq1.yahoo.com> Dear Michael, Thank you for making the NHC available to the public, and well documented. I don't know what the users' problem with it is, but I could guess that it perhaps didn't work "out of the box". First, the nhc.conf has only examples, which of course always fail. But this is well documented already. Second, the helper scripts that online/offline the nodes make assumptions about where the pbsnodes command is and which user runs it, which in most cases have to be changed. On most of the Rocks installations, for example, Torque lives under /opt/torque, not /usr/local/torque. For me the NHC worked after I fixed these (trivial) things; I also use not 'root' but a dummy user to call pbsnodes from helper scripts (making the dummy user the operator for the local cluster network). 
-- Grigory Shamov HPC Analyst, University of Manitoba --- On Thu, 5/3/12, Michael Jennings wrote: > From: Michael Jennings > Subject: Re: [torqueusers] strategies for bad nodes > To: torqueusers at supercluster.org > Date: Thursday, May 3, 2012, 2:25 PM > On Thursday, 03 May 2012, at 17:18:57 > (+0000), > Edsall, William (WJ) wrote: > > > If anyone has suggestions for more useful 'checks' > please > > comment. These are pretty custom to our environment but > I'd like to > > hear of more. I tried the program from warewulf, but it > wasn't > > working well for me. > > If you can give more specific details on what "wasn't > working well for > me" means, I'd be happy to help. :-) > > Michael > > -- > Michael Jennings > Senior HPC Systems Engineer > High-Performance Computing Services > Lawrence Berkeley National Laboratory > Bldg 50B-3209E W: 510-495-2687 > MS 050B-3209 F: 510-486-8615 > From knielson at adaptivecomputing.com Fri May 4 11:32:06 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Fri, 4 May 2012 11:32:06 -0600 Subject: [torqueusers] TORQUE 4.0.1 available. Message-ID: Hi all, TORQUE 4.0.1 is now available for general use. The biggest change between 4.0.0 and 4.0.1 is the way TORQUE handles new requests on port 15001 in pbs_server. Previously, new requests coming to port 15001 had to wait for a select to be called in wait_request which is part of the main_loop function on the server. This allowed only one connect request at a time to be serviced each time through the loop. This has been replaced by a listening thread which accepts each new connection as it happens and then creates a new thread to process the incoming request. This dramatically increases the number of new requests which can be handled by TORQUE. 
Aside from fixing several bugs, which can be found in the CHANGELOG, considerable code refactoring has been done in 4.0.1 to improve scalability, performance and reliability. Thanks to everyone who has helped to make this release possible. As always, community input is welcome and necessary to make TORQUE a successful open-source project. The new code can be downloaded from the new TORQUE page at http://www.adaptivecomputing.com/support/download-center/torque-download/ Find *TORQUE Downloads* and *TORQUE 4.0.x*. Then click *Get download*. The file to choose is torque-4.0.1.tar.gz. Regards Ken Nielson Adaptive Computing From knielson at adaptivecomputing.com Fri May 4 11:50:25 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Fri, 4 May 2012 11:50:25 -0600 Subject: [torqueusers] TORQUE 3.0.5 available Message-ID: Hi all, TORQUE 3.0.5 is now available for general use. There were no new features added to TORQUE 3.0.5. Below is the content of the CHANGELOG for 3.0.5. TORQUE 3.0.x is recommended for sites using NUMA systems similar to SGI's Ultra-violet. The same functionality is now also available in TORQUE 4.0.1, which we recommend for everyone. Regards Ken Nielson Adaptive Computing 3.0.5 Change log b - fix for writing too much data when job_script is saved to job log. b - fix for where pbs_mom would not automatically set gpu mode. b - fix for aligning qstat -r output when configured with -DTXT. e - Change size of transfer block used on job rerun from 4k to 64k. b - With nvidia gpus, TORQUE was losing the directive of what nodes it should run the job on from Moab. Corrected. 
e - add the $PBS_WALLTIME variable to jobs, thanks to a patch from Mark Roberts n - change moab_array_compatible server parameter so it defaults to true b - fix so comma separated file list can be used with qsub -W stagein/stageout. Matches qsub documentation again. (backported from 4.0.1) e - change to allow pbs_mom to run if configured with --enable-nvidia-gpus but installed on a node without Nvidia gpus. b - fix for where pbsnodes displays outdated gpu_status information. b - Fixed a problem where starting pbs_mom with the default -p option would delete running jobs instead of trying to recover them if cpusets were enabled. TRQ-828 Fixes Bugzilla 174. b - Added a check for a NULL job pointer in modify_job. This fixes TRQ-818 b - fix problem with '+ and segfault when using multiple node gpu requests. (backported from 4.0.1) b - fix problem with mom segfault when using 8 or more gpus on mom node. (backported from 4.0.1) b - Fixed a bug for NUMA cpusets where Nodes were being underallocated for the job. TRQ-845 b - Fixed a problem with cpusets where if a boot cpuset is present under /dev/cpuset torque would only write out 1024 cpus. This was changed from a hardcoded value to be the maximum size found in /dev/cpuset/cpus. TRQ-840 b - Fix so child pbs_mom does not remain running after qdel on slow starting job. TRQ-860. (backported from 4.0.1) b - Fix so pbs_mom won't segfault after a qdel is done for a job that is still running the prologue. TRQ-832. (backported from 4.0.1) b - Fixed yet another case where procct would get promoted to Moab causing jobs to never be run. Added remove_procct to ROUTE_PERM_FAILURE case in default_router. b - Fix so qrun jobid[] does not cause pbs_server segfault. TRQ-865. (backported from 4.0.2) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120504/5308aebb/attachment-0001.html From bircoph at gmail.com Fri May 4 12:32:02 2012 From: bircoph at gmail.com (Andrew Savchenko) Date: Fri, 4 May 2012 22:32:02 +0400 Subject: [torqueusers] How to disenage email notifications? Message-ID: <20120504223202.1d5b76b3.bircoph@gmail.com> Hello, I want to completely disable Torque email notifications for all users. qsub_filter is not an option, because qsub -m CLI argument overrides #PBS directive in the qsub script. Complete MTA removal from the system is also undesired. I can't find any config or qmgr option for this task... I use torque-3.0.5 (just upgraded). Best regards, Andrew Savchenko -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120504/54ab0809/attachment.bin From mej at lbl.gov Fri May 4 12:49:51 2012 From: mej at lbl.gov (Michael Jennings) Date: Fri, 4 May 2012 11:49:51 -0700 Subject: [torqueusers] strategies for bad nodes In-Reply-To: <1336149767.21114.YahooMailClassic@web111309.mail.gq1.yahoo.com> References: <20120503212548.GQ9750@lbl.gov> <1336149767.21114.YahooMailClassic@web111309.mail.gq1.yahoo.com> Message-ID: <20120504184950.GW9750@lbl.gov> On Friday, 04 May 2012, at 09:42:47 (-0700), Grigory Shamov wrote: > Thank you for making the NHC available to the public, and well documented. Thank *you* for your very valuable feedback, and for taking the time to troubleshoot the problems you encountered! :-) As a token of appreciation, here's an undocumented tip: Any variable NHC uses, including the path and filename for the configuration file, can be overridden in /etc/sysconfig/nhc. This can be used to change the value of $PATH, alter runtime options, etc. It's sourced, so any valid shell commands can be run there. 
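The override mechanism in the tip above can be sketched roughly as follows. This is only an illustration of how a sourced settings file behaves, not NHC's actual code: a temporary file stands in for /etc/sysconfig/nhc so the snippet runs anywhere, and the values shown (/opt/torque/bin on $PATH, DEBUG=1) are assumptions picked for the example, not NHC defaults.

```shell
#!/bin/bash
# Stand-in for /etc/sysconfig/nhc (hypothetical contents; a temp file is
# used so this runs on a machine without NHC installed).
SYSCONFIG_FILE="$(mktemp)"
cat > "$SYSCONFIG_FILE" <<'EOF'
# Any valid shell is allowed here because the file is sourced, not parsed.
PATH="/opt/torque/bin:$PATH"
DEBUG=1
EOF

# A default that the calling script might set before loading overrides.
DEBUG=0

# Source the file in the current shell so its assignments replace the defaults.
if [ -f "$SYSCONFIG_FILE" ]; then
    . "$SYSCONFIG_FILE"
fi

echo "DEBUG=$DEBUG"
echo "first PATH entry: ${PATH%%:*}"
rm -f "$SYSCONFIG_FILE"
```

Because the file is sourced rather than read as key=value pairs, anything it sets is visible to the rest of the script, which is why changing $PATH there is enough to point the helpers at a non-default pbsnodes location.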
> I dont know what the users' problem with it; but I could guess that > it perhaps didnt work "out of the box". First, the nhc.conf has only > examples, which of course always fail. But this is well documented > already. Yes, you're right. Do you think it would be better if the examples were all commented out so that the check would "succeed" out-of-the-box? Or would that be less valuable? > Second, the helper scripts that online/offline the nodes have > assumptions on where the pbsnodes command is and which user runs it; > which in most cases have to be changed. On most of the Rocks > installations, for example, Torque lives under /opt/torque, not > /usr/local/torque. Good point. We always use the TORQUE RPMs, so the paths are correct for us, but I certainly realize that many don't. So I've just committed the attached patch which does the following: - Removes debugging stuff for UIDs > 100 so that NHC can potentially be run as non-root in a normal fashion. - Convert node online/offline scripts to use variables and $PATH to identify where the "pbsnodes" command is and what arguments it should take. - Add an "eval" to the execution of the check so that shell variables can be used or altered in config files. With this patch, you can put things in your config file like the following: * || export PBSNODES=/opt/torque/bin/pbsnodes *.testbed || export MARK_OFFLINE=0 or even *.cluster1 || NETDEV=eth0 *.cluster2 || NETDEV=eth1 *.cluster3 || NETDEV=virbr0 *.cluster? || check_hw_eth $NETDEV > For me the NHC worked after I fixed these (trivial) things; I also > use not 'root' but a dummy user to call pbsnodes from helper scripts > (making the dummy user the operator for local cluster network). Something like the following should now work for you out-of-the-box with the attached patch: * || export PBSNODES="su - pbsadmin -c /opt/torque/bin/pbsnodes" Thanks again for your comments! 
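For anyone wondering why the "eval" matters for config lines such as the NETDEV ones above, here is a small sketch. It is not taken from NHC itself; CHECK and the values assigned are stand-ins chosen for the demonstration. Without eval, a check string that is really a variable assignment is looked up as a command name and fails; with eval, the shell re-parses the string and the assignment takes effect. The last lines show the ${VAR:-default} idiom the patch uses for PBSNODES.

```shell
#!/bin/bash
# A "check" that is really a variable assignment, as in the config examples.
CHECK='NETDEV=eth0'

# Plain expansion: the shell searches for a command literally named
# "NETDEV=eth0", so nothing is assigned and the status is non-zero.
unset NETDEV
$CHECK 2>/dev/null
echo "without eval: rc=$? NETDEV='${NETDEV:-}'"

# eval re-parses the string, so it is handled as an ordinary assignment.
eval $CHECK
echo "with eval:    rc=$? NETDEV='${NETDEV:-}'"

# Default-if-unset expansion: keep a caller-supplied PBSNODES, else fall
# back to whatever "pbsnodes" resolves to on $PATH.
unset PBSNODES
PBSNODES="${PBSNODES:-pbsnodes}"
echo "PBSNODES=$PBSNODES"
```

The same property is what lets an "export PBSNODES=..." line in the config file reach the helper scripts: the assignment is evaluated in the running shell and exported to the child processes it later starts.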
Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 -------------- next part -------------- Index: helpers/node-mark-online =================================================================== --- helpers/node-mark-online (revision 931) +++ helpers/node-mark-online (working copy) @@ -13,8 +13,8 @@ # previously offlined. It will first obtain the current node state # information to avoid onlining nodes it didn't originally offline. -PATH="/sbin:/usr/sbin:/bin:/usr/bin" -PBSNODES="/usr/bin/pbsnodes" +PBSNODES="${PBSNODES:-pbsnodes}" +PBSNODES_ARGS="${PBSNODES_ARGS:--c -N}" LEADER="NHC:" echo "`date '+%Y%m%d %H:%M:%S'` $0 $*" @@ -34,7 +34,7 @@ exit 0 fi echo "$0: Marking $HOSTNAME online and clearing note ($OLD_NOTE_LEADER $OLD_NOTE)" - exec $PBSNODES -c -N '' $HOSTNAME + exec $PBSNODES $PBSNODES_ARGS '' $HOSTNAME ;; esac echo "$0: Skipping $STATUS node $HOSTNAME ($OLD_NOTE_LEADER $OLD_NOTE)" Index: helpers/node-mark-offline =================================================================== --- helpers/node-mark-offline (revision 931) +++ helpers/node-mark-offline (working copy) @@ -14,8 +14,8 @@ # which were not placed by NHC. If these checks pass, the node is # marked offline with the note supplied. -PATH="/sbin:/usr/sbin:/bin:/usr/bin" -PBSNODES="/usr/bin/pbsnodes" +PBSNODES="${PBSNODES:-pbsnodes}" +PBSNODES_ARGS="${PBSNODES_ARGS:--o -N}" LEADER="NHC:" echo "`date '+%Y%m%d %H:%M:%S'` $0 $*" @@ -39,4 +39,4 @@ esac echo "$0: Marking $STATUS $HOSTNAME offline: $LEADER $NOTE" -exec $PBSNODES -o -N "$LEADER $NOTE" $HOSTNAME +exec $PBSNODES $PBSNODES_ARGS "$LEADER $NOTE" $HOSTNAME Index: nhc =================================================================== --- nhc (revision 931) +++ nhc (working copy) @@ -102,15 +102,6 @@ ### Script guts begin here. -# If not root, change config paths for debugging. 
-if [ $EUID -gt 100 ]; then - CONFDIR="$PWD" - CONFFILE="$CONFDIR/$NAME.conf" - INCDIR="$CONFDIR/scripts" - LOGFILE="" - DEBUG=1 -fi - # Load settings from system-wide location. if [ -f $SYSCONFIGDIR/$NAME ]; then . $SYSCONFIGDIR/$NAME @@ -144,7 +135,7 @@ # Run the check. log "Running check: \"$CHECK\"" - $CHECK + eval $CHECK RET=$? # Check for failure. From s.breedveld at erasmusmc.nl Fri May 4 12:52:49 2012 From: s.breedveld at erasmusmc.nl (Sebastiaan Breedveld) Date: Fri, 04 May 2012 20:52:49 +0200 Subject: [torqueusers] Simple Torque+Maui setup: jobs stay queued, no resources In-Reply-To: References: <4FA133F2.80303@erasmusmc.nl> <2445104.utCSrFlz9W@newton.cc.uit.no> <4FA393B1.5000108@erasmusmc.nl> Message-ID: <4FA42581.4010901@erasmusmc.nl> Dear Ju, On 05/04/2012 01:33 PM, Ju JiaJia wrote: > How many processes and nodes used in you test-script.sh ? Nothing special defined: #!/bin/bash # # This is the unique output file for this job OUTPUTFILE=test_output_${PBS_JOBID}.log # Print echo "Output for this job:" > $OUTPUTFILE # Do something sleep 1m # Get a list of environment variables date >> $OUTPUTFILE # Do something more sleep 5s # quit exit 0 > Check maui.cfg to see whether you have enough right to run multi > processes. If you don't, you job will just hang and without any error > messages. > In maui.cfg, something like this: > USERCFG[your-user-name] MAXJOB=1 MAXQUEUEDJOB=1 MAXNODE=1 MAXPROC=1 > and default you can only run one process on one node. > There are no lines like that. 
The maui.cfg is untouched: # maui.cfg 3.3.1 SERVERHOST testing ADMIN1 root RMCFG[ULURU-TESTING] TYPE=PBS AMCFG[bank] TYPE=NONE RMPOLLINTERVAL 00:00:30 SERVERPORT 42559 SERVERMODE NORMAL LOGFILE maui.log LOGFILEMAXSIZE 10000000 LOGLEVEL 3 QUEUETIMEWEIGHT 1 BACKFILLPOLICY FIRSTFIT RESERVATIONPOLICY CURRENTHIGHEST NODEALLOCATIONPOLICY MINRESOURCE Sincerely, Sebastiaan > On Fri, May 4, 2012 at 4:30 PM, Sebastiaan Breedveld > > wrote: > > Dear Roy, > > On 05/03/2012 01:51 PM, Roy Dragseth wrote: > > It seems like you are trying to submit a job wich requires 15 GB > pvmem on a > > nodes that have less than 3 GB total virtual memory. Maui also > considers > > specified memory parameters when scheduling jobs. > Thank you for the hint. I have now set the default resources for > memory > to 1b in the queue: > > resources_max.ncpus = 8 > resources_max.nodect = 10 > resources_max.nodes = 2 > resources_min.ncpus = 0 > resources_default.mem = 1b > resources_default.ncpus = 1 > resources_default.neednodes = 1:ppn=1 > resources_default.nodect = 1 > resources_default.nodes = 1 > resources_default.pvmem = 1b > > and additionally submitted a job setting low requirements: > > $ qsub -q batch -l pvmem=1,mem=1 test-script.sh > > Unfortunately, the job still does not run: > > # checkjob -v 61 > > > checking job 61 (RM job '61.testing.azr.nl > ') > > State: Idle EState: Deferred > Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT > WallTime: 00:00:00 of 6:00:00 > SubmitTime: Fri May 4 10:27:03 > (Time Queued Total: 00:01:34 Eligible: 00:00:01) > > Total Tasks: 1 > > Req[0] TaskCount: 1 Partition: ALL > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 1M > Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] > Exec: '' ExecSize: 0 ImageSize: 0 > Dedicated Resources Per Task: PROCS: 1 MEM: 1M SWAP: 1M > NodeAccess: SHARED > NodeCount: 1 > > > IWD: [NONE] Executable: [NONE] > Bypass: 0 StartCount: 0 > PartitionMask: [ALL]u > Flags: RESTARTABLE > > job is deferred. 
Reason: NoResources (cannot create reservation for > job '61' (intital reservation attempt) > ) > Holds: Defer (hold reason: NoResources) > PE: 1.00 StartPriority: 1 > cannot select job 61 for partition DEFAULT (job hold active) > > > Any other hints? > > Thanks in advance, > Sebastiaan > > > > r. > > > > > > On Wednesday, May 02, 2012 15:17:38 Sebastiaan Breedveld wrote: > >> Dear list, > >> > >> This may be a Maui issue, but the Maui list seems dead :( > >> > >> > >> I am trying to setup a very basic Torque+Maui system. I am > running a > >> Torque cluster for a year now, and wanted to improve the > scheduling with > >> Maui. To this end, I installed a fresh test-system, with server > and node > >> on a single computer. > >> > >> Torque version: 2.4.16 > >> Maui version: 3.3.1 > >> uname: Linux testing 3.2.0-20-generic #33-Ubuntu SMP Tue Mar 27 > 16:42:26 > >> UTC 2012 x86_64 x86_64 x86_64 GNU/Linux > >> > >> > >> > >> I was able to run (simple) jobs with the Torque scheduler. When I > >> replaced the scheduler with Maui, jobs stay queued. Jobs are > submitted by: > >> > >> $ qsub -q batch test-script.sh > >> > >> where test-script.sh is nothing more than a 'sleep 1m' script. 
> Checking > >> the job: > >> > >> # checkjob -v 55 > >> checking job 55 (RM job '55.testing.azr.nl > ') > >> > >> State: Idle EState: Deferred > >> Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT > >> WallTime: 00:00:00 of 6:00:00 > >> SubmitTime: Thu Apr 5 13:21:33 > >> (Time Queued Total: 00:00:32 Eligible: 00:00:01) > >> > >> Total Tasks: 1 > >> > >> Req[0] TaskCount: 1 Partition: ALL > >> Network: [NONE] Memory>= 0 Disk>= 0 Swap>= 15G > >> Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] > >> Exec: '' ExecSize: 0 ImageSize: 0 > >> Dedicated Resources Per Task: PROCS: 1 MEM: 2000M SWAP: 15G > >> NodeAccess: SHARED > >> NodeCount: 1 > >> > >> > >> IWD: [NONE] Executable: [NONE] > >> Bypass: 0 StartCount: 0 > >> PartitionMask: [ALL] > >> Flags: RESTARTABLE > >> > >> job is deferred. Reason: NoResources (cannot create > reservation for > >> job '55' (intital reservation attempt) > >> ) > >> Holds: Defer (hold reason: NoResources) > >> PE: 16.03 StartPriority: 1 > >> cannot select job 55 for partition DEFAULT (job hold active) > >> > >> > >> > >> show that there are no resources available. 
The node is free, > and unloaded: > >> > >> # checknode testing > >> > >> > >> checking node testing.azr.nl > >> > >> State: Idle (in current state for 2:23:54) > >> Configured Resources: PROCS: 2 MEM: 984M SWAP: 1996M DISK: 1M > >> Utilized Resources: SWAP: 149M > >> Dedicated Resources: [NONE] > >> Opsys: linux Arch: [NONE] > >> Speed: 1.00 Load: 0.050 > >> Network: [DEFAULT] > >> Features: [NONE] > >> Attributes: [Batch] > >> Classes: [batch 2:2] > >> > >> Total Time: 16:11:49 Up: 16:11:49 (100.00%) Active: 00:01:00 > (0.10%) > >> > >> Reservations: > >> NOTE: no reservations on node > >> > >> > >> > >> When the job is added, maui.log shows this: > >> 04/05 13:21:34 MPBSJobLoad(55,55.testing.azr.nl > ,J,TaskList,0) > >> 04/05 13:21:34 MReqCreate(55,SrcRQ,DstRQ,DoCreate) > >> 04/05 13:21:34 INFO: processing node request line '1' > >> 04/05 13:21:34 MJobSetCreds(55,sebastiaan,sebastiaan,) > >> 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) > >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) > >> 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) > >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) > >> 04/05 13:21:34 INFO: default QOS for job 55 set to DEFAULT(0) > >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) > >> 04/05 13:21:34 INFO: job '55' loaded: 1 sebastiaan sebastiaan > >> 21600 Idle 0 1333624893 [NONE] [NONE] [NONE]>= > 0>= 0 > >> [1][ppn=1] 1333624894 > >> 04/05 13:21:34 INFO: 12 PBS jobs detected on RM TESTING > >> 04/05 13:21:34 INFO: jobs detected: 12 > >> 04/05 13:21:34 MStatClearUsage(node,Active) > >> 04/05 13:21:34 MClusterUpdateNodeState() > >> 04/05 13:21:34 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg) > >> 04/05 13:21:34 INFO: job '40' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '41' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 
22(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '42' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '44' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '45' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '47' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '48' Priority: 16 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '49' Priority: 12 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '52' Priority: 8 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '53' Priority: 1 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '54' Priority: 60 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '55' Priority: 1 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 MStatClearUsage([NONE],Active) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > 
>> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 > [EState: 11] > >> 04/05 13:21:34 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg) > >> 04/05 13:21:34 INFO: job '40' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '41' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '42' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '44' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '45' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '47' Priority: 22 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '48' Priority: 16 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '49' Priority: 12 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '52' Priority: 8 > >> 04/05 
13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '53' Priority: 1 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '54' Priority: 60 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 INFO: job '55' Priority: 1 > >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: 0(00.0) > Attr: > >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: > 0(00.0) Us: > >> 0(00.0) > >> 04/05 13:21:34 MStatClearUsage([NONE],Idle) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 MResDestroy(NULL) > >> 04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 > [EState: 11] > >> 04/05 13:21:34 > >> > MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE) > >> 04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1 > >> 04/05 13:21:34 MQueueScheduleRJobs(Q) > >> 04/05 13:21:34 > >> > MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) > >> 04/05 13:21:34 INFO: total jobs selected in partition ALL: 1/1 > >> 04/05 13:21:34 > >> > MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) > >> 04/05 13:21:34 INFO: total jobs selected in partition > DEFAULT: 1/1 > >> 04/05 13:21:34 MQueueScheduleIJobs(Q,DEFAULT) > >> 04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in > >> partition DEFAULT (1 Needed) > >> 04/05 13:21:34 MJobPReserve(55,DEFAULT,ResCount,ResCountRej) > >> 04/05 13:21:34 
MJobReserve(55,Priority) > >> 04/05 13:21:34 ALERT: job 55 cannot run in any partition > >> 04/05 13:21:34 ALERT: cannot create new reservation for job 55 > >> (shape[1] 1) > >> 04/05 13:21:34 ALERT: cannot create new reservation for job 55 > >> 04/05 13:21:34 MJobSetHold(55,16,1:00:00,NoResources,cannot create > >> reservation for job '55' (intital reservation attempt) > >> ) > >> 04/05 13:21:34 ALERT: job '55' cannot run (deferring job for > 3600 > >> seconds) > >> 04/05 13:21:34 WARNING: cannot reserve priority job '55' > >> Active Jobs------ > >> ------------------ > >> 04/05 13:21:34 INFO: resources available after scheduling: > N: 1 P: 2 > >> 04/05 13:21:34 > >> > MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) > >> 04/05 13:21:34 INFO: total jobs selected in partition > DEFAULT: 0/1 > >> [EState: 1] > >> 04/05 13:21:34 > >> > MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE) > >> 04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 > >> [EState: 1] > >> 04/05 13:21:34 > >> > MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) > >> 04/05 13:21:34 INFO: total jobs selected in partition ALL: 0/1 > >> [EState: 1] > >> 04/05 13:21:34 MSchedUpdateStats() > >> 04/05 13:21:34 INFO: iteration: 288 scheduling time: > 0.002 seconds > >> 04/05 13:21:34 MResUpdateStats() > >> 04/05 13:21:34 INFO: current util[288]: 0/1 (0.00%) PH: 0.00% > >> active jobs: 0 of 2 (completed: 1) > >> 04/05 13:21:34 MQueueCheckStatus() > >> 04/05 13:21:34 MNodeCheckStatus() > >> 04/05 13:21:34 MUClearChild(PID) > >> 04/05 13:21:34 INFO: scheduling complete. sleeping 30 seconds > >> > >> > >> I think the relevant line is: > >> 04/05 13:21:34 INFO: 0 feasible tasks found for job 55:0 in > >> partition DEFAULT (1 Needed) > >> > >> but I have no idea how to make a feasible task for the job. I > have tried > >> queueing with -l nodes=1:ppn=1 -l walltime=2:00:00, etc. but > none seem > >> to have had effect. 
> >> > >> > >> > >> Torque config. I have tried setting different attributes to the > queue > >> properties, hoping that it would have some effect: > >> # qmgr -c "p s" > >> # > >> # Create queues and set their attributes. > >> # > >> # > >> # Create and define queue batch > >> # > >> create queue batch > >> set queue batch queue_type = Execution > >> set queue batch Priority = 20 > >> set queue batch max_running = 8 > >> set queue batch resources_max.ncpus = 8 > >> set queue batch resources_max.nodect = 10 > >> set queue batch resources_max.nodes = 2 > >> set queue batch resources_min.ncpus = 0 > >> set queue batch resources_default.mem = 2000mb > >> set queue batch resources_default.ncpus = 1 > >> set queue batch resources_default.neednodes = 1:ppn=1 > >> set queue batch resources_default.nodect = 1 > >> set queue batch resources_default.nodes = 1 > >> set queue batch resources_default.pvmem = 16000mb > >> set queue batch resources_default.walltime = 06:00:00 > >> set queue batch enabled = True > >> set queue batch started = True > >> # > >> # Set server attributes. 
> >> # > >> set server scheduling = True > >> set server acl_hosts = testing.azr.nl > >> set server log_events = 511 > >> set server mail_from = adm > >> set server resources_available.nodect = 10 > >> set server scheduler_iteration = 600 > >> set server node_check_rate = 150 > >> set server tcp_timeout = 6 > >> set server next_job_number = 56 > >> > >> > >> Maui configuration, untouched: > >> # maui.cfg 3.3.1 > >> > >> SERVERHOST testing > >> # primary admin must be first in list > >> ADMIN1 root > >> > >> # Resource Manager Definition > >> > >> RMCFG[TESTING] TYPE=PBS > >> > >> # Allocation Manager Definition > >> > >> AMCFG[bank] TYPE=NONE > >> > >> # full parameter docs at > http://supercluster.org/mauidocs/a.fparameters.html > >> # use the 'schedctl -l' command to display current configuration > >> > >> RMPOLLINTERVAL 00:00:30 > >> > >> SERVERPORT 42559 > >> SERVERMODE NORMAL > >> > >> # Admin: http://supercluster.org/mauidocs/a.esecurity.html > >> > >> > >> LOGFILE maui.log > >> LOGFILEMAXSIZE 10000000 > >> LOGLEVEL 3 > >> > >> # Job Priority: > http://supercluster.org/mauidocs/5.1jobprioritization.html > >> > >> QUEUETIMEWEIGHT 1 > >> > >> # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html > >> > >> #FSPOLICY PSDEDICATED > >> #FSDEPTH 7 > >> #FSINTERVAL 86400 > >> #FSDECAY 0.80 > >> > >> # Throttling Policies: > >> http://supercluster.org/mauidocs/6.2throttlingpolicies.html > >> > >> # NONE SPECIFIED > >> > >> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html > >> > >> BACKFILLPOLICY FIRSTFIT > >> RESERVATIONPOLICY CURRENTHIGHEST > >> > >> # Node Allocation: > http://supercluster.org/mauidocs/5.2nodeallocation.html > >> > >> NODEALLOCATIONPOLICY MINRESOURCE > >> > >> # QOS: http://supercluster.org/mauidocs/7.3qos.html > >> > >> # QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB > >> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE > >> > >> # Standing Reservations: > >> 
http://supercluster.org/mauidocs/7.1.3standingreservations.html
> >>
> >> # SRSTARTTIME[test] 8:00:00
> >> # SRENDTIME[test] 17:00:00
> >> # SRDAYS[test] MON TUE WED THU FRI
> >> # SRTASKCOUNT[test] 20
> >> # SRMAXTIME[test] 0:30:00
> >>
> >> # Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html
> >>
> >> # USERCFG[DEFAULT] FSTARGET=25.0
> >> # USERCFG[john] PRIORITY=100 FSTARGET=10.0-
> >> # GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
> >> # CLASSCFG[batch] FLAGS=PREEMPTEE
> >> # CLASSCFG[interactive] FLAGS=PREEMPTOR
> >>
> >> Any ideas?
> >>
> >> Thanks in advance,
> >> Sebastiaan

From s.breedveld at erasmusmc.nl Fri May 4 13:10:46 2012
From: s.breedveld at erasmusmc.nl (Sebastiaan Breedveld)
Date: Fri, 04 May 2012 21:10:46 +0200
Subject: [torqueusers] How to disenage email notifications?
In-Reply-To: <20120504223202.1d5b76b3.bircoph@gmail.com>
References: <20120504223202.1d5b76b3.bircoph@gmail.com>
Message-ID: <4FA429B6.4050004@erasmusmc.nl>

Hi,

On 05/04/2012 08:32 PM, Andrew Savchenko wrote:
> Hello,
>
> I want to completely disable Torque email notifications for all
> users. qsub_filter is not an option, because qsub -m CLI argument
> overrides #PBS directive in the qsub script. Complete MTA removal
> from the system is also undesired. I can't find any config or qmgr
> option for this task...
You can do that by setting mail_domain to never:

    set server mail_domain = never

I once read this is an undocumented option, but works well.

Greetings,
Sebastiaan

> I use torque-3.0.5 (just upgraded).
>
> Best regards,
> Andrew Savchenko

--
Sebastiaan Breedveld, MSc.
Ph.D. student

Erasmus MC - Daniel den Hoed Cancer Center
Department of Radiation Oncology
Groene Hilledijk 301
3075 EA Rotterdam
The Netherlands
Phone: +31 10 7042693
Room: Gs-20

From mej at lbl.gov Fri May 4 19:05:30 2012
From: mej at lbl.gov (Michael Jennings)
Date: Fri, 4 May 2012 18:05:30 -0700
Subject: [torqueusers] torque 4.0.1 snapshot build issues
In-Reply-To: <4F967600.8080606@ugent.be>
References: <4F967600.8080606@ugent.be>
Message-ID: <20120505010529.GG9750@lbl.gov>

On Tuesday, 24 April 2012, at 11:44:32 (+0200), Stijn De Weirdt wrote:

> however, i'm stuck building the rpms:
> a. the torque.spec file isn't that snapshot friendly (you can't
> override tarversion from commandline etc etc).

That's because you're not supposed to. ;-) If you downloaded a
snapshot tarball, just do:

    rpmbuild -ta torque-4.0.1-snap.201204031702.tar.gz

Should work fine.

If you're working from SVN, run "make snap" and then "rpmbuild -ta" on
the resulting tarball. Unfortunately, on 4.0-fixes, you'll have to
apply the attached patch first.
Some of the init scripts were listed twice -- once in a Makefile.am
that knew how to build them...and once in one that didn't. :-(

> b. building the spec file with drmaa support fails when making the
> src/drmaa/test with
>
> ../../../src/drmaa/src/.libs/libdrmaa.so: undefined reference to
> `pbs_statjob_err'
>
>
> the makefile doesn't link with ../../../src/lib/Libpbs/.libs/libtorque.so

Works for me in SVN and in the 4.0.1 release tarball.

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615
-------------- next part --------------
Index: Makefile.am
===================================================================
--- Makefile.am (revision 6095)
+++ Makefile.am (working copy)
@@ -34,18 +34,6 @@
     buildutils/self-extract-head-sh.in \
     buildutils/torque.spec.in \
     contrib/AddPrivileges \
-    contrib/init.d/debian.pbs_mom \
-    contrib/init.d/debian.pbs_sched \
-    contrib/init.d/debian.pbs_server \
-    contrib/init.d/debian.trqauthd \
-    contrib/init.d/pbs_mom \
-    contrib/init.d/pbs_sched \
-    contrib/init.d/pbs_server \
-    contrib/init.d/trqauthd \
-    contrib/init.d/suse.pbs_mom \
-    contrib/init.d/suse.pbs_sched \
-    contrib/init.d/suse.pbs_server \
-    contrib/init.d/suse.trqauthd \
     contrib/hwloc_install.sh \
     contrib/mom_gencfg \
     contrib/pam_authuser.tar.gz \

From jujj603 at gmail.com Fri May 4 19:55:02 2012
From: jujj603 at gmail.com (Ju JiaJia)
Date: Sat, 5 May 2012 09:55:02 +0800
Subject: [torqueusers] meaning of output of qstat -Qf
In-Reply-To:
References:
Message-ID:

I think you misunderstood what I said. I mean settings like these:

    resources_min.nodect = 1
    resources_default.nodes = 80

I checked the documentation, and both nodect and nodes are listed for all
of the resources_ attributes:

    Resources may include one or more of the following: arch, mem, *nodes*,
    ncpus, *nodect*, pvmem, and walltime.
You can find it here:
http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml

Maybe nodes is a synonym for nodect, at least intuitively.

Thanks for your reply.

On Sat, May 5, 2012 at 12:26 AM, Ken Nielson wrote:

> On Thu, May 3, 2012 at 11:22 PM, Ju JiaJia wrote:
>
>> Hi
>> I checked the documentation, but i still don't know what
>> resources_min.nodect mean, Can anyone explain this ?
>> And What's the difference between nodect and nodes ?
>>
>> Any reply will be appreciated.
>>
> resources_min.nodect sets the minimum number of nodes need in a job before
> the job can be eligible for the queue. Nodes are different then procs. A
> node is a host. So for example qsub -l nodes=2:ppn=4 job.sh requests 2
> nodes. The nodect is 2. Total procs or cores for the job is 4. (2 nodes X 2
> procs/node).
>
> Regards
>
> Ken

From s.breedveld at erasmusmc.nl Sat May 5 02:59:38 2012
From: s.breedveld at erasmusmc.nl (Sebastiaan Breedveld)
Date: Sat, 05 May 2012 10:59:38 +0200
Subject: [torqueusers] Simple Torque+Maui setup: jobs stay queued, no resources
In-Reply-To:
References: <4FA133F2.80303@erasmusmc.nl> <2445104.utCSrFlz9W@newton.cc.uit.no> <4FA393B1.5000108@erasmusmc.nl> <4FA42581.4010901@erasmusmc.nl>
Message-ID: <4FA4EBFA.2040408@erasmusmc.nl>

On 05/05/2012 04:07 AM, Ju JiaJia wrote:
> Can you run qsub -q batch -l pvmem=10mb,mem=10mb test-script.sh with
> both normal user and root ?
>

The scheduler does not allow me to submit jobs as root.
As a normal user, the job is still not executed:

checking job 63 (RM job '63.testing.azr.nl')

State: Idle EState: Deferred
Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT
WallTime: 00:00:00 of 6:00:00
SubmitTime: Sat May 5 07:53:05
(Time Queued Total: 3:04:41 Eligible: 00:00:00)

Total Tasks: 1

Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 10M
Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1 MEM: 10M SWAP: 10M
NodeAccess: SHARED
NodeCount: 1

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
SystemQueueTime: Sat May 5 10:54:09
Flags: RESTARTABLE

job is deferred. Reason: NoResources (cannot create reservation for job '63' (intital reservation attempt)
)
Holds: Defer (hold reason: NoResources)
PE: 1.00 StartPriority: 3
cannot select job 63 for partition DEFAULT (job hold active)

> On Sat, May 5, 2012 at 2:52 AM, Sebastiaan Breedveld wrote:
>
> Dear Ju,
>
> On 05/04/2012 01:33 PM, Ju JiaJia wrote:
>> How many processes and nodes used in you test-script.sh ?
> Nothing special defined:
>
> #!/bin/bash
> #
> # This is the unique output file for this job
> OUTPUTFILE=test_output_${PBS_JOBID}.log
> # Print
> echo "Output for this job:" > $OUTPUTFILE
> # Do something
> sleep 1m
> # Get a list of environment variables
> date >> $OUTPUTFILE
> # Do something more
> sleep 5s
> # quit
> exit 0
>
>> Check maui.cfg to see whether you have enough right to run multi
>> processes. If you don't, you job will just hang and without any
>> error messages.
>> In maui.cfg, something like this:
>> USERCFG[your-user-name] MAXJOB=1 MAXQUEUEDJOB=1 MAXNODE=1 MAXPROC=1
>> and default you can only run one process on one node.
>>
> There are no lines like that.
The maui.cfg is untouched: > > > # maui.cfg 3.3.1 > SERVERHOST testing > ADMIN1 root > RMCFG[ULURU-TESTING] TYPE=PBS > AMCFG[bank] TYPE=NONE > > RMPOLLINTERVAL 00:00:30 > SERVERPORT 42559 > SERVERMODE NORMAL > LOGFILE maui.log > LOGFILEMAXSIZE 10000000 > LOGLEVEL 3 > QUEUETIMEWEIGHT 1 > BACKFILLPOLICY FIRSTFIT > RESERVATIONPOLICY CURRENTHIGHEST > NODEALLOCATIONPOLICY MINRESOURCE > > Sincerely, > Sebastiaan > > >> On Fri, May 4, 2012 at 4:30 PM, Sebastiaan Breedveld >> > wrote: >> >> Dear Roy, >> >> On 05/03/2012 01:51 PM, Roy Dragseth wrote: >> > It seems like you are trying to submit a job wich requires >> 15 GB pvmem on a >> > nodes that have less than 3 GB total virtual memory. Maui >> also considers >> > specified memory parameters when scheduling jobs. >> Thank you for the hint. I have now set the default resources >> for memory >> to 1b in the queue: >> >> resources_max.ncpus = 8 >> resources_max.nodect = 10 >> resources_max.nodes = 2 >> resources_min.ncpus = 0 >> resources_default.mem = 1b >> resources_default.ncpus = 1 >> resources_default.neednodes = 1:ppn=1 >> resources_default.nodect = 1 >> resources_default.nodes = 1 >> resources_default.pvmem = 1b >> >> and additionally submitted a job setting low requirements: >> >> $ qsub -q batch -l pvmem=1,mem=1 test-script.sh >> >> Unfortunately, the job still does not run: >> >> # checkjob -v 61 >> >> >> checking job 61 (RM job '61.testing.azr.nl >> ') >> >> State: Idle EState: Deferred >> Creds: user:sebastiaan group:sebastiaan class:batch >> qos:DEFAULT >> WallTime: 00:00:00 of 6:00:00 >> SubmitTime: Fri May 4 10:27:03 >> (Time Queued Total: 00:01:34 Eligible: 00:00:01) >> >> Total Tasks: 1 >> >> Req[0] TaskCount: 1 Partition: ALL >> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 1M >> Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] >> Exec: '' ExecSize: 0 ImageSize: 0 >> Dedicated Resources Per Task: PROCS: 1 MEM: 1M SWAP: 1M >> NodeAccess: SHARED >> NodeCount: 1 >> >> >> IWD: [NONE] Executable: [NONE] >> 
Bypass: 0 StartCount: 0 >> PartitionMask: [ALL]u >> Flags: RESTARTABLE >> >> job is deferred. Reason: NoResources (cannot create >> reservation for >> job '61' (intital reservation attempt) >> ) >> Holds: Defer (hold reason: NoResources) >> PE: 1.00 StartPriority: 1 >> cannot select job 61 for partition DEFAULT (job hold active) >> >> >> Any other hints? >> >> Thanks in advance, >> Sebastiaan >> >> >> > r. >> > >> > >> > On Wednesday, May 02, 2012 15:17:38 Sebastiaan Breedveld wrote: >> >> Dear list, >> >> >> >> This may be a Maui issue, but the Maui list seems dead :( >> >> >> >> >> >> I am trying to setup a very basic Torque+Maui system. I am >> running a >> >> Torque cluster for a year now, and wanted to improve the >> scheduling with >> >> Maui. To this end, I installed a fresh test-system, with >> server and node >> >> on a single computer. >> >> >> >> Torque version: 2.4.16 >> >> Maui version: 3.3.1 >> >> uname: Linux testing 3.2.0-20-generic #33-Ubuntu SMP Tue >> Mar 27 16:42:26 >> >> UTC 2012 x86_64 x86_64 x86_64 GNU/Linux >> >> >> >> >> >> >> >> I was able to run (simple) jobs with the Torque scheduler. >> When I >> >> replaced the scheduler with Maui, jobs stay queued. Jobs >> are submitted by: >> >> >> >> $ qsub -q batch test-script.sh >> >> >> >> where test-script.sh is nothing more than a 'sleep 1m' >> script. 
Checking >> >> the job: >> >> >> >> # checkjob -v 55 >> >> checking job 55 (RM job '55.testing.azr.nl >> ') >> >> >> >> State: Idle EState: Deferred >> >> Creds: user:sebastiaan group:sebastiaan class:batch >> qos:DEFAULT >> >> WallTime: 00:00:00 of 6:00:00 >> >> SubmitTime: Thu Apr 5 13:21:33 >> >> (Time Queued Total: 00:00:32 Eligible: 00:00:01) >> >> >> >> Total Tasks: 1 >> >> >> >> Req[0] TaskCount: 1 Partition: ALL >> >> Network: [NONE] Memory>= 0 Disk>= 0 Swap>= 15G >> >> Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] >> >> Exec: '' ExecSize: 0 ImageSize: 0 >> >> Dedicated Resources Per Task: PROCS: 1 MEM: 2000M SWAP: 15G >> >> NodeAccess: SHARED >> >> NodeCount: 1 >> >> >> >> >> >> IWD: [NONE] Executable: [NONE] >> >> Bypass: 0 StartCount: 0 >> >> PartitionMask: [ALL] >> >> Flags: RESTARTABLE >> >> >> >> job is deferred. Reason: NoResources (cannot create >> reservation for >> >> job '55' (intital reservation attempt) >> >> ) >> >> Holds: Defer (hold reason: NoResources) >> >> PE: 16.03 StartPriority: 1 >> >> cannot select job 55 for partition DEFAULT (job hold active) >> >> >> >> >> >> >> >> show that there are no resources available. 
The node is >> free, and unloaded: >> >> >> >> # checknode testing >> >> >> >> >> >> checking node testing.azr.nl >> >> >> >> State: Idle (in current state for 2:23:54) >> >> Configured Resources: PROCS: 2 MEM: 984M SWAP: 1996M >> DISK: 1M >> >> Utilized Resources: SWAP: 149M >> >> Dedicated Resources: [NONE] >> >> Opsys: linux Arch: [NONE] >> >> Speed: 1.00 Load: 0.050 >> >> Network: [DEFAULT] >> >> Features: [NONE] >> >> Attributes: [Batch] >> >> Classes: [batch 2:2] >> >> >> >> Total Time: 16:11:49 Up: 16:11:49 (100.00%) Active: >> 00:01:00 (0.10%) >> >> >> >> Reservations: >> >> NOTE: no reservations on node >> >> >> >> >> >> >> >> When the job is added, maui.log shows this: >> >> 04/05 13:21:34 MPBSJobLoad(55,55.testing.azr.nl >> ,J,TaskList,0) >> >> 04/05 13:21:34 MReqCreate(55,SrcRQ,DstRQ,DoCreate) >> >> 04/05 13:21:34 INFO: processing node request line '1' >> >> 04/05 13:21:34 MJobSetCreds(55,sebastiaan,sebastiaan,) >> >> 04/05 13:21:34 INFO: default QOS for job 55 set to >> DEFAULT(0) >> >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) >> >> 04/05 13:21:34 INFO: default QOS for job 55 set to >> DEFAULT(0) >> >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) >> >> 04/05 13:21:34 INFO: default QOS for job 55 set to >> DEFAULT(0) >> >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) >> >> 04/05 13:21:34 INFO: job '55' loaded: 1 sebastiaan >> sebastiaan >> >> 21600 Idle 0 1333624893 [NONE] [NONE] [NONE]>= >> 0>= 0 >> >> [1][ppn=1] 1333624894 >> >> 04/05 13:21:34 INFO: 12 PBS jobs detected on RM TESTING >> >> 04/05 13:21:34 INFO: jobs detected: 12 >> >> 04/05 13:21:34 MStatClearUsage(node,Active) >> >> 04/05 13:21:34 MClusterUpdateNodeState() >> >> 04/05 13:21:34 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg) >> >> 04/05 13:21:34 INFO: job '40' Priority: 22 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '41' Priority: 22 >> >> 04/05 
13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '42' Priority: 22 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '44' Priority: 22 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '45' Priority: 22 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '47' Priority: 22 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '48' Priority: 16 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '49' Priority: 12 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '52' Priority: 8 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '53' Priority: 1 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '54' Priority: 60 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '55' Priority: 1 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 
0(00.0) >> >> 04/05 13:21:34 MStatClearUsage([NONE],Active) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 >> [EState: 11] >> >> 04/05 13:21:34 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg) >> >> 04/05 13:21:34 INFO: job '40' Priority: 22 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '41' Priority: 22 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '42' Priority: 22 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '44' Priority: 22 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '45' Priority: 22 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '47' Priority: 22 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '48' Priority: 16 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '49' Priority: 12 >> 
>> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '52' Priority: 8 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '53' Priority: 1 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '54' Priority: 60 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 INFO: job '55' Priority: 1 >> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >> 0(00.0) Attr: >> >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: >> 0(00.0) Us: >> >> 0(00.0) >> >> 04/05 13:21:34 MStatClearUsage([NONE],Idle) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 MResDestroy(NULL) >> >> 04/05 13:21:34 INFO: total jobs selected (ALL): 1/12 >> [EState: 11] >> >> 04/05 13:21:34 >> >> >> MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE) >> >> 04/05 13:21:34 INFO: total jobs selected in partition >> ALL: 1/1 >> >> 04/05 13:21:34 MQueueScheduleRJobs(Q) >> >> 04/05 13:21:34 >> >> >> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) >> >> 04/05 13:21:34 INFO: total jobs selected in partition >> ALL: 1/1 >> >> 04/05 13:21:34 >> >> >> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) >> >> 04/05 13:21:34 INFO: total jobs selected in partition >> 
DEFAULT: 1/1 >> >> 04/05 13:21:34 MQueueScheduleIJobs(Q,DEFAULT) >> >> 04/05 13:21:34 INFO: 0 feasible tasks found for job >> 55:0 in >> >> partition DEFAULT (1 Needed) >> >> 04/05 13:21:34 MJobPReserve(55,DEFAULT,ResCount,ResCountRej) >> >> 04/05 13:21:34 MJobReserve(55,Priority) >> >> 04/05 13:21:34 ALERT: job 55 cannot run in any partition >> >> 04/05 13:21:34 ALERT: cannot create new reservation for >> job 55 >> >> (shape[1] 1) >> >> 04/05 13:21:34 ALERT: cannot create new reservation for >> job 55 >> >> 04/05 13:21:34 >> MJobSetHold(55,16,1:00:00,NoResources,cannot create >> >> reservation for job '55' (intital reservation attempt) >> >> ) >> >> 04/05 13:21:34 ALERT: job '55' cannot run (deferring >> job for 3600 >> >> seconds) >> >> 04/05 13:21:34 WARNING: cannot reserve priority job '55' >> >> Active Jobs------ >> >> ------------------ >> >> 04/05 13:21:34 INFO: resources available after >> scheduling: N: 1 P: 2 >> >> 04/05 13:21:34 >> >> >> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) >> >> 04/05 13:21:34 INFO: total jobs selected in partition >> DEFAULT: 0/1 >> >> [EState: 1] >> >> 04/05 13:21:34 >> >> >> MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE) >> >> 04/05 13:21:34 INFO: total jobs selected in partition >> ALL: 0/1 >> >> [EState: 1] >> >> 04/05 13:21:34 >> >> >> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) >> >> 04/05 13:21:34 INFO: total jobs selected in partition >> ALL: 0/1 >> >> [EState: 1] >> >> 04/05 13:21:34 MSchedUpdateStats() >> >> 04/05 13:21:34 INFO: iteration: 288 scheduling >> time: 0.002 seconds >> >> 04/05 13:21:34 MResUpdateStats() >> >> 04/05 13:21:34 INFO: current util[288]: 0/1 (0.00%) >> PH: 0.00% >> >> active jobs: 0 of 2 (completed: 1) >> >> 04/05 13:21:34 MQueueCheckStatus() >> >> 04/05 13:21:34 MNodeCheckStatus() >> >> 04/05 13:21:34 MUClearChild(PID) >> >> 04/05 13:21:34 INFO: scheduling complete. 
sleeping 30 >> seconds >> >> >> >> >> >> I think the relevant line is: >> >> 04/05 13:21:34 INFO: 0 feasible tasks found for job >> 55:0 in >> >> partition DEFAULT (1 Needed) >> >> >> >> but I have no idea how to make a feasible task for the >> job. I have tried >> >> queueing with -l nodes=1:ppn=1 -l walltime=2:00:00, etc. >> but none seem >> >> to have had effect. >> >> >> >> >> >> >> >> Torque config. I have tried setting different attributes >> to the queue >> >> properties, hoping that it would have some effect: >> >> # qmgr -c "p s" >> >> # >> >> # Create queues and set their attributes. >> >> # >> >> # >> >> # Create and define queue batch >> >> # >> >> create queue batch >> >> set queue batch queue_type = Execution >> >> set queue batch Priority = 20 >> >> set queue batch max_running = 8 >> >> set queue batch resources_max.ncpus = 8 >> >> set queue batch resources_max.nodect = 10 >> >> set queue batch resources_max.nodes = 2 >> >> set queue batch resources_min.ncpus = 0 >> >> set queue batch resources_default.mem = 2000mb >> >> set queue batch resources_default.ncpus = 1 >> >> set queue batch resources_default.neednodes = 1:ppn=1 >> >> set queue batch resources_default.nodect = 1 >> >> set queue batch resources_default.nodes = 1 >> >> set queue batch resources_default.pvmem = 16000mb >> >> set queue batch resources_default.walltime = 06:00:00 >> >> set queue batch enabled = True >> >> set queue batch started = True >> >> # >> >> # Set server attributes. 
>> >> # >> >> set server scheduling = True >> >> set server acl_hosts = testing.azr.nl >> >> set server log_events = 511 >> >> set server mail_from = adm >> >> set server resources_available.nodect = 10 >> >> set server scheduler_iteration = 600 >> >> set server node_check_rate = 150 >> >> set server tcp_timeout = 6 >> >> set server next_job_number = 56 >> >> >> >> >> >> Maui configuration, untouched: >> >> # maui.cfg 3.3.1 >> >> >> >> SERVERHOST testing >> >> # primary admin must be first in list >> >> ADMIN1 root >> >> >> >> # Resource Manager Definition >> >> >> >> RMCFG[TESTING] TYPE=PBS >> >> >> >> # Allocation Manager Definition >> >> >> >> AMCFG[bank] TYPE=NONE >> >> >> >> # full parameter docs at >> http://supercluster.org/mauidocs/a.fparameters.html >> >> # use the 'schedctl -l' command to display current >> configuration >> >> >> >> RMPOLLINTERVAL 00:00:30 >> >> >> >> SERVERPORT 42559 >> >> SERVERMODE NORMAL >> >> >> >> # Admin: http://supercluster.org/mauidocs/a.esecurity.html >> >> >> >> >> >> LOGFILE maui.log >> >> LOGFILEMAXSIZE 10000000 >> >> LOGLEVEL 3 >> >> >> >> # Job Priority: >> http://supercluster.org/mauidocs/5.1jobprioritization.html >> >> >> >> QUEUETIMEWEIGHT 1 >> >> >> >> # FairShare: >> http://supercluster.org/mauidocs/6.3fairshare.html >> >> >> >> #FSPOLICY PSDEDICATED >> >> #FSDEPTH 7 >> >> #FSINTERVAL 86400 >> >> #FSDECAY 0.80 >> >> >> >> # Throttling Policies: >> >> http://supercluster.org/mauidocs/6.2throttlingpolicies.html >> >> >> >> # NONE SPECIFIED >> >> >> >> # Backfill: http://supercluster.org/mauidocs/8.2backfill.html >> >> >> >> BACKFILLPOLICY FIRSTFIT >> >> RESERVATIONPOLICY CURRENTHIGHEST >> >> >> >> # Node Allocation: >> http://supercluster.org/mauidocs/5.2nodeallocation.html >> >> >> >> NODEALLOCATIONPOLICY MINRESOURCE >> >> >> >> # QOS: http://supercluster.org/mauidocs/7.3qos.html >> >> >> >> # QOSCFG[hi] PRIORITY=100 XFTARGET=100 >> FLAGS=PREEMPTOR:IGNMAXJOB >> >> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE >> >> >> >> 
# Standing Reservations: >> >> >> http://supercluster.org/mauidocs/7.1.3standingreservations.html >> >> >> >> # SRSTARTTIME[test] 8:00:00 >> >> # SRENDTIME[test] 17:00:00 >> >> # SRDAYS[test] MON TUE WED THU FRI >> >> # SRTASKCOUNT[test] 20 >> >> # SRMAXTIME[test] 0:30:00 >> >> >> >> # Creds: >> http://supercluster.org/mauidocs/6.1fairnessoverview.html >> >> >> >> # USERCFG[DEFAULT] FSTARGET=25.0 >> >> # USERCFG[john] PRIORITY=100 FSTARGET=10.0- >> >> # GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi >> >> # CLASSCFG[batch] FLAGS=PREEMPTEE >> >> # CLASSCFG[interactive] FLAGS=PREEMPTOR >> >> >> >> >> >> >> >> Any ideas? >> >> >> >> Thanks in advance, >> >> Sebastiaan >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120505/9f768085/attachment-0001.html From Gareth.Williams at csiro.au Sat May 5 04:12:47 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Sat, 5 May 2012 20:12:47 +1000 Subject: [torqueusers] meaning of output of qstat -Qf In-Reply-To: References: Message-ID: <007DECE986B47F4EABF823C1FBB19C62010503312ADB@exvic-mbx04.nexus.csiro.au> > I check the document there?are nodect and nodes of all resources_ attributes. Resources may include one or more of the following: arch, mem, nodes, ncpus, nodect, pvmem, and walltime. You can find it here : http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml Maybe nodes is a synonymous of nodect, at least on intuition. ? My understanding is as follows. 'nodes' is a string. 
As such it makes no sense to have a min and max limit (unless they are the same, so that nodes must be a given value).

'nodect' is a corresponding automatic resource that is an integer count of the number of nodes. You can set a numerical max or min limit on it and jobs should get rejected or routed accordingly.

I've never found nodect to be useful and suggest not setting anything associated with it.

Recent versions of torque also have a (transient) automatic resource 'procct' which is a bit like nodect but counts processors (or slots or cores if you prefer). That is a more useful quantity to route or limit jobs on a multi-core system.

Gareth

From bircoph at gmail.com Sat May 5 07:05:01 2012
From: bircoph at gmail.com (Andrew Savchenko)
Date: Sat, 5 May 2012 17:05:01 +0400
Subject: [torqueusers] How to disenage email notifications?
In-Reply-To: <4FA429B6.4050004@erasmusmc.nl>
References: <20120504223202.1d5b76b3.bircoph@gmail.com> <4FA429B6.4050004@erasmusmc.nl>
Message-ID: <20120505170501.5d98bc0a.bircoph@gmail.com>

Hello,

On Fri, 04 May 2012 21:10:46 +0200 Sebastiaan Breedveld wrote:
> On 05/04/2012 08:32 PM, Andrew Savchenko wrote:
> > I want to completely disable Torque email notifications for all
> > users. qsub_filter is not an option, because qsub -m CLI argument
> > overrides #PBS directive in the qsub script. Complete MTA removal
> > from the system is also undesired. I can't find any config or qmgr
> > option for this task...
> You can do that by setting mail_domain to never:
> set server mail_domain = never
>
> I once read this is an undocumented option, but works well.

Thanks, this works. I also checked the torque-3.0.3 administration guide: it is documented there, I just missed it :/

Best regards,
Andrew Savchenko

From bircoph at gmail.com Sat May 5 10:59:57 2012
From: bircoph at gmail.com (Andrew Savchenko)
Date: Sat, 5 May 2012 20:59:57 +0400
Subject: [torqueusers] How to set a specific queue order in qstat?
Message-ID: <20120505205957.55b88c3f.bircoph@gmail.com>

Hello,

I want to set a specific list order for the queues returned by qstat -Q. I can do this temporarily by deleting all queues and recreating them in the desired order, but this change is not preserved after a pbs_server restart. I use torque-3.0.5.

Best regards,
Andrew Savchenko

From bircoph at gmail.com Sat May 5 13:23:50 2012
From: bircoph at gmail.com (Andrew Savchenko)
Date: Sat, 5 May 2012 23:23:50 +0400
Subject: [torqueusers] qterm exits with code 141
Message-ID: <20120505232350.7208586d.bircoph@gmail.com>

Hello,

when I stop pbs_server with any shutdown type:

    qterm -t quick
    qterm -t immediate
    qterm -t delay

it exits with code 141. This confuses wrapper scripts. The exit itself looks normal, though: there are no error messages on the console or in the server log file (with the default log_events = 511).

I can find neither the reason for nor the meaning of this exit code. It doesn't fit into the PBS error codes table (Appendix D.2 of the Torque Administration Guide), and 141 is not a valid system errno code.

My torque version is 3.0.5.

Best regards,
Andrew Savchenko
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120505/06dedc5d/attachment.bin From jujj603 at gmail.com Sat May 5 20:15:22 2012 From: jujj603 at gmail.com (Ju JiaJia) Date: Sun, 6 May 2012 10:15:22 +0800 Subject: [torqueusers] meaning of output of qstat -Qf In-Reply-To: <007DECE986B47F4EABF823C1FBB19C62010503312ADB@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C62010503312ADB@exvic-mbx04.nexus.csiro.au> Message-ID: Thx for your reply. I take nodes and nodect as the same thing currently. Thx for you advice, again. On Sat, May 5, 2012 at 6:12 PM, wrote: > > I check the document there are nodect and nodes of all resources_ max, available etc> attributes. > Resources may include one or more of the following: arch, mem, nodes, > ncpus, nodect, pvmem, and walltime. > You can find it here : > http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml > Maybe nodes is a synonymous of nodect, at least on intuition. > > My understanding is as follows. > > 'nodes' is a string. As such it makes no sense to have a min and max limit > (unless they are the same so that nodes must be a given value). > > 'nodect' is a corresponding automatic resource that is an integer count of > the number of nodes. You can set a numerical max or min limit on it and > jobs should get rejected or routed accordingly. > > I've never found nodect to be useful and suggest not setting anything > associated with it. > > Recent versions of torque also have a (transient) automatic resource > 'procct' which is a bit like nodect but counts processors (or slots or > cores if you prefer). That is a more useful quantity to route or limit jobs > on a multi-core system. 
>
> Gareth
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From Adrian.Sevcenco at cern.ch Mon May 7 06:16:18 2012
From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco)
Date: Mon, 7 May 2012 15:16:18 +0300
Subject: [torqueusers] torque :: jobs on hold
Message-ID: <4FA7BD12.5070505@cern.ch>

Hi!
After changing the interfaces on the torque server, all my jobs are stuck on hold and I have no idea why!

Checking a node gives me:

root at alien: torque # checknode alien-0-2

checking node alien-0-2

State:      Idle  (in current state for 1:04:45)
Configured Resources: PROCS: 8  MEM: 15G  SWAP: 19G  DISK: 1M
Utilized   Resources: [NONE]
Dedicated  Resources: [NONE]
Opsys:      linux  Arch: [NONE]
Speed:      1.00  Load: 0.000
Network:    [DEFAULT]
Features:   [NONE]
Attributes: [Batch]
Classes:    [ops 8:8][alien 8:8][seeops 8:8][alice 8:8][dteam 8:8]

Total Time: INFINITY  Up: INFINITY (96.86%)  Active: INFINITY (81.05%)

Reservations:
NOTE: no reservations on node

root at alien: torque # momctl -d 3 -p 15003 -h alien-0-2

Host: alien-0-2.local/alien-0-2.local   Version: 2.4.11   PID: 5831
Server[0]: alien.local (172.18.0.1:15001)
  Init Msgs Received:     0 hellos/1 cluster-addrs
  Init Msgs Sent:         1 hellos
  Last Msg From Server:   4 seconds (CLUSTER_ADDRS)
  Last Msg To Server:     3 seconds
HomeDirectory:            /opt/torque/mom_priv
stdout/stderr spool directory: '/opt/torque/spool/' (2701791 blocks available)
NOTE: syslog enabled
MOM active:               6 seconds
Check Poll Time:          45 seconds
Server Update Interval:   45 seconds
LogLevel:                 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:      RPP
MemLocked:                TRUE (mlock)
TCP Timeout:              20 seconds
Prolog:                   /opt/torque/mom_priv/prologue (disabled)
Alarm Time:               0 of 10 seconds
Trusted Client List:
172.18.3.214,172.18.3.215,172.18.3.216,172.18.3.217,172.18.3.219,172.18.3.220,172.18.3.221,172.18.3.222,172.18.3.223,172.18.3.224,172.18.3.225,172.18.3.226,172.18.3.227,172.18.3.228,172.18.3.229,172.18.3.230,172.18.3.231,172.18.3.232,172.18.3.233,172.18.3.235,172.18.3.234,172.18.3.236,172.18.3.237,172.18.3.238,172.18.3.239,172.18.3.241,172.18.3.240,172.18.3.242,172.18.3.243,172.18.3.244,172.18.3.245,172.18.3.246,172.18.3.247,172.18.3.248,172.18.3.249,172.18.3.250,172.18.3.251,172.18.0.1,172.18.3.253,172.18.3.254,127.0.0.1 Copy Command: /usr/bin/scp -rpB NOTE: no local jobs detected diagnostics complete root at alien: torque # tracejob 42900 /opt/torque/mom_logs/20120507: No such file or directory /opt/torque/sched_logs/20120507: No such file or directory Job: 42900.alien.spacescience.ro 05/07/2012 14:14:46 S enqueuing into alien, state 1 hop 1 05/07/2012 14:14:46 S Job Queued at request of aliprod at alien.spacescience.ro, owner = aliprod at alien.spacescience.ro, job name = AliEn.7295.894, queue = alien 05/07/2012 14:14:46 S Job Modified at request of maui at alien.spacescience.ro 05/07/2012 14:14:46 S Job Run at request of maui at alien.spacescience.ro 05/07/2012 14:14:46 A queue=alien 05/07/2012 14:44:46 S Job Modified at request of maui at alien.spacescience.ro 05/07/2012 14:44:46 S Job Run at request of maui at alien.spacescience.ro so ... how can i further debug this problem? i am quite frustrated that all my jobs are on hold and i have no idea why ... Thanks! Adrian -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s
Type: application/pkcs7-signature
Size: 1997 bytes
Desc: S/MIME Cryptographic Signature
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120507/db42a89d/attachment-0001.bin

From luiz at if.usp.br Mon May 7 08:33:58 2012
From: luiz at if.usp.br (Luiz Carlos dos Santos)
Date: Mon, 7 May 2012 11:33:58 -0300
Subject: [torqueusers] queue
Message-ID: <012101cd2c5e$7057c280$51074780$@if.usp.br>

Hi all,

I created a queue "long3" directed to a specific node, but it is not working correctly.
When I submit a job I get the following error message:

---------------------------------------------------------------------------

[luiz at wilson pbsteste]$ qsub PBS_sub_script_test

qsub: Job rejected by all possible destinations (check syntax, queue resources, ...)

---------------------------------------------------------------------------

Please, could you help me?

The configuration of the queue is included below.

#
create queue long3
set queue long3 queue_type = Execution
set queue long3 Priority = 60
set queue long3 max_running = 10
set queue long3 acl_host_enable = True
set queue long3 acl_hosts = wilson03
set queue long3 resources_max.cput = 24:00:00
set queue long3 resources_min.cput = 00:00:01
set queue long3 resources_default.cput = 12:00:00
set queue long3 enabled = True
set queue long3 started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default max_running = 10
set queue default route_destinations = long3
set queue default enabled = True
set queue default started = True

-------------------------------------------------------

Thanks,

Luiz Carlos dos Santos
Analista de Sistemas - IFUSP/FMT
Instituto de Física da USP
Departamento de Física dos Materiais e Mecânica
Pça. do Oceanográfico - Trav E, s/nº
Edifício Alessandro Volta, Bloco C - sala 112
CEP 05508-120 -
São Paulo SP
Fone: (11) 3091-6784 / Fax: (11) 3091-6831
E-mail: luiz at if.usp.br

From jujj603 at gmail.com Mon May 7 08:45:29 2012
From: jujj603 at gmail.com (Ju JiaJia)
Date: Mon, 7 May 2012 22:45:29 +0800
Subject: [torqueusers] queue
In-Reply-To: <012101cd2c5e$7057c280$51074780$@if.usp.br>
References: <012101cd2c5e$7057c280$51074780$@if.usp.br>
Message-ID:

What's inside PBS_sub_script_test? And which scheduler are you using, Maui? If so, what's inside maui.cfg?

On Mon, May 7, 2012 at 10:33 PM, Luiz Carlos dos Santos wrote:

> Hi all,
>
> I create a queue "long3" directed for a determined node, but it is not
> working correctly.
> When I execute I have the next error message:
>
> ---------------------------------------------------------------------------
>
> [luiz at wilson pbsteste]$ qsub PBS_sub_script_test
>
> qsub: Job rejected by all possible destinations (check syntax, queue
> resources, ...)
>
> ---------------------------------------------------------------------------
>
> Please, could you help me?
>
> I include down the configuration of the queue.
> #
> create queue long3
> set queue long3 queue_type = Execution
> set queue long3 Priority = 60
> set queue long3 max_running = 10
> set queue long3 acl_host_enable = True
> set queue long3 acl_hosts = wilson03
> set queue long3 resources_max.cput = 24:00:00
> set queue long3 resources_min.cput = 00:00:01
> set queue long3 resources_default.cput = 12:00:00
> set queue long3 enabled = True
> set queue long3 started = True
> #
> # Create and define queue default
> #
> create queue default
> set queue default queue_type = Route
> set queue default max_running = 10
> set queue default route_destinations = long3
> set queue default enabled = True
> set queue default started = True
>
> -------------------------------------------------------
>
> Thanks,
>
> Luiz Carlos dos Santos
> Analista de Sistemas - IFUSP/FMT
> Instituto de Física da USP
> Departamento de Física dos Materiais e Mecânica
> Pça. do Oceanográfico - Trav E, s/nº
> Edifício Alessandro Volta, Bloco C - sala 112
> CEP 05508-120 - São Paulo SP
> Fone: (11) 3091-6784 / Fax: (11) 3091-6831
> E-mail: luiz at if.usp.br
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From taras.shapovalov at brightcomputing.com Sun May 6 12:58:20 2012
From: taras.shapovalov at brightcomputing.com (Taras Shapovalov)
Date: Sun, 6 May 2012 20:58:20 +0200
Subject: [torqueusers] Job get stuck in obit state
Message-ID:

Hi,

I have a cluster configuration where some compute nodes are connected to the head node via VPN (the only option for them) and some are connected directly (via Ethernet). If I submit a very simple job to any "non-VPN" node, it finishes successfully. But if I submit a job to any "VPN" node, it gets stuck in the running state (according to qstat) for a long time, which is unacceptable. Further investigation showed that the job is in JOB_SUBSTATE_OBIT on the compute node (no epilogue scripts are used) and that pbs_mom is trying to tell pbs_server that the job has finished, but the connection fails.

This is how it looks in the mom log (here ts-trunk-sl6tor-tor is the head node hostname):

...
05/06/2012 20:02:31;0008; pbs_mom;Job;process_request;request type StatusJob from host ts-trunk-sl6tor-tor.cm.cluster allowed 05/06/2012 20:02:58;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to ts-trunk-sl6tor-tor.cm.cluster 05/06/2012 20:02:58;0080; pbs_mom;Req;post_epilogue;preparing obit message for job 32.ts-trunk-sl6tor-tor.cm.cluster 05/06/2012 20:03:17;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, 05/06/2012 20:03:17;0008; pbs_mom;Job;process_request;request type StatusJob from host ts-trunk-sl6tor-tor.cm.cluster allowed 05/06/2012 20:03:43;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to ts-trunk-sl6tor-tor.cm.cluster 05/06/2012 20:03:43;0080; pbs_mom;Req;post_epilogue;preparing obit message for job 32.ts-trunk-sl6tor-tor.cm.cluster 05/06/2012 20:04:02;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, 05/06/2012 20:04:02;0008; pbs_mom;Job;process_request;request type StatusJob from host ts-trunk-sl6tor-tor.cm.cluster allowed 05/06/2012 20:04:28;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to ts-trunk-sl6tor-tor.cm.cluster 05/06/2012 20:04:28;0080; pbs_mom;Req;post_epilogue;preparing obit message for job 32.ts-trunk-sl6tor-tor.cm.cluster 05/06/2012 20:04:47;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, 05/06/2012 20:04:47;0008; pbs_mom;Job;process_request;request type StatusJob from host ts-trunk-sl6tor-tor.cm.cluster allowed 05/06/2012 20:05:13;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to ts-trunk-sl6tor-tor.cm.cluster 05/06/2012 20:05:13;0080; pbs_mom;Req;post_epilogue;preparing obit message for job 32.ts-trunk-sl6tor-tor.cm.cluster ... And how it looks via strace of pbs_mom: ... 
bind(10, {sa_family=AF_INET, sin_port=htons(633), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(10, {sa_family=AF_INET, sin_port=htons(15001), sin_addr=inet_addr("172.16.255.254")}, 16) = -1 EINPROGRESS (Operation now in progress)
select(11, NULL, [10], NULL, {0, 10000}) = 0 (Timeout)
select(11, NULL, [10], NULL, {0, 10000}) = 0 (Timeout)
nanosleep({0, 1000000}, 0x7fffffffe680) = 0
close(10) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 10
fcntl(10, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
setsockopt(10, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(10, {sa_family=AF_INET, sin_port=htons(634), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(10, {sa_family=AF_INET, sin_port=htons(15001), sin_addr=inet_addr("172.16.255.254")}, 16) = -1 EINPROGRESS (Operation now in progress)
select(11, NULL, [10], NULL, {0, 10000}) = 0 (Timeout)
select(11, NULL, [10], NULL, {0, 10000}) = 0 (Timeout)
...

If I run "telnet 172.16.255.254 15001" from the compute node, the connection is established successfully!

Running "netstat -tapn | grep pbs" on the compute node shows the following:

tcp  0  0 0.0.0.0:15002       0.0.0.0:*             LISTEN    28175/pbs_mom
tcp  0  0 0.0.0.0:15003       0.0.0.0:*             LISTEN    28175/pbs_mom
tcp  0  1 172.16.255.251:541  172.16.255.254:15001  SYN_SENT  28175/pbs_mom

It seems that pbs_mom (172.16.255.251:XXX) is trying to connect to the server (172.16.255.254:15001) but does not receive the right answer (I guess), and then tries again from another local port (XXX).

gdb on the running pbs_mom shows that it is stuck in select():

(gdb) bt
#0 0x00000031bacde8b3 in select () from /lib64/libc.so.6
#1 0x00002aaaaaad62fe in wait_request (waittime=1, SState=0x0) at ../Libnet/net_server.c:457
#2 0x000000000041ecec in main_loop ()
#3 0x000000000041efbe in main ()

I tried TORQUE 2.5.11 and TORQUE 3.0.4 with the same result (I cannot use TORQUE 4).
tracejob shows the following:

05/06/2012 19:27:45 S committing job
05/06/2012 19:27:45 S enqueuing into default, state 1 hop 1
05/06/2012 19:27:45 S dequeuing from default, state QUEUED
05/06/2012 19:27:45 S enqueuing into shortq, state 1 hop 1
05/06/2012 19:27:45 S Reply sent for request type Commit on socket 12
05/06/2012 19:27:45 A queue=default
05/06/2012 19:27:45 A queue=shortq
05/06/2012 19:27:45 S ready to commit job
05/06/2012 19:27:45 S ready to commit job completed
05/06/2012 19:41:23 S Job Run at request of root at ts-trunk-sl6tor-tor.cm.cluster
05/06/2012 19:41:23 S forking in send_job
05/06/2012 19:41:29 S send of job to eu-west-1-director.cm.cluster failed error = 15091
05/06/2012 19:41:29 S entering post_sendmom
05/06/2012 19:41:29 S child reported failure for job after 6 seconds (dest=eu-west-1-director.cm.cluster), rc=1
05/06/2012 19:41:29 S unable to run job, MOM rejected/rc=1
05/06/2012 19:42:21 S Job Run at request of root at ts-trunk-sl6tor-tor.cm.cluster
05/06/2012 19:42:21 S forking in send_job
05/06/2012 19:42:21 S entering post_sendmom
05/06/2012 19:42:21 S child reported success for job after 0 seconds (dest=eu-west-1-director.cm.cluster), rc=0
05/06/2012 19:42:21 A user=cmsupport group=cmsupport jobname=myscript1 queue=shortq ctime=1336325265 qtime=1336325265 etime=1336325265 start=1336326141 owner=cmsupport at ts-trunk-sl6tor-tor.cm.cluster exec_host=eu-west-1-director.cm.cluster/0 Resource_List.neednodes=eu-west-1-director.cm.cluster Resource_List.nodect=1 Resource_List.nodes=eu-west-1-director.cm.cluster

According to tracejob, the job is sometimes rejected several times, but it is always accepted in the end and finishes successfully. After that it sits waiting in some weird state.
tcpdump showed that the server is answering the compute node on port 15001 (here master is another hostname of the head node, and eu-west-1-director is a test compute node):

20:36:50.894880 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    eu-west-1-director.cm.cluster.mdc-portmapper > master.cm.cluster.15001: Flags [R], cksum 0x6df5 (correct), seq 629114634, win 0, length 0
20:36:50.908759 IP (tos 0x0, ttl 64, id 19674, offset 0, flags [DF], proto TCP (6), length 60)
    eu-west-1-director.cm.cluster.asipregistry > master.cm.cluster.15001: Flags [S], cksum 0xa044 (correct), seq 1797822170, win 13600, options [mss 1360,sackOK,TS val 26860032 ecr 0,nop,wscale 6], length 0
20:36:50.908777 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    master.cm.cluster.15001 > eu-west-1-director.cm.cluster.asipregistry: Flags [S.], cksum 0x9778 (correct), seq 3749729188, ack 1797822171, win 13480, options [mss 1360,sackOK,TS val 92328077 ecr 26860032,nop,wscale 6], length 0
20:36:50.916246 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    eu-west-1-director.cm.cluster.hcp-wismar > master.cm.cluster.15001: Flags [R], cksum 0x32d3 (correct), seq 387109016, win 0, length 0
20:36:50.930024 IP (tos 0x0, ttl 64, id 31062, offset 0, flags [DF], proto TCP (6), length 60)
    eu-west-1-director.cm.cluster.realm-rusd > master.cm.cluster.15001: Flags [S], cksum 0xcf1c (correct), seq 802399040, win 13600, options [mss 1360,sackOK,TS val 26860054 ecr 0,nop,wscale 6], length 0
20:36:50.930044 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    master.cm.cluster.15001 > eu-west-1-director.cm.cluster.realm-rusd: Flags [S.], cksum 0xd8c3 (correct), seq 1627439004, ack 802399041, win 13480, options [mss 1360,sackOK,TS val 92328098 ecr 26860054,nop,wscale 6], length 0

In /var/log/messages I see a lot of "pbs_mom: LOG_ERROR::Operation now in progress (115) in post_epilogue".
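
[Editorial aside, not part of the original message.] On Linux, errno 115 is EINPROGRESS, which is what a non-blocking connect() returns while the TCP handshake is still under way; the strace output earlier in this message shows exactly that, followed by two select() calls with a 10 ms timeout each. The sketch below illustrates the general pattern only. It is not TORQUE's code, and the function name connect_with_deadline and its parameters are made up for illustration. One possible reading, offered as a hypothesis rather than a diagnosis, is that a round trip over the VPN longer than the total wait would make every attempt time out even though the server answers.

```python
import errno
import select
import socket


def connect_with_deadline(host, port, wait_seconds):
    """Start a non-blocking TCP connect and wait up to wait_seconds for it.

    Returns 0 on success, errno.ETIMEDOUT if the wait expired before the
    handshake finished, or the socket's pending error code otherwise.
    (Hypothetical helper; assumes Linux errno semantics.)
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        rc = sock.connect_ex((host, port))
        if rc == 0:
            return 0  # connected immediately (can happen on loopback)
        if rc != errno.EINPROGRESS:
            return rc  # immediate failure such as ECONNREFUSED
        # The socket becomes writable once the handshake completes or fails.
        _, writable, _ = select.select([], [sock], [], wait_seconds)
        if not writable:
            return errno.ETIMEDOUT  # gave up before the peer's reply arrived
        # SO_ERROR tells us whether the finished connect actually succeeded.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    finally:
        sock.close()
```

Calling it against the server address from the strace with a short wait (for example 0.02 seconds) and then a longer one (for example 2 seconds) would show whether the outcome depends on how long the caller is willing to wait.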
No errors are found in server_log. pbsnodes shows that the node is up and executing the job. Stopping the firewalls did not help. Recompiling to use unprivileged ports did not help.

So I have no more ideas about how to fix this. Any feedback would be highly appreciated!

--
Taras

From ezhilalanrb at gmail.com Mon May 7 04:05:03 2012
From: ezhilalanrb at gmail.com (Ezhilalan Ramalingam)
Date: Mon, 7 May 2012 11:05:03 +0100
Subject: [torqueusers] torque-4.0 installation
Message-ID: <001301cd2c38$e07ea7c0$a17bf740$@com>

Hi All,

I tried to install torque-4.0 on a SUSE 10.1 Linux PC but ran into the problems below.

I have 5 other SUSE Linux PCs linked to the master PC (linux-01). To start with, I downloaded torque 4.0 (the recent version) and tried to install it.

I followed the steps as per guide 4.0 but got stuck at the 'setup' stage. I noticed that the binaries (qmgr, qterm, etc.) are not installed under /usr/local/bin.

I ran ./configure and the other commands as per the manual, but I am unable to figure out what went wrong.

Could I get some help to figure out this problem?

Regards,
Ezhilalan

linux-01:/home/ezhil/Torque/torque-4.0.0 # ./torque.setup ezhil
initializing TORQUE (admin: ezhil at linux-01.physics)

You have selected to start pbs_server in create mode.
If the server database exists it will be overwritten.
do you wish to continue y/(n)?y
./torque.setup: line 37: qmgr: command not found
ERROR: cannot set TORQUE admins
./torque.setup: line 41: qterm: command not found
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120507/64fb3320/attachment.html

From mraymond at sgi.com Mon May 7 07:50:05 2012
From: mraymond at sgi.com (Michael Raymond)
Date: Mon, 07 May 2012 08:50:05 -0500
Subject: [torqueusers] Cpuset bug with pbs_track
Message-ID: <4FA7D30D.5000505@sgi.com>

I think I found a bug in tm.c.

Consider the following scenario. Within a torque job run:

mpirun -np 1 foo.sh pbs_track -j $PBS_JOBID -- ./a.out

Where foo.sh simply runs the next command on the command line.

#!/bin/bash
"$@"

In this situation pbs_track and a.out will not wind up in the job's cpuset. pbs_track calls tm_adopt() which says it sends the pid & sid of pbs_track to the local Torque server for tracking. The server takes the pid and puts the process into the assigned cpuset. tm_adopt() is then able to return and pbs_track runs the application. The problem is that tm_adopt() sends its sid when it means to be sending its own pid. This happens around line 1871 of tm.c. This puts foo.sh into the cpuset, but pbs_track has already forked out of it and doesn't get moved itself.

Am I correct in thinking this is a bug? Thanks.

--
Michael A. Raymond
SGI MPT Team Leader
(651) 683-3434

From knielson at adaptivecomputing.com Mon May 7 09:16:58 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Mon, 7 May 2012 09:16:58 -0600
Subject: [torqueusers] torque-4.0 installation
In-Reply-To: <001301cd2c38$e07ea7c0$a17bf740$@com>
References: <001301cd2c38$e07ea7c0$a17bf740$@com>
Message-ID:

On Mon, May 7, 2012 at 4:05 AM, Ezhilalan Ramalingam wrote:

> Hi All,
>
> I have tried to install torque-4.0 on a SUSE 10.1 linux PC however got in
> to problems as below.
>
> I have 5 other Suse linux PCs linked to the master PC (linux-01), to start
> with I downloaded torque 4.0 (recent version) and was trying to install
> same.
>
> I followed the steps as per guide 4.0 but got stuck at the 'setup' stage.
> I noticed that under /usr/local/bin there was binary files i.e qmgr, qterm
> etc are not installed.
>
> I used the command ./configure and other commands as per manual but unable
> to figure out what went wrong.
>
> Could I get some help to figure out this problem?
>
> Regards,
> Ezhilalan
>
> linux-01:/home/ezhil/Torque/torque-4.0.0 # ./torque.setup ezhil
> initializing TORQUE (admin: ezhil at linux-01.physics)
>
> You have selected to start pbs_server in create mode.
> If the server database exists it will be overwritten.
> do you wish to continue y/(n)?y
> ./torque.setup: line 37: qmgr: command not found
> ERROR: cannot set TORQUE admins
> ./torque.setup: line 41: qterm: command not found
>

Ezhilalan,

TORQUE 4.0.1 was released last week. I would suggest working with that version rather than spending more time with 4.0.0. There are significant improvements in 4.0.1. Please let us know if you are still having problems and we will get them corrected.

Ken
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120507/ae203cea/attachment.html

From ezhilalanrb at gmail.com Mon May 7 10:35:51 2012
From: ezhilalanrb at gmail.com (Ezhilalan Ramalingam)
Date: Mon, 7 May 2012 17:35:51 +0100
Subject: [torqueusers] torque-4.0 installation
In-Reply-To:
References: <001301cd2c38$e07ea7c0$a17bf740$@com>
Message-ID: <001b01cd2c6f$786b1c90$694155b0$@com>

Hi Ken,

I'll certainly try installing torque 4.0.1 and feedback how I got on.
Regards, Ezhilalan From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ken Nielson Sent: Monday, May 07, 2012 4:17 PM To: Torque Users Mailing List Subject: Re: [torqueusers] torque-4.0 installation On Mon, May 7, 2012 at 4:05 AM, Ezhilalan Ramalingam wrote: Hi All, I have tried to install torque-4.0 on a SUSE 10.1 linux PC however got in to problems as below. I have 5 other Suse linux PCs linked to the master PC (linux-01), to start with I downloaded torque 4.0 (recent version) and was trying to install same. I followed the steps as per guide 4.0 but got stuck at the 'setup' stage. I noticed that under /usr/local/bin there was binary files i.e qmgr, qterm etc are not installed. I used the command ./configure and other commands as per manual but unable to figure out what went wrong. Could I get some help to figure out this problem? Regards, Ezhilalan linux-01:/home/ezhil/Torque/torque-4.0.0 # ./torque.setup ezhil initializing TORQUE (admin: ezhil at linux-01.physics) You have selected to start pbs_server in create mode. If the server database exists it will be overwritten. do you wish to continue y/(n)?y ./torque.setup: line 37: qmgr: command not found ERROR: cannot set TORQUE admins ./torque.setup: line 41: qterm: command not found Ezhilalan, TORQUE 4.0.1 was released last week. I would suggest working with that version rather than spending more time with 4.0.0. There are significant improvements in 4.0.1. Please let us know if you are still having problems and we will get them corrected. Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120507/466a7c79/attachment-0001.html From luiz at if.usp.br Mon May 7 12:47:56 2012 From: luiz at if.usp.br (Luiz Carlos dos Santos) Date: Mon, 7 May 2012 15:47:56 -0300 Subject: [torqueusers] RES: queue In-Reply-To: References: <012101cd2c5e$7057c280$51074780$@if.usp.br> Message-ID: <013e01cd2c81$ea885920$bf990b60$@if.usp.br> Hi, thanks for your help. 1) What's inside PBS_sub_script_test ? #!/bin/sh #PBS -N job_test # #PBS -e job_test.error # #PBS -o job_test.output # #PBS -l nodes=1 #ThisHost=`hostname` cd ./pbsteste ./NRG_main > output_pbs_test.txt echo "Finished in ${ThisHost}" > Finish_test cd -------------------------------------------- 2) which scheduler are you using, maui ? If it is, what's inside the maui.cfg? # maui.cfg 3.3.1 SERVERHOST wilson.if.usp.br # primary admin must be first in list ADMIN1 root # Resource Manager Definition RMCFG[WILSON.IF.USP.BR] TYPE=PBS # Allocation Manager Definition AMCFG[bank] TYPE=NONE # full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html # use the 'schedctl -l' command to display current configuration RMPOLLINTERVAL 00:00:30 SERVERPORT 42559 SERVERMODE NORMAL # Admin: http://supercluster.org/mauidocs/a.esecurity.html LOGFILE maui.log LOGFILEMAXSIZE 10000000 LOGLEVEL 3 # Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html QUEUETIMEWEIGHT 1 # FairShare: http://supercluster.org/mauidocs/6.3fairshare.html #FSPOLICY PSDEDICATED #FSDEPTH 7 #FSINTERVAL 86400 #FSDECAY 0.80 # Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html # NONE SPECIFIED # Backfill: http://supercluster.org/mauidocs/8.2backfill.html BACKFILLPOLICY FIRSTFIT RESERVATIONPOLICY CURRENTHIGHEST # Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html NODEALLOCATIONPOLICY MINRESOURCE # QOS: http://supercluster.org/mauidocs/7.3qos.html # QOSCFG[hi] PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB # 
QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test] 17:00:00
# SRDAYS[test] MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test] 0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT] FSTARGET=25.0
# USERCFG[john] PRIORITY=100 FSTARGET=10.0-
# GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch] FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

From: Ju JiaJia [mailto:jujj603 at gmail.com]
Sent: Monday, May 7, 2012 11:45
To: luiz at if.usp.br; Torque Users Mailing List
Subject: Re: [torqueusers] queue

What's inside PBS_sub_script_test ? And which scheduler are you using, maui ? If it is, what's inside the maui.cfg?

On Mon, May 7, 2012 at 10:33 PM, Luiz Carlos dos Santos wrote:

Hi all,

I created a queue "long3" directed to a specific node, but it is not working correctly. When I submit a job I get the following error message:

---------------------------------------------------------------------------
[luiz at wilson pbsteste]$ qsub PBS_sub_script_test
qsub: Job rejected by all possible destinations (check syntax, queue resources, ...)
---------------------------------------------------------------------------

Please, could you help me? I include the configuration of the queue below.
#
create queue long3
set queue long3 queue_type = Execution
set queue long3 Priority = 60
set queue long3 max_running = 10
set queue long3 acl_host_enable = True
set queue long3 acl_hosts = wilson03
set queue long3 resources_max.cput = 24:00:00
set queue long3 resources_min.cput = 00:00:01
set queue long3 resources_default.cput = 12:00:00
set queue long3 enabled = True
set queue long3 started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default max_running = 10
set queue default route_destinations = long3
set queue default enabled = True
set queue default started = True
-------------------------------------------------------

Thanks,

Luiz Carlos dos Santos
Analista de Sistemas - IFUSP/FMT
Instituto de Física da USP
Departamento de Física dos Materiais e Mecânica
Pça. do Oceanográfico - Trav E, s/nº
Edifício Alessandro Volta, Bloco C - sala 112
CEP 05508-120 - São Paulo SP
Fone: (11) 3091-6784 / Fax: (11) 3091-6831
E-mail: luiz at if.usp.br
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120507/23566e34/attachment-0001.html

From dbeer at adaptivecomputing.com Mon May 7 16:53:59 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Mon, 7 May 2012 16:53:59 -0600
Subject: [torqueusers] Cpuset bug with pbs_track
In-Reply-To: <4FA7D30D.5000505@sgi.com>
References: <4FA7D30D.5000505@sgi.com>
Message-ID:

Mark,

You are correct that this is a bug. It should write pid and not sid on the second int that is sent. We will get this checked in. If you would like to use the updated patch, it is attached.

David

On Mon, May 7, 2012 at 7:50 AM, Michael Raymond wrote:

> I think I found a bug in tm.c.
>
> Consider the following scenario.
> Within a torque job run:
>
> mpirun -np 1 foo.sh pbs_track -j $PBS_JOBID -- ./a.out
>
> Where foo.sh simply runs the next command on the command line.
>
> #!/bin/bash
> "$@"
>
> In this situation pbs_track and a.out will not wind up in the job's
> cpuset. pbs_track calls tm_adopt() which says it sends the pid & sid of
> pbs_track to the local Torque server for tracking. The server takes the
> pid and puts the process into the assigned cpuset. tm_adopt() is then
> able to return and pbs_track runs the application. The problem is that
> tm_adopt() sends its sid when it means to be sending its own pid. This
> happens around line 1871 of tm.c. This puts foo.sh into the cpuset, but
> pbs_track has already forked out of it and doesn't get moved itself.
>
> Am I correct in thinking this is a bug? Thanks.
>
> --
> Michael A. Raymond
> SGI MPT Team Leader
> (651) 683-3434
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120507/984ae7d0/attachment.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: WritePid.patch
Type: application/octet-stream
Size: 472 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120507/984ae7d0/attachment.obj

From s.breedveld at erasmusmc.nl Tue May 8 06:05:01 2012
From: s.breedveld at erasmusmc.nl (Sebastiaan Breedveld)
Date: Tue, 08 May 2012 14:05:01 +0200
Subject: [torqueusers] Simple Torque+Maui setup: jobs stay queued, no resources [SOLVED]
In-Reply-To:
References: <4FA133F2.80303@erasmusmc.nl> <2445104.utCSrFlz9W@newton.cc.uit.no> <4FA393B1.5000108@erasmusmc.nl> <4FA42581.4010901@erasmusmc.nl> <4FA4EBFA.2040408@erasmusmc.nl>
Message-ID: <4FA90BED.8030802@erasmusmc.nl>

On 05/08/2012 01:18 PM, Ju JiaJia wrote:
> I have configed my torque and maui with your configuration. Sadly,
> jobs just queued.
> I checked the documentation of torque, there is no
> resources_default.neednodes, legal attributes of resources_default are
> arch, mem, nodes, ncpus, nodect, procct, pvmem, and walltime.
> So i unset resources_default.neednodes, and restart maui. everything
> just works fine.
> Note: maui must be restarted, or nothing will change.
>
> I run torque and maui on my notebook, so one node is all i have. I
> test with this command :
> echo "HELLO" | qsub -l mem=10mb,pvmem=10mb -q batch
>
>

Yes, now it works! Thank you very much!

As a summary (for others getting stuck): the problem was that resources_default.mem and resources_default.pvmem were not set in the default configuration (and I did not specify them with qsub, as I am only interested in ppn for my cases). As a result, jobs were not executed because of insufficient memory resources on the nodes. I copy/pasted different suggested queue configurations, and probably neednodes was included somewhere, and jobs got stuck on that one after solving the memory issue. (However, unsetting the default memory resources now also works, so I have no idea why it did not work to begin with...)

Super, now I can finally start learning Maui. Thanks!
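
For anyone replaying this fix, the change described in this thread might look roughly like the qmgr directives below. This is only a sketch under assumptions: the queue name batch comes from the configuration quoted in this thread, the 10mb values are borrowed from the test command quoted in this message rather than being recommended defaults, and the file name fix-queue.qmgr is made up for illustration.

```
# fix-queue.qmgr (hypothetical file name); run on the server host as root
# Remove the default that is not among the legal resources_default
# attributes listed in the Torque documentation.
unset queue batch resources_default.neednodes
# Assumed values: keep the default memory requests within what the node
# actually offers (checknode reported MEM: 984M and SWAP: 1996M here).
set queue batch resources_default.mem = 10mb
set queue batch resources_default.pvmem = 10mb
```

The directives could be applied with qmgr < fix-queue.qmgr and checked with qmgr -c "p s". As noted in the thread, Maui has to be restarted afterwards, otherwise nothing changes.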
Sebastiaan > > On Sat, May 5, 2012 at 4:59 PM, Sebastiaan Breedveld > > wrote: > > On 05/05/2012 04:07 AM, Ju JiaJia wrote: >> Can you run qsub -q batch -l pvmem=10mb,mem=10mb test-script.sh >> with both normal user and root ? >> >> > > The scheduler does not allow me to submit jobs as root. > > As a normal user, the job is still not executed: > > checking job 63 (RM job '63.testing.azr.nl > ') > > > State: Idle EState: Deferred > Creds: user:sebastiaan group:sebastiaan class:batch qos:DEFAULT > WallTime: 00:00:00 of 6:00:00 > SubmitTime: Sat May 5 07:53:05 > (Time Queued Total: 3:04:41 Eligible: 00:00:00) > > > Total Tasks: 1 > > Req[0] TaskCount: 1 Partition: ALL > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 10M > > Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] > Exec: '' ExecSize: 0 ImageSize: 0 > Dedicated Resources Per Task: PROCS: 1 MEM: 10M SWAP: 10M > > NodeAccess: SHARED > NodeCount: 1 > > > IWD: [NONE] Executable: [NONE] > Bypass: 0 StartCount: 0 > PartitionMask: [ALL] > SystemQueueTime: Sat May 5 10:54:09 > > Flags: RESTARTABLE > > job is deferred. Reason: NoResources (cannot create reservation > for job '63' (intital reservation attempt) > > ) > Holds: Defer (hold reason: NoResources) > PE: 1.00 StartPriority: 3 > cannot select job 63 for partition DEFAULT (job hold active) > > >> On Sat, May 5, 2012 at 2:52 AM, Sebastiaan Breedveld >> > wrote: >> >> Dear Ju, >> >> >> >> On 05/04/2012 01:33 PM, Ju JiaJia wrote: >>> How many processes and nodes used in you test-script.sh ? >> Nothing special defined: >> >> #!/bin/bash >> # >> # This is the unique output file for this job >> OUTPUTFILE=test_output_${PBS_JOBID}.log >> # Print >> echo "Output for this job:" > $OUTPUTFILE >> # Do something >> sleep 1m >> # Get a list of environment variables >> date >> $OUTPUTFILE >> # Do something more >> sleep 5s >> # quit >> exit 0 >> >> >> >>> Check maui.cfg to see whether you have enough right to run >>> multi processes. 
If you don't, you job will just hang and >>> without any error messages. >>> In maui.cfg, something like this: >>> USERCFG[your-user-name] MAXJOB=1 MAXQUEUEDJOB=1 MAXNODE=1 >>> MAXPROC=1 >>> and default you can only run one process on one node. >>> >> There are no lines like that. The maui.cfg is untouched: >> >> >> # maui.cfg 3.3.1 >> SERVERHOST testing >> ADMIN1 root >> RMCFG[ULURU-TESTING] TYPE=PBS >> AMCFG[bank] TYPE=NONE >> >> RMPOLLINTERVAL 00:00:30 >> SERVERPORT 42559 >> SERVERMODE NORMAL >> LOGFILE maui.log >> LOGFILEMAXSIZE 10000000 >> LOGLEVEL 3 >> QUEUETIMEWEIGHT 1 >> BACKFILLPOLICY FIRSTFIT >> RESERVATIONPOLICY CURRENTHIGHEST >> NODEALLOCATIONPOLICY MINRESOURCE >> >> Sincerely, >> Sebastiaan >> >> >>> On Fri, May 4, 2012 at 4:30 PM, Sebastiaan Breedveld >>> > >>> wrote: >>> >>> Dear Roy, >>> >>> On 05/03/2012 01:51 PM, Roy Dragseth wrote: >>> > It seems like you are trying to submit a job wich >>> requires 15 GB pvmem on a >>> > nodes that have less than 3 GB total virtual memory. >>> Maui also considers >>> > specified memory parameters when scheduling jobs. >>> Thank you for the hint. 
I have now set the default >>> resources for memory >>> to 1b in the queue: >>> >>> resources_max.ncpus = 8 >>> resources_max.nodect = 10 >>> resources_max.nodes = 2 >>> resources_min.ncpus = 0 >>> resources_default.mem = 1b >>> resources_default.ncpus = 1 >>> resources_default.neednodes = 1:ppn=1 >>> resources_default.nodect = 1 >>> resources_default.nodes = 1 >>> resources_default.pvmem = 1b >>> >>> and additionally submitted a job setting low requirements: >>> >>> $ qsub -q batch -l pvmem=1,mem=1 test-script.sh >>> >>> Unfortunately, the job still does not run: >>> >>> # checkjob -v 61 >>> >>> >>> checking job 61 (RM job '61.testing.azr.nl >>> ') >>> >>> State: Idle EState: Deferred >>> Creds: user:sebastiaan group:sebastiaan class:batch >>> qos:DEFAULT >>> WallTime: 00:00:00 of 6:00:00 >>> SubmitTime: Fri May 4 10:27:03 >>> (Time Queued Total: 00:01:34 Eligible: 00:00:01) >>> >>> Total Tasks: 1 >>> >>> Req[0] TaskCount: 1 Partition: ALL >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 1M >>> Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] >>> Exec: '' ExecSize: 0 ImageSize: 0 >>> Dedicated Resources Per Task: PROCS: 1 MEM: 1M SWAP: 1M >>> NodeAccess: SHARED >>> NodeCount: 1 >>> >>> >>> IWD: [NONE] Executable: [NONE] >>> Bypass: 0 StartCount: 0 >>> PartitionMask: [ALL]u >>> Flags: RESTARTABLE >>> >>> job is deferred. Reason: NoResources (cannot create >>> reservation for >>> job '61' (intital reservation attempt) >>> ) >>> Holds: Defer (hold reason: NoResources) >>> PE: 1.00 StartPriority: 1 >>> cannot select job 61 for partition DEFAULT (job hold active) >>> >>> >>> Any other hints? >>> >>> Thanks in advance, >>> Sebastiaan >>> >>> >>> > r. >>> > >>> > >>> > On Wednesday, May 02, 2012 15:17:38 Sebastiaan >>> Breedveld wrote: >>> >> Dear list, >>> >> >>> >> This may be a Maui issue, but the Maui list seems dead :( >>> >> >>> >> >>> >> I am trying to setup a very basic Torque+Maui system. 
>>> I am running a >>> >> Torque cluster for a year now, and wanted to improve >>> the scheduling with >>> >> Maui. To this end, I installed a fresh test-system, >>> with server and node >>> >> on a single computer. >>> >> >>> >> Torque version: 2.4.16 >>> >> Maui version: 3.3.1 >>> >> uname: Linux testing 3.2.0-20-generic #33-Ubuntu SMP >>> Tue Mar 27 16:42:26 >>> >> UTC 2012 x86_64 x86_64 x86_64 GNU/Linux >>> >> >>> >> >>> >> >>> >> I was able to run (simple) jobs with the Torque >>> scheduler. When I >>> >> replaced the scheduler with Maui, jobs stay queued. >>> Jobs are submitted by: >>> >> >>> >> $ qsub -q batch test-script.sh >>> >> >>> >> where test-script.sh is nothing more than a 'sleep >>> 1m' script. Checking >>> >> the job: >>> >> >>> >> # checkjob -v 55 >>> >> checking job 55 (RM job '55.testing.azr.nl >>> ') >>> >> >>> >> State: Idle EState: Deferred >>> >> Creds: user:sebastiaan group:sebastiaan >>> class:batch qos:DEFAULT >>> >> WallTime: 00:00:00 of 6:00:00 >>> >> SubmitTime: Thu Apr 5 13:21:33 >>> >> (Time Queued Total: 00:00:32 Eligible: 00:00:01) >>> >> >>> >> Total Tasks: 1 >>> >> >>> >> Req[0] TaskCount: 1 Partition: ALL >>> >> Network: [NONE] Memory>= 0 Disk>= 0 Swap>= 15G >>> >> Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1] >>> >> Exec: '' ExecSize: 0 ImageSize: 0 >>> >> Dedicated Resources Per Task: PROCS: 1 MEM: 2000M >>> SWAP: 15G >>> >> NodeAccess: SHARED >>> >> NodeCount: 1 >>> >> >>> >> >>> >> IWD: [NONE] Executable: [NONE] >>> >> Bypass: 0 StartCount: 0 >>> >> PartitionMask: [ALL] >>> >> Flags: RESTARTABLE >>> >> >>> >> job is deferred. Reason: NoResources (cannot >>> create reservation for >>> >> job '55' (intital reservation attempt) >>> >> ) >>> >> Holds: Defer (hold reason: NoResources) >>> >> PE: 16.03 StartPriority: 1 >>> >> cannot select job 55 for partition DEFAULT (job hold >>> active) >>> >> >>> >> >>> >> >>> >> show that there are no resources available. 
The node >>> is free, and unloaded: >>> >> >>> >> # checknode testing >>> >> >>> >> >>> >> checking node testing.azr.nl >>> >> >>> >> State: Idle (in current state for 2:23:54) >>> >> Configured Resources: PROCS: 2 MEM: 984M SWAP: >>> 1996M DISK: 1M >>> >> Utilized Resources: SWAP: 149M >>> >> Dedicated Resources: [NONE] >>> >> Opsys: linux Arch: [NONE] >>> >> Speed: 1.00 Load: 0.050 >>> >> Network: [DEFAULT] >>> >> Features: [NONE] >>> >> Attributes: [Batch] >>> >> Classes: [batch 2:2] >>> >> >>> >> Total Time: 16:11:49 Up: 16:11:49 (100.00%) Active: >>> 00:01:00 (0.10%) >>> >> >>> >> Reservations: >>> >> NOTE: no reservations on node >>> >> >>> >> >>> >> >>> >> When the job is added, maui.log shows this: >>> >> 04/05 13:21:34 MPBSJobLoad(55,55.testing.azr.nl >>> ,J,TaskList,0) >>> >> 04/05 13:21:34 MReqCreate(55,SrcRQ,DstRQ,DoCreate) >>> >> 04/05 13:21:34 INFO: processing node request line '1' >>> >> 04/05 13:21:34 MJobSetCreds(55,sebastiaan,sebastiaan,) >>> >> 04/05 13:21:34 INFO: default QOS for job 55 set >>> to DEFAULT(0) >>> >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) >>> >> 04/05 13:21:34 INFO: default QOS for job 55 set >>> to DEFAULT(0) >>> >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) >>> >> 04/05 13:21:34 INFO: default QOS for job 55 set >>> to DEFAULT(0) >>> >> (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE]) >>> >> 04/05 13:21:34 INFO: job '55' loaded: 1 >>> sebastiaan sebastiaan >>> >> 21600 Idle 0 1333624893 [NONE] [NONE] >>> [NONE]>= 0>= 0 >>> >> [1][ppn=1] 1333624894 >>> >> 04/05 13:21:34 INFO: 12 PBS jobs detected on RM >>> TESTING >>> >> 04/05 13:21:34 INFO: jobs detected: 12 >>> >> 04/05 13:21:34 MStatClearUsage(node,Active) >>> >> 04/05 13:21:34 MClusterUpdateNodeState() >>> >> 04/05 13:21:34 >>> MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg) >>> >> 04/05 13:21:34 INFO: job '40' Priority: 22 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) 
>>> >> 04/05 13:21:34 INFO: job '41' Priority: 22 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '42' Priority: 22 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '44' Priority: 22 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '45' Priority: 22 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '47' Priority: 22 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '48' Priority: 16 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '49' Priority: 12 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '52' Priority: 8 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '53' Priority: 1 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '54' Priority: 60 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '55' Priority: 1 >>> >> 04/05 
13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 MStatClearUsage([NONE],Active) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 INFO: total jobs selected (ALL): >>> 1/12 [EState: 11] >>> >> 04/05 13:21:34 >>> MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg) >>> >> 04/05 13:21:34 INFO: job '40' Priority: 22 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '41' Priority: 22 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '42' Priority: 22 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '44' Priority: 22 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '45' Priority: 22 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '47' Priority: 22 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 22(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '48' Priority: 16 >>> >> 04/05 
13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 16(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '49' Priority: 12 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 12(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '52' Priority: 8 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 8(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '53' Priority: 1 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '54' Priority: 60 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 60(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 INFO: job '55' Priority: 1 >>> >> 04/05 13:21:34 INFO: Cred: 0(00.0) FS: >>> 0(00.0) Attr: >>> >> 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: >>> 0(00.0) Us: >>> >> 0(00.0) >>> >> 04/05 13:21:34 MStatClearUsage([NONE],Idle) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 MResDestroy(NULL) >>> >> 04/05 13:21:34 INFO: total jobs selected (ALL): >>> 1/12 [EState: 11] >>> >> 04/05 13:21:34 >>> >> >>> MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE) >>> >> 04/05 13:21:34 INFO: total jobs selected in >>> partition ALL: 1/1 >>> >> 04/05 13:21:34 MQueueScheduleRJobs(Q) >>> >> 04/05 13:21:34 >>> >> >>> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) >>> 
>> 04/05 13:21:34 INFO: total jobs selected in >>> partition ALL: 1/1 >>> >> 04/05 13:21:34 >>> >> >>> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) >>> >> 04/05 13:21:34 INFO: total jobs selected in >>> partition DEFAULT: 1/1 >>> >> 04/05 13:21:34 MQueueScheduleIJobs(Q,DEFAULT) >>> >> 04/05 13:21:34 INFO: 0 feasible tasks found for >>> job 55:0 in >>> >> partition DEFAULT (1 Needed) >>> >> 04/05 13:21:34 >>> MJobPReserve(55,DEFAULT,ResCount,ResCountRej) >>> >> 04/05 13:21:34 MJobReserve(55,Priority) >>> >> 04/05 13:21:34 ALERT: job 55 cannot run in any >>> partition >>> >> 04/05 13:21:34 ALERT: cannot create new >>> reservation for job 55 >>> >> (shape[1] 1) >>> >> 04/05 13:21:34 ALERT: cannot create new >>> reservation for job 55 >>> >> 04/05 13:21:34 >>> MJobSetHold(55,16,1:00:00,NoResources,cannot create >>> >> reservation for job '55' (intital reservation attempt) >>> >> ) >>> >> 04/05 13:21:34 ALERT: job '55' cannot run >>> (deferring job for 3600 >>> >> seconds) >>> >> 04/05 13:21:34 WARNING: cannot reserve priority job '55' >>> >> Active Jobs------ >>> >> ------------------ >>> >> 04/05 13:21:34 INFO: resources available after >>> scheduling: N: 1 P: 2 >>> >> 04/05 13:21:34 >>> >> >>> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE) >>> >> 04/05 13:21:34 INFO: total jobs selected in >>> partition DEFAULT: 0/1 >>> >> [EState: 1] >>> >> 04/05 13:21:34 >>> >> >>> MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE) >>> >> 04/05 13:21:34 INFO: total jobs selected in >>> partition ALL: 0/1 >>> >> [EState: 1] >>> >> 04/05 13:21:34 >>> >> >>> MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE) >>> >> 04/05 13:21:34 INFO: total jobs selected in >>> partition ALL: 0/1 >>> >> [EState: 1] >>> >> 04/05 13:21:34 MSchedUpdateStats() >>> >> 04/05 13:21:34 INFO: iteration: 288 scheduling >>> time: 0.002 seconds >>> >> 04/05 13:21:34 MResUpdateStats() >>> >> 04/05 13:21:34 INFO: 
current util[288]: 0/1 >>> (0.00%) PH: 0.00% >>> >> active jobs: 0 of 2 (completed: 1) >>> >> 04/05 13:21:34 MQueueCheckStatus() >>> >> 04/05 13:21:34 MNodeCheckStatus() >>> >> 04/05 13:21:34 MUClearChild(PID) >>> >> 04/05 13:21:34 INFO: scheduling complete. >>> sleeping 30 seconds >>> >> >>> >> >>> >> I think the relevant line is: >>> >> 04/05 13:21:34 INFO: 0 feasible tasks found for >>> job 55:0 in >>> >> partition DEFAULT (1 Needed) >>> >> >>> >> but I have no idea how to make a feasible task for >>> the job. I have tried >>> >> queueing with -l nodes=1:ppn=1 -l walltime=2:00:00, >>> etc. but none seem >>> >> to have had effect. >>> >> >>> >> >>> >> >>> >> Torque config. I have tried setting different >>> attributes to the queue >>> >> properties, hoping that it would have some effect: >>> >> # qmgr -c "p s" >>> >> # >>> >> # Create queues and set their attributes. >>> >> # >>> >> # >>> >> # Create and define queue batch >>> >> # >>> >> create queue batch >>> >> set queue batch queue_type = Execution >>> >> set queue batch Priority = 20 >>> >> set queue batch max_running = 8 >>> >> set queue batch resources_max.ncpus = 8 >>> >> set queue batch resources_max.nodect = 10 >>> >> set queue batch resources_max.nodes = 2 >>> >> set queue batch resources_min.ncpus = 0 >>> >> set queue batch resources_default.mem = 2000mb >>> >> set queue batch resources_default.ncpus = 1 >>> >> set queue batch resources_default.neednodes = 1:ppn=1 >>> >> set queue batch resources_default.nodect = 1 >>> >> set queue batch resources_default.nodes = 1 >>> >> set queue batch resources_default.pvmem = 16000mb >>> >> set queue batch resources_default.walltime = 06:00:00 >>> >> set queue batch enabled = True >>> >> set queue batch started = True >>> >> # >>> >> # Set server attributes. 
>>> >> # >>> >> set server scheduling = True >>> >> set server acl_hosts = testing.azr.nl >>> >>> >> set server log_events = 511 >>> >> set server mail_from = adm >>> >> set server resources_available.nodect = 10 >>> >> set server scheduler_iteration = 600 >>> >> set server node_check_rate = 150 >>> >> set server tcp_timeout = 6 >>> >> set server next_job_number = 56 >>> >> >>> >> >>> >> Maui configuration, untouched: >>> >> # maui.cfg 3.3.1 >>> >> >>> >> SERVERHOST testing >>> >> # primary admin must be first in list >>> >> ADMIN1 root >>> >> >>> >> # Resource Manager Definition >>> >> >>> >> RMCFG[TESTING] TYPE=PBS >>> >> >>> >> # Allocation Manager Definition >>> >> >>> >> AMCFG[bank] TYPE=NONE >>> >> >>> >> # full parameter docs at >>> http://supercluster.org/mauidocs/a.fparameters.html >>> >> # use the 'schedctl -l' command to display current >>> configuration >>> >> >>> >> RMPOLLINTERVAL 00:00:30 >>> >> >>> >> SERVERPORT 42559 >>> >> SERVERMODE NORMAL >>> >> >>> >> # Admin: >>> http://supercluster.org/mauidocs/a.esecurity.html >>> >> >>> >> >>> >> LOGFILE maui.log >>> >> LOGFILEMAXSIZE 10000000 >>> >> LOGLEVEL 3 >>> >> >>> >> # Job Priority: >>> http://supercluster.org/mauidocs/5.1jobprioritization.html >>> >> >>> >> QUEUETIMEWEIGHT 1 >>> >> >>> >> # FairShare: >>> http://supercluster.org/mauidocs/6.3fairshare.html >>> >> >>> >> #FSPOLICY PSDEDICATED >>> >> #FSDEPTH 7 >>> >> #FSINTERVAL 86400 >>> >> #FSDECAY 0.80 >>> >> >>> >> # Throttling Policies: >>> >> >>> http://supercluster.org/mauidocs/6.2throttlingpolicies.html >>> >> >>> >> # NONE SPECIFIED >>> >> >>> >> # Backfill: >>> http://supercluster.org/mauidocs/8.2backfill.html >>> >> >>> >> BACKFILLPOLICY FIRSTFIT >>> >> RESERVATIONPOLICY CURRENTHIGHEST >>> >> >>> >> # Node Allocation: >>> http://supercluster.org/mauidocs/5.2nodeallocation.html >>> >> >>> >> NODEALLOCATIONPOLICY MINRESOURCE >>> >> >>> >> # QOS: http://supercluster.org/mauidocs/7.3qos.html >>> >> >>> >> # QOSCFG[hi] PRIORITY=100 XFTARGET=100 
>>> FLAGS=PREEMPTOR:IGNMAXJOB >>> >> # QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE >>> >> >>> >> # Standing Reservations: >>> >> >>> http://supercluster.org/mauidocs/7.1.3standingreservations.html >>> >> >>> >> # SRSTARTTIME[test] 8:00:00 >>> >> # SRENDTIME[test] 17:00:00 >>> >> # SRDAYS[test] MON TUE WED THU FRI >>> >> # SRTASKCOUNT[test] 20 >>> >> # SRMAXTIME[test] 0:30:00 >>> >> >>> >> # Creds: >>> http://supercluster.org/mauidocs/6.1fairnessoverview.html >>> >> >>> >> # USERCFG[DEFAULT] FSTARGET=25.0 >>> >> # USERCFG[john] PRIORITY=100 FSTARGET=10.0- >>> >> # GROUPCFG[staff] PRIORITY=1000 QLIST=hi:low >>> QDEF=hi >>> >> # CLASSCFG[batch] FLAGS=PREEMPTEE >>> >> # CLASSCFG[interactive] FLAGS=PREEMPTOR >>> >> >>> >> >>> >> >>> >> Any ideas? >>> >> >>> >> Thanks in advance, >>> >> Sebastiaan >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >>> >>> >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> > > -- Sebastiaan Breedveld, MSc. Ph.D. student Erasmus MC - Daniel den Hoed Cancer Center Department of Radiation Oncology Groene Hilledijk 301 3075 EA Rotterdam The Netherlands Phone: +31 10 7042693 Room: Gs-20 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120508/e8948ffc/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... 
Name: s_breedveld.vcf Type: text/x-vcard Size: 365 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120508/e8948ffc/attachment-0001.vcf From gas5x at yahoo.com Tue May 8 09:01:49 2012 From: gas5x at yahoo.com (Grigory Shamov) Date: Tue, 8 May 2012 08:01:49 -0700 (PDT) Subject: [torqueusers] strategies for bad nodes In-Reply-To: <20120504184950.GW9750@lbl.gov> Message-ID: <1336489309.96118.YahooMailClassic@web111303.mail.gq1.yahoo.com> Dear Michael, Thank you very much for the patch and instructions. Now, I've unleashed the NHC on our production cluster. It was immediately useful for I've discovered that a few nodes were different (poor IB speed, different disk partitioning, that is, swap sizes). The problem is, NHC kills itself after $TIMEOUT which is by default 5 sec., and this also offlines the node. Which happens a bit too often -- some nodes with large memory request do swap slightly, and many nodes get offlined. Are there any rules of thumb/considerations on the best values of $TIMEOUT and how they correspond to the intervals in the Torque MOM config file? -- Grigory Shamov HPC Analyst University of Manitoba --- On Fri, 5/4/12, Michael Jennings wrote: > From: Michael Jennings > Subject: Re: [torqueusers] strategies for bad nodes > To: torqueusers at supercluster.org > Date: Friday, May 4, 2012, 11:49 AM > On Friday, 04 May 2012, at 09:42:47 > (-0700), > Grigory Shamov wrote: > > > Thank you for making the NHC available to the public, > and well documented. > > Thank *you* for your very valuable feedback, and for taking > the time > to troubleshoot the problems you encountered! :-) > > As a token of appreciation, here's an undocumented > tip: Any variable > NHC uses, including the path and filename for the > configuration file, > can be overridden in /etc/sysconfig/nhc. This can be > used to change > the value of $PATH, alter runtime options, etc. It's > sourced, so any > valid shell commands can be run there.
> > > I dont know what the users' problem with it; but I > could guess that > > it perhaps didnt work "out of the box". First, the > nhc.conf has only > > examples, which of course always fail. But this is well > documented > > already. > > Yes, you're right. Do you think it would be better if > the examples > were all commented out so that the check would "succeed" > out-of-the-box? Or would that be less valuable? > > > Second, the helper scripts that online/offline the > nodes have > > assumptions on where the pbsnodes command is and which > user runs it; > > which in most cases have to be changed. On most of the > Rocks > > installations, for example, Torque lives under > /opt/torque, not > > /usr/local/torque. > > Good point. We always use the TORQUE RPMs, so the > paths are correct > for us, but I certainly realize that many don't. > > So I've just committed the attached patch which does the > following: > > - Removes debugging stuff for UIDs > 100 so that NHC can > potentially > be run as non-root in a normal fashion. > - Convert node online/offline scripts to use variables and > $PATH to > identify where the "pbsnodes" command is > and what arguments it > should take. > - Add an "eval" to the execution of the check so that shell > variables > can be used or altered in config files. > > With this patch, you can put things in your config file like > the > following: > > * || export PBSNODES=/opt/torque/bin/pbsnodes > *.testbed || export MARK_OFFLINE=0 > > or even > > *.cluster1 || NETDEV=eth0 > *.cluster2 || NETDEV=eth1 > *.cluster3 || NETDEV=virbr0 > *.cluster? || check_hw_eth $NETDEV > > > For me the NHC worked after I fixed these (trivial) > things; I also > > use not 'root' but a dummy user to call pbsnodes from > helper scripts > > (making the dummy user the operator for local cluster > network).
> > Something like the following should now work for you > out-of-the-box > with the attached patch: > > * || export PBSNODES="su - pbsadmin -c /opt/torque/bin/pbsnodes" > > Thanks again for your comments! > Michael > > -- > Michael Jennings > Senior HPC Systems Engineer > High-Performance Computing Services > Lawrence Berkeley National Laboratory > Bldg 50B-3209E W: 510-495-2687 > MS 050B-3209 F: 510-486-8615 > > -----Inline Attachment Follows----- > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From mej at lbl.gov Tue May 8 13:36:03 2012 From: mej at lbl.gov (Michael Jennings) Date: Tue, 8 May 2012 12:36:03 -0700 Subject: [torqueusers] strategies for bad nodes In-Reply-To: <1336489309.96118.YahooMailClassic@web111303.mail.gq1.yahoo.com> References: <20120504184950.GW9750@lbl.gov> <1336489309.96118.YahooMailClassic@web111303.mail.gq1.yahoo.com> Message-ID: <20120508193602.GO9750@lbl.gov> On Tuesday, 08 May 2012, at 08:01:49 (-0700), Grigory Shamov wrote: > Thank you very much for the patch and instructions. Now, I've > unleashed the NHC on our production cluster. It was immediately > useful for I've discovered that a few nodes were different (poor IB > speed, different disk partitioning, that is, swap sizes). Glad to hear it! :-) > The problem is, NHC kills itself after $TIMEOUT which is by default > 5 sec., and this also offlines the node. Which happens a bit too > often -- some nodes with large memory request do swap slightly, and > many nodes get offlined. > > Are there any rules of thumb/considerations on the best values of > $TIMEOUT and how they correspond to the intervals in the Torque MOM > config file? The short answer to your question is: I'm not sure yet.
We use it heavily in production on most of our clusters here, and a 5-second timeout is long enough to cover anywhere from 99.9% to 99.98% of the runs on our busiest clusters. However, all sites are different. The only way you'll know for sure is to test. For reference, on the oldest and most hardware-varied of our 90th-percentile clusters, here are the figures: 5 sec -- 99.9% 10 sec -- 99.996% 15 sec -- 99.9998% 20 sec -- 99.9999% So on a similar system, the "5 9's rule" would dictate a timeout of 15 seconds. Time required will vary based on core count, idle cycles available, RAM consumption, etc. The other key factor is how many checks you have that require external commands to gather data. A check script with lots of backquotes and pipelines will have significantly more context switches (and thus go a lot slower) than one that's more self-contained. In any event, I wouldn't recommend setting it any higher than the MOM interval (45 seconds by default), minus a little bit for a safety net. Maybe 40 seconds for an upper limit? Just my conjecture on that, though; maybe Ken or David could comment more on the repercussions of bumping up against the interval limit. HTH, Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From gas5x at yahoo.com Tue May 8 17:20:39 2012 From: gas5x at yahoo.com (Grigory Shamov) Date: Tue, 8 May 2012 16:20:39 -0700 (PDT) Subject: [torqueusers] strategies for bad nodes Message-ID: <1336519239.82806.YahooMailClassic@web111304.mail.gq1.yahoo.com> Dear Michael, Thank you for the answers! --- On Tue, 5/8/12, Michael Jennings wrote: > > The problem is, NHC kills itself after $TIMEOUT which > is by default > > 5 sec., and this also offlines the node. Which happens > a bit too > > often -- some nodes with large memory request do swap > slightly, and > > many nodes get offlined. 
> > > > Are there any rules of thumb/considerations on the best > values of > > $TIMEOUT and how they correspond to the intervals in > the Torque MOM > > config file? > > The short answer to your question is: I'm not sure > yet. > > We use it heavily in production on most of our clusters > here, and a > 5-second timeout is long enough to cover anywhere from 99.9% > to 99.98% > of the runs on our busiest clusters. > > However, all sites are different. The only way you'll > know for sure > is to test. For reference, on the oldest and most > hardware-varied of > our 90th-percentile clusters, here are the figures: > > 5 sec -- 99.9% > 10 sec -- 99.996% > 15 sec -- 99.9998% > 20 sec -- 99.9999% > > So on a similar system, the "5 9's rule" would dictate a > timeout of 15 > seconds. Thanks! > Time required will vary based on core count, idle cycles > available, > RAM consumption, etc. The other key factor is how many > checks you > have that require external commands to gather data. A > check script > with lots of backquotes and pipelines will have > significantly more > context switches (and thus go a lot slower) than one that's > more > self-contained. I've found that the most time-consuming checks would be the check_ps ones. So if all of them are switched off, NHC usually finishes in 5 sec. These tests are potentially useful though. > In any event, I wouldn't recommend setting it any higher > than the MOM > interval (45 seconds by default), minus a little bit for a > safety > net. Maybe 40 seconds for an upper limit? Just > my conjecture on > that, though; maybe Ken or David could comment more on the > repercussions of bumping up against the interval limit. Am I understanding correctly that MOM would be blocked, waiting while a healthcheck is being executed?
-- Grigory Shamov HPC Analyst, University of Manitoba From mej at lbl.gov Tue May 8 18:00:12 2012 From: mej at lbl.gov (Michael Jennings) Date: Tue, 8 May 2012 17:00:12 -0700 Subject: [torqueusers] strategies for bad nodes In-Reply-To: <1336519239.82806.YahooMailClassic@web111304.mail.gq1.yahoo.com> References: <1336519239.82806.YahooMailClassic@web111304.mail.gq1.yahoo.com> Message-ID: <20120509000012.GV9750@lbl.gov> On Tuesday, 08 May 2012, at 16:20:39 (-0700), Grigory Shamov wrote: > I've found that the most time-consuming checks would be the check_ps > ones. So if all of them are switched off, NHC usually finishes in 5 > sec. These tests are potentially useful though. Not surprising, since the only thing I couldn't fully internalize into bash-ese was the gathering of info on running processes. :-) That said, I'm concerned that multiple check_ps checks would add significantly more time *per check*. Once you have one or more check_ps checks in your config, you'll already pay the price of running the ps command, but it should ONLY run one time. After that, all other check_ps checks should pull their data from the cache, so the only additional penalty would be iterating through the cache array and performing the logic. So you should be seeing a hit on the first check, but additional checks should not carry a significant penalty (or at least not nearly as great as the first one). Is that what you're seeing, or are you seeing the same penalty per check_ps check? In the latter case, that would tend to indicate a bug. Just to rule out the possibility of running jobs affecting the tests, you may want to try some test runs of NHC on idle nodes to see how long they take. On idle semi-modern hardware (dual-socket 6-core Intel X5650s at 2.67GHz, 96GB RAM), out of 654,885 total NHC runs, I saw none at all that look longer than 4 seconds. (We have 4 check_ps checks active in our config on that system.) 
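A rough way to collect such timings on an idle node is sketched below. It assumes bash and that the health check is started with the bare command nhc, as elsewhere in this thread. The helper name slowest_run is hypothetical and is not part of NHC. The sketch only reports the slowest run in whole seconds, which is coarse but enough to compare against a candidate $TIMEOUT.

```shell
#!/bin/bash
# slowest_run COUNT COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND COUNT times and prints the longest
# wall-clock duration seen, in whole seconds (bash's SECONDS counter).
slowest_run() {
    local count=$1
    shift
    local worst=0 start elapsed i
    for ((i = 0; i < count; i++)); do
        start=$SECONDS
        # Output and exit status are ignored; only the duration matters here.
        "$@" >/dev/null 2>&1 || true
        elapsed=$((SECONDS - start))
        if ((elapsed > worst)); then
            worst=$elapsed
        fi
    done
    echo "$worst"
}
```

For example, "slowest_run 20 nhc" would print the slowest of twenty consecutive runs. Percentile figures like the ones quoted earlier would need the individual durations to be logged rather than only the maximum.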
> Am I understanding correctly that MOM would be blocked, waiting > while a healthcheck is being executed? Yes, that is currently the case. Until TORQUE runs the health check in a thread (hopefully later on in the 4.x development series), pbs_mom will block completely while NHC is running. That's part of why the efficiency and watchdog mechanism in NHC is so critical. Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From urbanjost at comcast.net Tue May 8 20:38:43 2012 From: urbanjost at comcast.net (John S. Urban) Date: Tue, 8 May 2012 22:38:43 -0400 Subject: [torqueusers] strategies for bad nodes In-Reply-To: <20120509000012.GV9750@lbl.gov> References: <1336519239.82806.YahooMailClassic@web111304.mail.gq1.yahoo.com> <20120509000012.GV9750@lbl.gov> Message-ID: <00F1C5DF91944762A633CF32A01AA8A2@urbanjsPC> The problems you describe (resources used, nodes closing, lockup on a hung process) are handled more easily by the method I use. A series of daemons and cron jobs running at different frequencies, completely independent of the scheduler(s), calculate a numeric rank and place the value in the file /tmp/RANK. If the rank is negative (or the file is more than 10 minutes old or unreadable) jobs do not start on the node. This is accomplished in PBS/Torque by setting a resource that all jobs require called GOOD. In LSF the value is used by an elim to set a node resource called "rank" and all queues are set to require a positive rank (-R 'rank>=1') and so on. We use a variety of schedulers. Temporary issues such as a file system being unavailable do not cause the node to actually close. New jobs just do not start while the rank is negative. Different checks can be made at different frequencies. Even if they did lock up, the check of the /tmp/RANK file would not.
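The rank-file test described above could be sketched roughly as follows. This is an illustration, not the poster's actual script: the function name node_is_good is hypothetical, the 600-second default mirrors the "10 minutes" mentioned above, and GNU stat is assumed. A node counts as good only when the file is readable, recent, and holds a non-negative integer. Anything else, including a malformed value, is treated as bad.

```shell
#!/bin/bash
# node_is_good RANK_FILE [MAX_AGE_SECONDS]
# Hypothetical helper: returns 0 when the rank file is readable, was
# modified within MAX_AGE_SECONDS (default 600), and its first line is a
# non-negative integer. Returns 1 in every other case.
node_is_good() {
    local file=$1 max_age=${2:-600}
    local rank mtime now
    [ -r "$file" ] || return 1
    # GNU coreutils syntax; BSD stat would need "stat -f %m" instead.
    mtime=$(stat -c %Y "$file") || return 1
    now=$(date +%s)
    (( now - mtime <= max_age )) || return 1
    rank=$(head -n 1 "$file")
    # A leading minus sign (negative rank) or any non-digit fails the match.
    [[ $rank =~ ^[0-9]+$ ]]
}
```

Because the function only reads a small local file, it returns quickly even when the checks that produce the rank are hung, which is the property the approach relies on. A prologue or scheduler hook could call "node_is_good /tmp/RANK" and refuse to start a job on a non-zero status.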
The check is scheduler-independent and easily used by other utilities such as Nagios and LSF/RTM, and feeds into databases that let you see graphically when and how often node failures occur. I've used this method since before PBS even existed and find it much easier to control than the current methods being added to schedulers. If you add the rank rating calculation to the epilogue so that all new jobs do the checks (you might not want to do this for very short jobs) you will find you do not have to do the checks very frequently; or that they are easy to turn off when your node is completely loaded to optimize performance. ----- Original Message ----- From: "Michael Jennings" To: "Grigory Shamov" Cc: "Torque Users Mailing List" Sent: Tuesday, May 08, 2012 8:00 PM Subject: Re: [torqueusers] strategies for bad nodes > On Tuesday, 08 May 2012, at 16:20:39 (-0700), > Grigory Shamov wrote: > >> I've found that the most time-consuming checks would be the check_ps >> ones. So if all of them are switched off, NHC usually finishes in 5 >> sec. These tests are potentially useful though. > > Not surprising, since the only thing I couldn't fully internalize > into bash-ese was the gathering of info on running processes. :-) > > That said, I'm concerned that multiple check_ps checks would add > significantly more time *per check*. Once you have one or more > check_ps checks in your config, you'll already pay the price of > running the ps command, but it should ONLY run one time. After that, > all other check_ps checks should pull their data from the cache, so > the only additional penalty would be iterating through the cache array > and performing the logic. > > So you should be seeing a hit on the first check, but additional > checks should not carry a significant penalty (or at least not nearly > as great as the first one). Is that what you're seeing, or are you > seeing the same penalty per check_ps check? In the latter case, that > would tend to indicate a bug.
> > Just to rule out the possibility of running jobs affecting the tests, > you may want to try some test runs of NHC on idle nodes to see how > long they take. On idle semi-modern hardware (dual-socket 6-core > Intel X5650s at 2.67GHz, 96GB RAM), out of 654,885 total NHC runs, I > saw none at all that look longer than 4 seconds. (We have 4 check_ps > checks active in our config on that system.) > >> Am I understanding correctly that MOM would be blocked, waiting >> while a healthcheck is being executed? > > Yes, that is currently the case. Until TORQUE runs the health check > in a thread (hopefully later on in the 4.x development series), > pbs_mom will block completely while NHC is running. That's part of > why the efficiency and watchdog mechanism in NHC is so critical. > > Michael > > -- > Michael Jennings > Senior HPC Systems Engineer > High-Performance Computing Services > Lawrence Berkeley National Laboratory > Bldg 50B-3209E W: 510-495-2687 > MS 050B-3209 F: 510-486-8615 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From D.J.Baker at soton.ac.uk Wed May 9 10:02:10 2012 From: D.J.Baker at soton.ac.uk (Baker D.J.) Date: Wed, 9 May 2012 16:02:10 +0000 Subject: [torqueusers] LDAP users and torque Message-ID: <9230853A36E499458AF5EAD38C13DCD5022A1DF5@UOS-MSG00039-SI.soton.ac.uk> Hello, On our RedHat Linux 5* cluster we are running torque 2.5.7, and moab 6.1.4. I have recently set up an LDAP server, and I'm starting to populate it with users and groups for a consortium project that we're working on. I'm already using NIS for university users, and LDAP is intended for external people. I set up the LDAP client configuration files on the login nodes, the head node (running torque and moab) and (as a test) on one of the compute nodes. I double checked that I could "see" LDAP users on these client nodes using getent. 
I then logged in to one of the login nodes as an LDAP user and attempted to submit a job. My initial attempt to submit a job failed - the error message indicated that torque could not find the user in the /etc/passwd on the node running the torque server. Given that I had set up ldap.conf and nsswitch.conf on that node, this seems odd. As a work around I placed the LDAP users in the /etc/passwd file on the torque server node and all went well. Could someone please help me to understand what might be wrong here. Do I need to configure torque in a particular way to fully integrate with LDAP, for example? Best regards -- David. From WJEdsall at dow.com Wed May 9 09:35:37 2012 From: WJEdsall at dow.com (Edsall, William (WJ)) Date: Wed, 9 May 2012 15:35:37 +0000 Subject: [torqueusers] strategies for bad nodes In-Reply-To: <20120503212548.GQ9750@lbl.gov> References: <20120503212548.GQ9750@lbl.gov> Message-ID: Hi Michael, Sorry for the late response here. I admit that I only tried it out-of-the-box and gave up when I couldn't figure out how to get it to suppress the following error. We don't specifically mount /tmp so I assumed something was wrong. [root at compute-0-31 ~]# nhc ERROR Health check failed: check_fs_mount: /tmp not mounted I suppose I should spend some time with this and create a custom configuration. -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Michael Jennings Sent: Thursday, May 03, 2012 5:26 PM To: torqueusers at supercluster.org Subject: Re: [torqueusers] strategies for bad nodes On Thursday, 03 May 2012, at 17:18:57 (+0000), Edsall, William (WJ) wrote: > If anyone has suggestions for more useful 'checks' please > comment. These are pretty custom to our environment but I'd like to > hear of more. I tried the program from warewulf, but it wasn't > working well for me. 
If you can give more specific details on what "wasn't working well for me" means, I'd be happy to help. :-) Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From D.J.Baker at soton.ac.uk Wed May 9 10:14:20 2012 From: D.J.Baker at soton.ac.uk (Baker D.J.) Date: Wed, 9 May 2012 16:14:20 +0000 Subject: [torqueusers] FW: LDAP users and torque In-Reply-To: <9230853A36E499458AF5EAD38C13DCD5022A1DF5@UOS-MSG00039-SI.soton.ac.uk> References: <9230853A36E499458AF5EAD38C13DCD5022A1DF5@UOS-MSG00039-SI.soton.ac.uk> Message-ID: <9230853A36E499458AF5EAD38C13DCD5022A1E31@UOS-MSG00039-SI.soton.ac.uk> Hello, On our RedHat Linux 5* cluster we are running torque 2.5.7, and moab 6.1.4. I have recently set up an LDAP server, and I'm starting to populate it with users and groups for a consortium project that we're working on. I'm already using NIS for university users, and LDAP is intended for external people. I set up the LDAP client configuration files on the login nodes, the head node (running torque and moab) and (as a test) on one of the compute nodes. I double checked that I could "see" LDAP users on these client nodes using getent. I then logged in to one of the login nodes as an LDAP user and attempted to submit a job. My initial attempt to submit a job failed - the error message indicated that torque could not find the user in the /etc/passwd on the node running the torque server. Given that I had set up ldap.conf and nsswitch.conf on that node, this seems odd. As a work around I placed the LDAP users in the /etc/passwd file on the torque server node and all went well. Could someone please help me to understand what might be wrong here. 
Do I need to configure torque in a particular way to fully integrate with LDAP, for example? Best regards -- David. From philippe.Weill at latmos.ipsl.fr Wed May 9 10:55:05 2012 From: philippe.Weill at latmos.ipsl.fr (Philippe Weill) Date: Wed, 09 May 2012 18:55:05 +0200 Subject: [torqueusers] FW: LDAP users and torque In-Reply-To: <9230853A36E499458AF5EAD38C13DCD5022A1E31@UOS-MSG00039-SI.soton.ac.uk> References: <9230853A36E499458AF5EAD38C13DCD5022A1DF5@UOS-MSG00039-SI.soton.ac.uk> <9230853A36E499458AF5EAD38C13DCD5022A1E31@UOS-MSG00039-SI.soton.ac.uk> Message-ID: <4FAAA169.3040203@latmos.ipsl.fr> On 09/05/2012 18:14, Baker D.J. wrote: > Hello, > > On our RedHat Linux 5* cluster we are running torque 2.5.7, and moab 6.1.4. I have recently set up an LDAP server, and I'm starting to populate it with users and groups for a consortium project that we're working on. I'm already using NIS for university users, and LDAP is intended for external people. > > I set up the LDAP client configuration files on the login nodes, the head node (running torque and moab) and (as a test) on one of the compute nodes. I double checked that I could "see" LDAP users on these client nodes using getent. I then logged in to one of the login nodes as an LDAP user and attempted to submit a job. > > My initial attempt to submit a job failed - the error message indicated that torque could not find the user in the /etc/passwd on the node running the torque server. Given that I had set up ldap.conf and nsswitch.conf on that node, this seems odd. As a work around I placed the LDAP users in the /etc/passwd file on the torque server node and all went well. > > Could someone please help me to understand what might be wrong here. Do I need to configure torque in a particular way to fully integrate with LDAP, for example? > > Best regards -- David.
Perhaps using nscd on compute nodes -- Weill Philippe - Administrateur Systeme et Reseaux From mej at lbl.gov Wed May 9 11:22:26 2012 From: mej at lbl.gov (Michael Jennings) Date: Wed, 9 May 2012 10:22:26 -0700 Subject: [torqueusers] strategies for bad nodes In-Reply-To: References: <20120503212548.GQ9750@lbl.gov> Message-ID: <20120509172224.GE9750@lbl.gov> On Wednesday, 09 May 2012, at 15:35:37 (+0000), Edsall, William (WJ) wrote: > I admit that I only tried it out-of-the-box and gave up when I > couldn't figure out how to get it to suppress the following > error. We don't specifically mount /tmp so I assumed something was > wrong. > > [root at compute-0-31 ~]# nhc > ERROR Health check failed: check_fs_mount: /tmp not mounted As the documentation points out, the initial config is only a demonstration. It's not intended to be used in production. The wiki page gives a pretty detailed walk-through on how to create a config appropriate for your site, but I'm happy to answer questions if you run into any problems. :-) > I suppose I should spend some time with this and create a custom > configuration. If you choose to, please let me know if you run across any deficiencies in the documentation or the code. We've been using it heavily in production here, but every site is different, and the more sites we have working together, the better the end result will be! 
:-) Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From djohnson at osc.edu Wed May 9 11:38:38 2012 From: djohnson at osc.edu (Doug Johnson) Date: Wed, 09 May 2012 13:38:38 -0400 Subject: [torqueusers] FW: LDAP users and torque In-Reply-To: <9230853A36E499458AF5EAD38C13DCD5022A1E31@UOS-MSG00039-SI.soton.ac.uk> References: <9230853A36E499458AF5EAD38C13DCD5022A1DF5@UOS-MSG00039-SI.soton.ac.uk> <9230853A36E499458AF5EAD38C13DCD5022A1E31@UOS-MSG00039-SI.soton.ac.uk> Message-ID: At Wed, 9 May 2012 16:14:20 +0000, Baker D.J. wrote: > > Hello, > > On our RedHat Linux 5* cluster we are running torque 2.5.7, and moab 6.1.4. I have recently set up an LDAP server, and I'm starting to populate it with users and groups for a consortium project that we're working on. I'm already using NIS for university users, and LDAP is intended for external people. > > I set up the LDAP client configuration files on the login nodes, the head node (running torque and moab) and (as a test) on one of the compute nodes. I double checked that I could "see" LDAP users on these client nodes using getent. I then logged in to one of the login nodes as an LDAP user and attempted to submit a job. > > My initial attempt to submit a job failed - the error message indicated that torque could not find the user in the /etc/passwd on the node running the torque server. Given that I had set up ldap.conf and nsswitch.conf on that node, this seems odd. As a work around I placed the LDAP users in the /etc/passwd file on the torque server node and all went well. > > Could someone please help me to understand what might be wrong here. Do I need to configure torque in a particular way to fully integrate with LDAP, for example? > Hi, You may need to restart the pbs_mom after making a change to nsswitch.conf. 
Doug From D.J.Baker at soton.ac.uk Thu May 10 04:49:54 2012 From: D.J.Baker at soton.ac.uk (Baker D.J.) Date: Thu, 10 May 2012 10:49:54 +0000 Subject: [torqueusers] FW: LDAP users and torque In-Reply-To: References: <9230853A36E499458AF5EAD38C13DCD5022A1DF5@UOS-MSG00039-SI.soton.ac.uk> <9230853A36E499458AF5EAD38C13DCD5022A1E31@UOS-MSG00039-SI.soton.ac.uk>, Message-ID: <9230853A36E499458AF5EAD38C13DCD5022A206A@UOS-MSG00039-SI.soton.ac.uk> Hello, Thanks for your comments. For sure, I've discovered that it is necessary to restart the moms to make sure that the job (once submitted) executes properly on one of the compute nodes. In this case, however, it's the job submission that is failing (I've checked that nscd isn't running on any of the relevant nodes, by the way). So when I attempt to submit a job, I find, for example: [testuser at blue32 ~]$ qsub -l nodes=red0539 hello.sleep qsub: Bad UID for job execution MSG=User testuser does not exist in server password file I can work around this (badly) by placing the user "testuser" in the passwd file on the torque server node; however, do you know if it's important to restart the torque server in this situation? Or is there a torque server configuration option that needs enabling? Any other thoughts? Best regards -- David. ________________________________________ From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Doug Johnson [djohnson at osc.edu] Sent: 09 May 2012 18:38 To: Torque Users Mailing List Subject: Re: [torqueusers] FW: LDAP users and torque At Wed, 9 May 2012 16:14:20 +0000, Baker D.J. wrote: > > Hello, > > On our RedHat Linux 5* cluster we are running torque 2.5.7, and moab 6.1.4. I have recently set up an LDAP server, and I'm starting to populate it with users and groups for a consortium project that we're working on. I'm already using NIS for university users, and LDAP is intended for external people.
> > I set up the LDAP client configuration files on the login nodes, the head node (running torque and moab) and (as a test) on one of the compute nodes. I double checked that I could "see" LDAP users on these client nodes using getent. I then logged in to one of the login nodes as an LDAP user and attempted to submit a job. > > My initial attempt to submit a job failed - the error message indicated that torque could not find the user in the /etc/passwd on the node running the torque server. Given that I had set up ldap.conf and nsswitch.conf on that node, this seems odd. As a work around I placed the LDAP users in the /etc/passwd file on the torque server node and all went well. > > Could someone please help me to understand what might be wrong here. Do I need to configure torque in a particular way to fully integrate with LDAP, for example? > Hi, You may need to restart the pbs_mom after making a change to nsswitch.conf. Doug _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From jjc at iastate.edu Thu May 10 09:10:41 2012 From: jjc at iastate.edu (Coyle, James J [ITACD]) Date: Thu, 10 May 2012 15:10:41 +0000 Subject: [torqueusers] FW: LDAP users and torque In-Reply-To: <9230853A36E499458AF5EAD38C13DCD5022A206A@UOS-MSG00039-SI.soton.ac.uk> References: <9230853A36E499458AF5EAD38C13DCD5022A1DF5@UOS-MSG00039-SI.soton.ac.uk> <9230853A36E499458AF5EAD38C13DCD5022A1E31@UOS-MSG00039-SI.soton.ac.uk>, <9230853A36E499458AF5EAD38C13DCD5022A206A@UOS-MSG00039-SI.soton.ac.uk> Message-ID: <242421BFAF465844BE24EB90BB97E22109091143@ITSDAG1D.its.iastate.edu> David, Many daemons cache configuration settings in tables in memory rather than re-reading the configuration file on each request; this makes them much faster. Unfortunately, this means that they can keep operating on stale data after you change the configuration. A restart of the service will certainly always work.
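The caching behaviour just described can be sketched with a toy bash "daemon" that reads its setting once and refreshes it only when it receives SIGHUP, a common reload convention. Everything here is hypothetical: the file path, the variable names, and the function load_config are illustrative only. The sketch says nothing about which signals pbs_mom, pbs_server, or a given scheduler actually honor, or whether a reload would pick up nsswitch.conf changes.

```shell
#!/bin/bash
# Hypothetical daemon skeleton: the setting is held in memory and is only
# re-read from disk when the process receives SIGHUP.
CONFIG_FILE=${CONFIG_FILE:-/tmp/example-daemon.conf}   # illustrative path
cached_setting=""

load_config() {
    # Cache the first line of the file; a missing file yields an empty value.
    cached_setting=$(head -n 1 "$CONFIG_FILE" 2>/dev/null)
}

trap load_config HUP    # re-read the file on "kill -HUP <pid>"
load_config             # initial read at startup
```

A real daemon would follow this with its main loop. Until the signal arrives, edits to the file are invisible to the running process, which is why a changed configuration can appear to be ignored.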

Rather than have to restart the daemon after each configuration change, daemons are often written with a signal handler which intercepts the SIGHUP signal and causes the daemon to re-read the configuration file. You could try

killall -HUP pbs_mom

on a compute node to see if this works. If so, then after a configuration change you just send the pbs_mom the HUP signal. You may have to do this with the pbs_server and whatever scheduler you are using: pbs_sched, MAUI or MOAB scheduler.

>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Baker D.J.
>Sent: Thursday, May 10, 2012 5:50 AM
>To: Torque Users Mailing List
>Subject: Re: [torqueusers] FW: LDAP users and torque
>
>Hello,
>
>Thanks for your comments. For sure, I've discovered that it is
>necessary to restart the moms to make sure that the job (once
>submitted) executes properly on one of the compute nodes. In this
>case, however, it's the job submission that is failing (I've checked
>that nscd isn't running on any of the relevant nodes, by the way).
>
>So when I attempt to submit a job, I find, for example:
>
>[testuser at blue32 ~]$ qsub -l nodes=red0539 hello.sleep
>qsub: Bad UID for job execution MSG=User testuser does not exist in
>server password file
>
>I can work around this (badly) by placing the user "testuser" in the
>passwd file on the torque server node, however do you know if it's
>important to restart the torque server in this situation? Or is
>there a torque server configuration option that needs enabling? Any
>other thoughts?
>
>Best regards -- David.
>________________________________________
>From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Doug Johnson [djohnson at osc.edu]
>Sent: 09 May 2012 18:38
>To: Torque Users Mailing List
>Subject: Re: [torqueusers] FW: LDAP users and torque
>
>At Wed, 9 May 2012 16:14:20 +0000,
>Baker D.J.
wrote: >> >> Hello, >> >> On our RedHat Linux 5* cluster we are running torque 2.5.7, and >moab 6.1.4. I have recently set up an LDAP server, and I'm starting >to populate it with users and groups for a consortium project that >we're working on. I'm already using NIS for university users, and >LDAP is intended for external people. >> >> I set up the LDAP client configuration files on the login nodes, >the head node (running torque and moab) and (as a test) on one of >the compute nodes. I double checked that I could "see" LDAP users on >these client nodes using getent. I then logged in to one of the >login nodes as an LDAP user and attempted to submit a job. >> >> My initial attempt to submit a job failed - the error message >indicated that torque could not find the user in the /etc/passwd on >the node running the torque server. Given that I had set up >ldap.conf and nsswitch.conf on that node, this seems odd. As a work >around I placed the LDAP users in the /etc/passwd file on the torque >server node and all went well. >> >> Could someone please help me to understand what might be wrong >here. Do I need to configure torque in a particular way to fully >integrate with LDAP, for example? >> > >Hi, > >You may need to restart the pbs_mom after making a change to >nsswitch.conf. 
> >Doug >_______________________________________________ >torqueusers mailing list >torqueusers at supercluster.org >http://www.supercluster.org/mailman/listinfo/torqueusers >_______________________________________________ >torqueusers mailing list >torqueusers at supercluster.org >http://www.supercluster.org/mailman/listinfo/torqueusers From michasch at uni-muenster.de Thu May 10 06:09:24 2012 From: michasch at uni-muenster.de (Michael Schulz) Date: Thu, 10 May 2012 14:09:24 +0200 Subject: [torqueusers] LDAP users and torque In-Reply-To: <9230853A36E499458AF5EAD38C13DCD5022A206A@UOS-MSG00039-SI.soton.ac.uk> References: <9230853A36E499458AF5EAD38C13DCD5022A1DF5@UOS-MSG00039-SI.soton.ac.uk> <9230853A36E499458AF5EAD38C13DCD5022A1E31@UOS-MSG00039-SI.soton.ac.uk>, <9230853A36E499458AF5EAD38C13DCD5022A206A@UOS-MSG00039-SI.soton.ac.uk> Message-ID: Am 10.05.2012 um 12:49 schrieb Baker D.J.: Hi, have you compiled torque by yourself? How did you configure it? Michael > Hello, > > Thanks for your comments. For sure, I've discovered that it is necessary to restart the moms to make sure that the job (once submitted) executes properly on one of the compute nodes. In this case, however, it's the job submission that is failing (I've checked that nscd isn't running on any of the relevant nodes, by the way) . > > So when I attempt to submit a job, I find, for example: > > [testuser at blue32 ~]$ qsub -l nodes=red0539 hello.sleep > qsub: Bad UID for job execution MSG=User testuser does not exist in server password file > > I can work around this (badly) be placing the user "testuser" in the passwd file on the torque server node, however do you know if it's important to restart the torque server in this situation? Or is there a torque server configuration option that needs enabling? Any other thoughts? > > Best regards -- David. 
> ________________________________________ > From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Doug Johnson [djohnson at osc.edu] > Sent: 09 May 2012 18:38 > To: Torque Users Mailing List > Subject: Re: [torqueusers] FW: LDAP users and torque > > At Wed, 9 May 2012 16:14:20 +0000, > Baker D.J. wrote: >> >> Hello, >> >> On our RedHat Linux 5* cluster we are running torque 2.5.7, and moab 6.1.4. I have recently set up an LDAP server, and I'm starting to populate it with users and groups for a consortium project that we're working on. I'm already using NIS for university users, and LDAP is intended for external people. >> >> I set up the LDAP client configuration files on the login nodes, the head node (running torque and moab) and (as a test) on one of the compute nodes. I double checked that I could "see" LDAP users on these client nodes using getent. I then logged in to one of the login nodes as an LDAP user and attempted to submit a job. >> >> My initial attempt to submit a job failed - the error message indicated that torque could not find the user in the /etc/passwd on the node running the torque server. Given that I had set up ldap.conf and nsswitch.conf on that node, this seems odd. As a work around I placed the LDAP users in the /etc/passwd file on the torque server node and all went well. >> >> Could someone please help me to understand what might be wrong here. Do I need to configure torque in a particular way to fully integrate with LDAP, for example? >> > > Hi, > > You may need to restart the pbs_mom after making a change to > nsswitch.conf. 
> > Doug > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- Michael Schulz Institut f?r Geophysik, WWU M?nster Corrensstr. 24, 48149 M?nster Tel.: +49 251 83 33938, Fax: +49 251 83 36100 From bircoph at gmail.com Thu May 10 14:33:30 2012 From: bircoph at gmail.com (Andrew Savchenko) Date: Fri, 11 May 2012 00:33:30 +0400 Subject: [torqueusers] Can't limit resources for a queue Message-ID: <20120511003330.241a8fcd.bircoph@gmail.com> Hello, I use torque-3.0.5 and have the following problem. In our cluster each node is configured with np value corresponding to the number of physical cores available, CPUs are not capable of hyper-threading. I want to allow queue "xxl" to use no more than 64 cores and to run no more than 64 jobs. The reason to limit number of CPU cores available in addition to number of jobs is that some jobs may be MPI tasks and request a lot of cores, or just local multi-process task. A queue is configured as follows: # qmgr -c "list queue xxl" Queue xxl queue_type = Execution max_queuable = 500 max_user_queuable = 70 total_jobs = 0 state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 max_running = 64 resources_max.walltime = 4320:00:00 resources_min.walltime = 168:00:01 resources_default.walltime = 2160:00:00 mtime = Sat May 5 20:35:12 2012 disallowed_types = interactive resources_available.procct = 64 max_user_run = 32 enabled = True started = True While max_running and max_user_run limits work well, user is able to submit and execute job requesting more CPU cores than allowed: $ qsub -q xxl -l nodes=10:ppn=8 test.job 57.master Job is executed and wc -l $PBS_NODEFILE inside the jobs confirms that 80 cores are allocated. 
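
For reference, the 80 cores follow directly from the request: 10 nodes times 8 processors per node, which is above the intended limit of 64. A minimal sketch of that arithmetic, assuming every "+"-separated entry of the nodes specification starts with a numeric node count (host names are not handled) and using procct only as an illustrative helper name:

```shell
#!/bin/sh
# Sum nodes*ppn over the "+"-separated entries of a nodes specification.
# The body runs in a subshell so the IFS change does not leak out.
procct() (
    spec=$1
    total=0
    IFS='+'
    for part in $spec; do
        nodes=${part%%:*}            # leading node count
        case $part in
            *ppn=*)
                ppn=${part##*ppn=}   # text after "ppn="
                ppn=${ppn%%:*}       # drop any trailing properties
                ;;
            *)
                ppn=1                # no ppn given: one processor per node
                ;;
        esac
        total=$((total + nodes * ppn))
    done
    echo "$total"
)

procct "10:ppn=8"    # prints 80
```

Comparing the printed value with the queue's procct setting shows whether a given request should have been accepted.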
It looks like resources_available.procct value is ignored. I do not understand why. Admin manual: http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/4.1queueconfig.php says: 1) procct sets limits on the total number of execution slots (procs) allocated to a job. The number of procs is calculated by summing the products of all node and ppn entries for a job. 2) resources_available Description: Specifies to cumulative resources available to all jobs running in the queue. I use maui-3.3.1 as a scheduler, just simple default config to work with PBS, no queue- or node-specific options yet. How can I fix this issue? Best regards, Andrew Savchenko -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120511/ccbc5c4f/attachment.bin From Ole.H.Nielsen at fysik.dtu.dk Fri May 11 01:15:16 2012 From: Ole.H.Nielsen at fysik.dtu.dk (Ole Holm Nielsen) Date: Fri, 11 May 2012 09:15:16 +0200 Subject: [torqueusers] strategies for bad nodes In-Reply-To: <20120503212548.GQ9750@lbl.gov> References: <20120503212548.GQ9750@lbl.gov> Message-ID: <4FACBC84.7010604@fysik.dtu.dk> Hi Michael, I'm really happy to discover your Node Health Check project, having asked several times in the past this list for suggestions, but your solution seems to be the most mature one I've seen so far! I've installed the warewulf-nhc RHEL5 RPM from http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check and configured the checks in nhc.conf. One snag I've discovered so far: On a free node nhc returns this error: /etc/nhc/scripts/ww_job.nhc: line 18: /var/spool/torque/mom_priv/jobs/*.JB: No such file or directory whereas busy Torque nodes pass the check. 
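
The message above is what a shell loop typically produces when a glob matches nothing: the pattern is left unexpanded and is then used as if it were a file name. A minimal sketch of one guard that works in bash 3.2 and in plain sh, with count_job_files as an illustrative helper and the directory passed as an argument (on the node in question it would be /var/spool/torque/mom_priv/jobs):

```shell
#!/bin/sh
# Count the *.JB files in a directory, tolerating the case of no matches.
count_job_files() {
    jobdir=$1
    count=0
    for jobfile in "$jobdir"/*.JB; do
        # With no matches the pattern stays literal; skip it.
        [ -e "$jobfile" ] || continue
        count=$((count + 1))
    done
    echo "$count"
}
```

In bash, running shopt -s nullglob before the loop is another common way to make an unmatched pattern expand to nothing.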
Obviously the script's line 18: for JOBFILE in /var/spool/torque/mom_priv/jobs/*.JB ; do needs to be modified for the case of no *.JB files, but I don't know the best way to do this. FYI, our CentOS5 bash version is bash-3.2-24.el5. Another little detail that requires modification at our site is the scripts /usr/libexec/nhc/node* where the hard-coded PBSNODES="/usr/bin/pbsnodes" doesn't match our installation in /usr/local/bin. Would this more flexible solution work? PATH="/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/bin" PBSNODES="pbsnodes" Thanks a lot, Ole -- Ole Holm Nielsen Department of Physics, Technical University of Denmark From D.J.Baker at soton.ac.uk Fri May 11 02:06:17 2012 From: D.J.Baker at soton.ac.uk (Baker D.J.) Date: Fri, 11 May 2012 08:06:17 +0000 Subject: [torqueusers] LDAP users and torque In-Reply-To: References: <9230853A36E499458AF5EAD38C13DCD5022A1DF5@UOS-MSG00039-SI.soton.ac.uk> <9230853A36E499458AF5EAD38C13DCD5022A1E31@UOS-MSG00039-SI.soton.ac.uk>, <9230853A36E499458AF5EAD38C13DCD5022A206A@UOS-MSG00039-SI.soton.ac.uk>, Message-ID: <9230853A36E499458AF5EAD38C13DCD5022A2445@UOS-MSG00039-SI.soton.ac.uk> Hi Michael, We configured and installed torque ourselves. The configuration options that we used are: ./configure --prefix=/local/software/torque/2.5.7 --with-maildomain=soton.ac.uk --disable-gui --enable-nvidia-gpus --disable-spool Best regards -- David. ________________________________________ From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Michael Schulz [michasch at uni-muenster.de] Sent: 10 May 2012 13:09 To: Torque Users Mailing List Subject: Re: [torqueusers] LDAP users and torque Am 10.05.2012 um 12:49 schrieb Baker D.J.: Hi, have you compiled torque by yourself? How did you configure it? Michael > Hello, > > Thanks for your comments. 
For sure, I've discovered that it is necessary to restart the moms to make sure that the job (once submitted) executes properly on one of the compute nodes. In this case, however, it's the job submission that is failing (I've checked that nscd isn't running on any of the relevant nodes, by the way) . > > So when I attempt to submit a job, I find, for example: > > [testuser at blue32 ~]$ qsub -l nodes=red0539 hello.sleep > qsub: Bad UID for job execution MSG=User testuser does not exist in server password file > > I can work around this (badly) be placing the user "testuser" in the passwd file on the torque server node, however do you know if it's important to restart the torque server in this situation? Or is there a torque server configuration option that needs enabling? Any other thoughts? > > Best regards -- David. > ________________________________________ > From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Doug Johnson [djohnson at osc.edu] > Sent: 09 May 2012 18:38 > To: Torque Users Mailing List > Subject: Re: [torqueusers] FW: LDAP users and torque > > At Wed, 9 May 2012 16:14:20 +0000, > Baker D.J. wrote: >> >> Hello, >> >> On our RedHat Linux 5* cluster we are running torque 2.5.7, and moab 6.1.4. I have recently set up an LDAP server, and I'm starting to populate it with users and groups for a consortium project that we're working on. I'm already using NIS for university users, and LDAP is intended for external people. >> >> I set up the LDAP client configuration files on the login nodes, the head node (running torque and moab) and (as a test) on one of the compute nodes. I double checked that I could "see" LDAP users on these client nodes using getent. I then logged in to one of the login nodes as an LDAP user and attempted to submit a job. 
>> >> My initial attempt to submit a job failed - the error message indicated that torque could not find the user in the /etc/passwd on the node running the torque server. Given that I had set up ldap.conf and nsswitch.conf on that node, this seems odd. As a work around I placed the LDAP users in the /etc/passwd file on the torque server node and all went well. >> >> Could someone please help me to understand what might be wrong here. Do I need to configure torque in a particular way to fully integrate with LDAP, for example? >> > > Hi, > > You may need to restart the pbs_mom after making a change to > nsswitch.conf. > > Doug > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- Michael Schulz Institut f?r Geophysik, WWU M?nster Corrensstr. 24, 48149 M?nster Tel.: +49 251 83 33938, Fax: +49 251 83 36100 _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From jjc at iastate.edu Fri May 11 09:01:33 2012 From: jjc at iastate.edu (Coyle, James J [ITACD]) Date: Fri, 11 May 2012 15:01:33 +0000 Subject: [torqueusers] strategies for bad nodes In-Reply-To: <4FACBC84.7010604@fysik.dtu.dk> References: <20120503212548.GQ9750@lbl.gov> <4FACBC84.7010604@fysik.dtu.dk> Message-ID: <242421BFAF465844BE24EB90BB97E221090912FD@ITSDAG1D.its.iastate.edu> Ole, In scripts for this kind of situation (possible empty list), I check whether the list is empty first, and only execute the for when the list is non-empty. In ksh this is done in the following way: replace: for JOBFILE in /var/spool/torque/mom_priv/jobs/*.JB ; do ... 
done

with

JOBLIST=`/bin/ls /var/spool/torque/mom_priv/jobs/ | grep '\.JB$'`
if [ -n "${JOBLIST}" ] ; then
  for JOBFILE in ${JOBLIST} ; do
    ...
  done
fi

Note that JOBFILE then holds only the file name, not the full path. In other shells the syntax may be different.

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/

>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ole Holm Nielsen
>Sent: Friday, May 11, 2012 2:15 AM
>To: torqueusers at supercluster.org
>Subject: Re: [torqueusers] strategies for bad nodes
>
>Hi Michael,
>
>I'm really happy to discover your Node Health Check project, having
>asked several times in the past this list for suggestions, but your
>solution seems to be the most mature one I've seen so far! I've
>installed the warewulf-nhc RHEL5 RPM from
>http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check and
>configured the checks in nhc.conf.
>
>One snag I've discovered so far: On a free node nhc returns this
>error:
>/etc/nhc/scripts/ww_job.nhc: line 18:
>/var/spool/torque/mom_priv/jobs/*.JB: No such file or directory
>whereas busy Torque nodes pass the check.
>
>Obviously the script's line 18:
>     for JOBFILE in /var/spool/torque/mom_priv/jobs/*.JB ; do
>needs to be modified for the case of no *.JB files, but I don't know the
>best way to do this. FYI, our CentOS5 bash version is bash-3.2-24.el5.
>
>Another little detail that requires modification at our site is the
>scripts /usr/libexec/nhc/node* where the hard-coded
>PBSNODES="/usr/bin/pbsnodes" doesn't match our installation in
>/usr/local/bin. Would this more flexible solution work?
>PATH="/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/bin" >PBSNODES="pbsnodes" > >Thanks a lot, >Ole > >-- >Ole Holm Nielsen >Department of Physics, Technical University of Denmark >_______________________________________________ >torqueusers mailing list >torqueusers at supercluster.org >http://www.supercluster.org/mailman/listinfo/torqueusers From mej at lbl.gov Fri May 11 10:26:56 2012 From: mej at lbl.gov (Michael Jennings) Date: Fri, 11 May 2012 09:26:56 -0700 Subject: [torqueusers] strategies for bad nodes In-Reply-To: <4FACBC84.7010604@fysik.dtu.dk> References: <20120503212548.GQ9750@lbl.gov> <4FACBC84.7010604@fysik.dtu.dk> Message-ID: <20120511162655.GF9879@lbl.gov> On Friday, 11 May 2012, at 09:15:16 (+0200), Ole Holm Nielsen wrote: > I'm really happy to discover your Node Health Check project, having > asked several times in the past this list for suggestions, but your > solution seems to be the most mature one I've seen so far! Thanks! :-) Glad to hear you like it! > One snag I've discovered so far: On a free node nhc returns this error: > /etc/nhc/scripts/ww_job.nhc: line 18: > /var/spool/torque/mom_priv/jobs/*.JB: No such file or directory > whereas busy Torque nodes pass the check. > > Obviously the script's line 18: > for JOBFILE in /var/spool/torque/mom_priv/jobs/*.JB ; do > needs to be modified for the case of no *.JB files, but I don't know > the best way to do this. FYI, our CentOS5 bash version is > bash-3.2-24.el5. Doh. Does the attached patch fix the issue? > Another little detail that requires modification at our site is the > scripts /usr/libexec/nhc/node* where the hard-coded > PBSNODES="/usr/bin/pbsnodes" doesn't match our installation in > /usr/local/bin. Would this more flexible solution work? > PATH="/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/bin" > PBSNODES="pbsnodes" This situation is fixed in SVN and will be much easier to deal with in 1.2. 
You'll be able to override that setting in the config file, like
so:

 * || export PBSNODES="/usr/local/bin/pbsnodes"

Is that a reasonable solution? You can also override $PATH in much
the same way, and in 1.2, $PBSNODES will default to simply "pbsnodes"
instead of an absolute path.

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From gas5x at yahoo.com Fri May 11 10:38:26 2012
From: gas5x at yahoo.com (Grigory Shamov)
Date: Fri, 11 May 2012 09:38:26 -0700 (PDT)
Subject: [torqueusers] strategies for bad nodes
In-Reply-To: <20120511162655.GF9879@lbl.gov>
Message-ID: <1336754306.51661.YahooMailClassic@web111310.mail.gq1.yahoo.com>

Dear Michael,

I've increased the $TIMEOUT to 9 sec. It seems to work, but perhaps it interacted weirdly with Torque; server might be having problems interacting when modes when.

What are the typical settings for Torque you are using on the clusters running NHC? (And which settings might be relevant?) For example, 'server node_check_rate' and 'server tcp_timeout'?

--
Grigory Shamov
HPC Analyst
University of Manitoba

--- On Fri, 5/11/12, Michael Jennings wrote:

> From: Michael Jennings
> Subject: Re: [torqueusers] strategies for bad nodes
> To: "Ole Holm Nielsen"
> Cc: torqueusers at supercluster.org
> Date: Friday, May 11, 2012, 9:26 AM
> On Friday, 11 May 2012, at 09:15:16 (+0200),
> Ole Holm Nielsen wrote:
>
> > I'm really happy to discover your Node Health Check project, having
> > asked several times in the past this list for suggestions, but your
> > solution seems to be the most mature one I've seen so far!
>
> Thanks!  :-)  Glad to hear you like it!
>
> > One snag I've discovered so far: On a free node nhc returns this error:
> > /etc/nhc/scripts/ww_job.nhc: line 18:
> > /var/spool/torque/mom_priv/jobs/*.JB: No such file or directory
> > whereas busy Torque nodes pass the check.
> > > > Obviously the script's line 18: > >? ???for JOBFILE in > /var/spool/torque/mom_priv/jobs/*.JB ; do > > needs to be modified for the case of no *.JB files, but > I don't know > > the best way to do this.? FYI, our CentOS5 bash > version is > > bash-3.2-24.el5. > > Doh.? Does the attached patch fix the issue? > > > Another little detail that requires modification at our > site is the > > scripts /usr/libexec/nhc/node* where the hard-coded > > PBSNODES="/usr/bin/pbsnodes" doesn't match our > installation in > > /usr/local/bin. Would this more flexible solution > work? > > PATH="/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/bin" > > PBSNODES="pbsnodes" > > This situation is fixed in SVN and will be much easier to > deal with in > 1.2.? You'll be able to override that setting in the > config file, like > so: > > * || export PBSNODES="/usr/local/bin/pbsnodes" > > Is that a reasonable solution?? You can also override > $PATH in much > the same way, and in 1.2, $PBSNODES will default to simply > "pbsnodes" > instead of an absolute path. > > Michael > > -- > Michael Jennings > Senior HPC Systems Engineer > High-Performance Computing Services > Lawrence Berkeley National Laboratory > Bldg 50B-3209E? ? ? ? W: 510-495-2687 > MS 050B-3209? ? ? ? ? 
F: > 510-486-8615 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From Ole.H.Nielsen at fysik.dtu.dk Fri May 11 13:53:41 2012 From: Ole.H.Nielsen at fysik.dtu.dk (Ole Holm Nielsen) Date: Fri, 11 May 2012 21:53:41 +0200 Subject: [torqueusers] strategies for bad nodes In-Reply-To: <20120511162655.GF9879@lbl.gov> References: <20120503212548.GQ9750@lbl.gov> <4FACBC84.7010604@fysik.dtu.dk> <20120511162655.GF9879@lbl.gov> Message-ID: <4FAD6E45.60609@fysik.dtu.dk> On 11-05-2012 18:26, Michael Jennings wrote: >> One snag I've discovered so far: On a free node nhc returns this error: >> /etc/nhc/scripts/ww_job.nhc: line 18: >> /var/spool/torque/mom_priv/jobs/*.JB: No such file or directory >> whereas busy Torque nodes pass the check. >> >> Obviously the script's line 18: >> for JOBFILE in /var/spool/torque/mom_priv/jobs/*.JB ; do >> needs to be modified for the case of no *.JB files, but I don't know >> the best way to do this. FYI, our CentOS5 bash version is >> bash-3.2-24.el5. > > Doh. Does the attached patch fix the issue? Perhaps, but the attachment seems to be missing. Is the patch in the SVN now? >> Another little detail that requires modification at our site is the >> scripts /usr/libexec/nhc/node* where the hard-coded >> PBSNODES="/usr/bin/pbsnodes" doesn't match our installation in >> /usr/local/bin. Would this more flexible solution work? >> PATH="/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/bin" >> PBSNODES="pbsnodes" > > This situation is fixed in SVN and will be much easier to deal with in > 1.2. You'll be able to override that setting in the config file, like > so: > > * || export PBSNODES="/usr/local/bin/pbsnodes" > > Is that a reasonable solution? You can also override $PATH in much > the same way, and in 1.2, $PBSNODES will default to simply "pbsnodes" > instead of an absolute path. Sounds really good. 
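
The idea discussed above (default to the bare command name, let the site override it, and rely on $PATH for lookup) can be sketched in a few lines of sh. This is only an illustration, not code from NHC; resolve_tool is a hypothetical helper name.

```shell
#!/bin/sh
# Use the caller's PBSNODES if it is set, otherwise the bare command name.
PBSNODES="${PBSNODES:-pbsnodes}"

# Print where a tool resolves to via PATH; return non-zero if it is not found.
resolve_tool() {
    command -v "$1"
}
```

A check script could then call resolve_tool "$PBSNODES" and report a clear error when nothing is found, instead of failing later on a hard-coded path.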

A simple question: Could you specify explicitly on the NHC Wiki page exactly the command used to download NHC SVN versions like 1.2? I'm not so familiar with Trac, so maybe I'm overlooking something obvious.

Thanks,
Ole

From stijn.deweirdt at ugent.be Sat May 12 13:51:47 2012
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Sat, 12 May 2012 21:51:47 +0200
Subject: [torqueusers] migrating from 3 to 4
Message-ID: <4FAEBF53.4070703@ugent.be>

hi all,

we are migrating from 3.0.3 to 4.0.1 and are seeing some issues.

we took a simple backup by tarring the pbs spool dir and then restoring it on the 4.0.1 machine.
for regular jobs this looks fine (qstat -f shows a
    comment = required class not found
line)

the array job format is different and thus the array jobs are not recognised at all.

is there a proper way to backup and restore? and is there a way to convert the array jobs so they can be restored?

many thanks

stijn

From stijn.deweirdt at ugent.be Sat May 12 14:10:13 2012
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Sat, 12 May 2012 22:10:13 +0200
Subject: [torqueusers] migrating from 3 to 4
In-Reply-To: <4FAEBF53.4070703@ugent.be>
References: <4FAEBF53.4070703@ugent.be>
Message-ID: <4FAEC3A5.9020803@ugent.be>

On 05/12/2012 09:51 PM, Stijn De Weirdt wrote:
> hi all,
>
> we are migrating from 3.0.3 to 4.0.1 and are seeing some issues.
>
> we took a simple backup by tarring the pbs spool dir and then restoring it
> on the 4.0.1 machine.
> for regular jobs this looks fine (qstat -f shows a
>     comment = required class not found
> line)

this seems to come from moab. safe to ignore i guess.

>
> the array job format is different and thus the array jobs are not
> recognised at all.
>
> is there a proper way to backup and restore? and is there a way to
> convert the array jobs so they can be restored?
>
> many thanks
>
> stijn

From j.blank at fz-juelich.de Mon May 14 07:08:20 2012
From: j.blank at fz-juelich.de (Joerg Blank)
Date: Mon, 14 May 2012 15:08:20 +0200
Subject: [torqueusers] Last best version of the retrieval
Message-ID:

/scratch/2/joergb/essence/tracer_v7/ret_final_v0/

Regards,
Jörg

From j.blank at fz-juelich.de Mon May 14 07:11:20 2012
From: j.blank at fz-juelich.de (Joerg Blank)
Date: Mon, 14 May 2012 15:11:20 +0200
Subject: [torqueusers] Last best version of the retrieval
In-Reply-To:
References:
Message-ID:

Obviously it's possible to send an email in Thunderbird to nowhere ... and it seems like nowhere is this mailing list.

I'm sorry for the misdirected mail, please ignore :-)

Regards,
Jörg Blank

From craig.tierney at noaa.gov Thu May 10 16:42:02 2012
From: craig.tierney at noaa.gov (Craig Tierney)
Date: Thu, 10 May 2012 16:42:02 -0600
Subject: [torqueusers] python in qsub
In-Reply-To:
References:
Message-ID: <4FAC443A.40506@noaa.gov>

Christina,

The environment never gets set up correctly in batch jobs (at least with torque and SGE) when using sh, bash, or ksh because their behavior changes between interactive and non-interactive use. Have the user add "--login" to the script call (#!/bin/sh --login). See if that sets all the paths correctly and resolves the problem.

Craig

On 5/1/12 11:49 AM, Christina Salls wrote:
> Hi all,
>
> My configuration consists of Torque 2.5.9 running on RHEL 6 on a
> single cluster.
I have a user that is able to interactively run python > and R, but when his job is submitted to Torque, it fails with this message: > > Traceback (most recent call last): > > File "../progs/Kriging_Batch.py", line 110, in > > import rpy2.robjects as robjects > > ImportError: No module named rpy2.robjects > > > I asked him to include a printenv in his script and compared it to the > environment variables that are inherent in his login. I could not tell > from the comparison what might be causing this problem. > > > His script looks like this: > > > [root at zeus batch]# more runkrig.sh > > #!/bin/sh > > #submit with qsub runkrig.sh > > #PBS -N GaugesJob > > #PBS -l select=1:ncpus=1 > > #PBS -q hydrology > > #PBS -M tim.hunter at noaa.gov > > source /opt/intel/composer_xe_2011_sp1.7.256/bin/compilervars.sh intel64 > > cd /zeus/d2/users/hunter/lbrm/kriging/batch > > echo 'Begin Kriging of Met Data with python and R' > > date > > python ../progs/Kriging_Batch.py $1 $2 $3 > > echo 'Kriging operations complete!' > > date > > > The output file looks like this: > > > [root at zeus batch]# more GaugesJob.o786 > > Begin Kriging of Met Data with python and R > > Fri Apr 27 13:44:42 CDT 2012 > > Kriging operations complete! > > Fri Apr 27 13:44:42 CDT 2012 > > > > The Kriging_Batch.py script works just fine interactively. If I run the > import command interactively, it also works. > > > eg. > > python -c "import rpy2.robjects as robjects" > > > I am sure there is a simple explanation, and if any of you have any > clues to lead me in the right direction, I would greatly appreciate it. > > > Even in the torque environment, the other python imports are working > properly. It only seems to be choking on the rpy2 import. 
> > > This is the one portion of the script > > > #-------------------------------------------------------------------------------------- > > import os > > from os.path import normpath > > import sys > > import shutil > > import csv > > > import rpy2.objects as robjects > > > # > > r = robjects.r > > path = os.getcwd() > > print "curdir = %s" %path > > > # > > oldPath = os.environ['PATH'].split(os.pathsep) > > newPath = os.environ['PATH'].split(os.pathsep) > > newPath = os.pathsep.join(newPath[len(oldPath):] + newPath[:len(oldPath)]) > > > # > > # Load the R functions > > # > > print "load the R functions" > > r.load('/zeus/d2/users/hunter/lbrm/kriging/progs/do_krig_batch.RData') > > > # > > # Force stdout into an unbuffered mode > > # > > sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) > > > > Any advice, clues, hints, moral support, etc.... welcomed and appreciated. > > > Thanks, > > > Christina > > > > > -- > Christina A. Salls > GLERL Computer Group > help.glerl at noaa.gov > Help Desk x2127 > Christina.Salls at noaa.gov > Voice Mail 734-741-2446 > > From christina.salls at noaa.gov Fri May 11 05:26:09 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Fri, 11 May 2012 07:26:09 -0400 Subject: [torqueusers] python in qsub In-Reply-To: <4FAC443A.40506@noaa.gov> References: <4FAC443A.40506@noaa.gov> Message-ID: Thanks Craig! Good suggestion. On Thu, May 10, 2012 at 6:42 PM, Craig Tierney wrote: > Christina, > > The environment never gets setup correctly in batch jobs (at least with > torque and SGE) when using sh, bash, or ksh because their behavior changes > between interactive and non-interactive use. Have the user add "--login" > to the script call (#!/bin/sh --login). See if that sets all the paths > correctly and resolves the problem. > > Craig > > > > On 5/1/12 11:49 AM, Christina Salls wrote: > >> Hi all, >> >> My configuration consists of Torque 2.5.9 running on RHEL 6 on a >> single cluster. 
I have a user that is able to interactively run python >> and R, but when his job is submitted to Torque, it fails with this >> message: >> >> Traceback (most recent call last): >> >> File "../progs/Kriging_Batch.py", line 110, in >> >> import rpy2.robjects as robjects >> >> ImportError: No module named rpy2.robjects >> >> >> I asked him to include a printenv in his script and compared it to the >> environment variables that are inherent in his login. I could not tell >> from the comparison what might be causing this problem. >> >> >> His script looks like this: >> >> >> [root at zeus batch]# more runkrig.sh >> >> #!/bin/sh >> >> #submit with qsub runkrig.sh >> >> #PBS -N GaugesJob >> >> #PBS -l select=1:ncpus=1 >> >> #PBS -q hydrology >> >> #PBS -M tim.hunter at noaa.gov >> >> >> source /opt/intel/composer_xe_2011_**sp1.7.256/bin/compilervars.sh >> intel64 >> >> cd /zeus/d2/users/hunter/lbrm/**kriging/batch >> >> echo 'Begin Kriging of Met Data with python and R' >> >> date >> >> python ../progs/Kriging_Batch.py $1 $2 $3 >> >> echo 'Kriging operations complete!' >> >> date >> >> >> The output file looks like this: >> >> >> [root at zeus batch]# more GaugesJob.o786 >> >> Begin Kriging of Met Data with python and R >> >> Fri Apr 27 13:44:42 CDT 2012 >> >> Kriging operations complete! >> >> Fri Apr 27 13:44:42 CDT 2012 >> >> >> >> The Kriging_Batch.py script works just fine interactively. If I run the >> import command interactively, it also works. >> >> >> eg. >> >> python -c "import rpy2.robjects as robjects" >> >> >> I am sure there is a simple explanation, and if any of you have any >> clues to lead me in the right direction, I would greatly appreciate it. >> >> >> Even in the torque environment, the other python imports are working >> properly. It only seems to be choking on the rpy2 import. 
>> >> >> This is the one portion of the script >> >> >> #-----------------------------**------------------------------** >> --------------------------- >> >> import os >> >> from os.path import normpath >> >> import sys >> >> import shutil >> >> import csv >> >> >> import rpy2.objects as robjects >> >> >> # >> >> r = robjects.r >> >> path = os.getcwd() >> >> print "curdir = %s" %path >> >> >> # >> >> oldPath = os.environ['PATH'].split(os.**pathsep) >> >> newPath = os.environ['PATH'].split(os.**pathsep) >> >> newPath = os.pathsep.join(newPath[len(**oldPath):] + >> newPath[:len(oldPath)]) >> >> >> # >> >> # Load the R functions >> >> # >> >> print "load the R functions" >> >> r.load('/zeus/d2/users/hunter/**lbrm/kriging/progs/do_krig_** >> batch.RData') >> >> >> # >> >> # Force stdout into an unbuffered mode >> >> # >> >> sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) >> >> >> >> Any advice, clues, hints, moral support, etc.... welcomed and appreciated. >> >> >> Thanks, >> >> >> Christina >> >> >> >> >> -- >> Christina A. Salls >> GLERL Computer Group >> help.glerl at noaa.gov >> Help Desk x2127 >> >> Christina.Salls at noaa.gov >> > >> Voice Mail 734-741-2446 >> >> >> -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120511/ef542894/attachment-0001.html From christina.salls at noaa.gov Fri May 11 05:27:45 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Fri, 11 May 2012 07:27:45 -0400 Subject: [torqueusers] python in qsub In-Reply-To: References: <4FAC443A.40506@noaa.gov> Message-ID: Actually, I have been thinking that I might need to add R and/or rpy2 to the provisioning of the nodes. We will try the adding the --login to the script first. Thanks again! 
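The printenv comparison mentioned in this thread can be made less error-prone by diffing the two dumps mechanically. The sketch below is only an illustration: the helper name env_diff and the file names login.env and batch.env are made up here, and it assumes every variable fits on a single line of printenv output.

```shell
# env_diff FILE_A FILE_B
# Print the lines that differ between two `printenv` dumps.
# "<" marks lines found only in FILE_A, ">" lines found only in FILE_B.
# Hypothetical helper for illustration; not part of Torque.
env_diff() {
    sorted_a=$(mktemp)
    sorted_b=$(mktemp)
    # printenv output is unordered, so sort both dumps before comparing.
    sort "$1" > "$sorted_a"
    sort "$2" > "$sorted_b"
    # grep exits non-zero when nothing differs; that is not an error here.
    diff "$sorted_a" "$sorted_b" | grep '^[<>]' || true
    rm -f "$sorted_a" "$sorted_b"
}
```

A possible use: run "printenv > login.env" in an interactive shell, add "printenv > batch.env" to the job script, then call "env_diff login.env batch.env". Differences in PATH, PYTHONPATH or LD_LIBRARY_PATH would be the first places to look when a Python module is found interactively but not in the batch job.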
On Fri, May 11, 2012 at 7:26 AM, Christina Salls wrote: > Thanks Craig! Good suggestion. > > > On Thu, May 10, 2012 at 6:42 PM, Craig Tierney wrote: > >> Christina, >> >> The environment never gets setup correctly in batch jobs (at least with >> torque and SGE) when using sh, bash, or ksh because their behavior changes >> between interactive and non-interactive use. Have the user add "--login" >> to the script call (#!/bin/sh --login). See if that sets all the paths >> correctly and resolves the problem. >> >> Craig >> >> >> >> On 5/1/12 11:49 AM, Christina Salls wrote: >> >>> Hi all, >>> >>> My configuration consists of Torque 2.5.9 running on RHEL 6 on a >>> single cluster. I have a user that is able to interactively run python >>> and R, but when his job is submitted to Torque, it fails with this >>> message: >>> >>> Traceback (most recent call last): >>> >>> File "../progs/Kriging_Batch.py", line 110, in >>> >>> import rpy2.robjects as robjects >>> >>> ImportError: No module named rpy2.robjects >>> >>> >>> I asked him to include a printenv in his script and compared it to the >>> environment variables that are inherent in his login. I could not tell >>> from the comparison what might be causing this problem. >>> >>> >>> His script looks like this: >>> >>> >>> [root at zeus batch]# more runkrig.sh >>> >>> #!/bin/sh >>> >>> #submit with qsub runkrig.sh >>> >>> #PBS -N GaugesJob >>> >>> #PBS -l select=1:ncpus=1 >>> >>> #PBS -q hydrology >>> >>> #PBS -M tim.hunter at noaa.gov >>> >>> >>> source /opt/intel/composer_xe_2011_**sp1.7.256/bin/compilervars.sh >>> intel64 >>> >>> cd /zeus/d2/users/hunter/lbrm/**kriging/batch >>> >>> echo 'Begin Kriging of Met Data with python and R' >>> >>> date >>> >>> python ../progs/Kriging_Batch.py $1 $2 $3 >>> >>> echo 'Kriging operations complete!' 
>>> >>> date >>> >>> >>> The output file looks like this: >>> >>> >>> [root at zeus batch]# more GaugesJob.o786 >>> >>> Begin Kriging of Met Data with python and R >>> >>> Fri Apr 27 13:44:42 CDT 2012 >>> >>> Kriging operations complete! >>> >>> Fri Apr 27 13:44:42 CDT 2012 >>> >>> >>> >>> The Kriging_Batch.py script works just fine interactively. If I run the >>> import command interactively, it also works. >>> >>> >>> eg. >>> >>> python -c "import rpy2.robjects as robjects" >>> >>> >>> I am sure there is a simple explanation, and if any of you have any >>> clues to lead me in the right direction, I would greatly appreciate it. >>> >>> >>> Even in the torque environment, the other python imports are working >>> properly. It only seems to be choking on the rpy2 import. >>> >>> >>> This is the one portion of the script >>> >>> >>> #-----------------------------**------------------------------** >>> --------------------------- >>> >>> import os >>> >>> from os.path import normpath >>> >>> import sys >>> >>> import shutil >>> >>> import csv >>> >>> >>> import rpy2.objects as robjects >>> >>> >>> # >>> >>> r = robjects.r >>> >>> path = os.getcwd() >>> >>> print "curdir = %s" %path >>> >>> >>> # >>> >>> oldPath = os.environ['PATH'].split(os.**pathsep) >>> >>> newPath = os.environ['PATH'].split(os.**pathsep) >>> >>> newPath = os.pathsep.join(newPath[len(**oldPath):] + >>> newPath[:len(oldPath)]) >>> >>> >>> # >>> >>> # Load the R functions >>> >>> # >>> >>> print "load the R functions" >>> >>> r.load('/zeus/d2/users/hunter/**lbrm/kriging/progs/do_krig_** >>> batch.RData') >>> >>> >>> # >>> >>> # Force stdout into an unbuffered mode >>> >>> # >>> >>> sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) >>> >>> >>> >>> Any advice, clues, hints, moral support, etc.... welcomed and >>> appreciated. >>> >>> >>> Thanks, >>> >>> >>> Christina >>> >>> >>> >>> >>> -- >>> Christina A. 
Salls >>> GLERL Computer Group >>> help.glerl at noaa.gov >>> Help Desk x2127 >>> >>> Christina.Salls at noaa.gov >>> > >>> Voice Mail 734-741-2446 >>> >>> >>> > > > -- > Christina A. Salls > GLERL Computer Group > help.glerl at noaa.gov > Help Desk x2127 > Christina.Salls at noaa.gov > Voice Mail 734-741-2446 > > > -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120511/626e5b61/attachment.html From craig.tierney at noaa.gov Fri May 11 08:54:24 2012 From: craig.tierney at noaa.gov (Craig Tierney) Date: Fri, 11 May 2012 08:54:24 -0600 Subject: [torqueusers] python in qsub In-Reply-To: References: <4FAC443A.40506@noaa.gov> Message-ID: <4FAD2820.4060306@noaa.gov> Christina, You need all the software installed on the compute nodes for this to work. Login to the node directly and try and run the batch script as "sh batch.pbs", if it doesn't work then you need fix that first. Craig On 5/11/12 5:27 AM, Christina Salls wrote: > Actually, I have been thinking that I might need to add R and/or rpy2 to > the provisioning of the nodes. We will try the adding the --login to > the script first. Thanks again! > > On Fri, May 11, 2012 at 7:26 AM, Christina Salls > > wrote: > > Thanks Craig! Good suggestion. > > > On Thu, May 10, 2012 at 6:42 PM, Craig Tierney > > wrote: > > Christina, > > The environment never gets setup correctly in batch jobs (at > least with torque and SGE) when using sh, bash, or ksh because > their behavior changes between interactive and non-interactive > use. Have the user add "--login" to the script call (#!/bin/sh > --login). See if that sets all the paths correctly and resolves > the problem. 
> > Craig > > > > On 5/1/12 11:49 AM, Christina Salls wrote: > > Hi all, > > My configuration consists of Torque 2.5.9 running on > RHEL 6 on a > single cluster. I have a user that is able to interactively > run python > and R, but when his job is submitted to Torque, it fails > with this message: > > Traceback (most recent call last): > > File "../progs/Kriging_Batch.py", line 110, in > > import rpy2.robjects as robjects > > ImportError: No module named rpy2.robjects > > > I asked him to include a printenv in his script and compared > it to the > environment variables that are inherent in his login. I > could not tell > from the comparison what might be causing this problem. > > > His script looks like this: > > > [root at zeus batch]# more runkrig.sh > > #!/bin/sh > > #submit with qsub runkrig.sh > > #PBS -N GaugesJob > > #PBS -l select=1:ncpus=1 > > #PBS -q hydrology > > #PBS -M tim.hunter at noaa.gov > > > > > source > /opt/intel/composer_xe_2011___sp1.7.256/bin/compilervars.sh > intel64 > > cd /zeus/d2/users/hunter/lbrm/__kriging/batch > > echo 'Begin Kriging of Met Data with python and R' > > date > > python ../progs/Kriging_Batch.py $1 $2 $3 > > echo 'Kriging operations complete!' > > date > > > The output file looks like this: > > > [root at zeus batch]# more GaugesJob.o786 > > Begin Kriging of Met Data with python and R > > Fri Apr 27 13:44:42 CDT 2012 > > Kriging operations complete! > > Fri Apr 27 13:44:42 CDT 2012 > > > > The Kriging_Batch.py script works just fine interactively. > If I run the > import command interactively, it also works. > > > eg. > > python -c "import rpy2.robjects as robjects" > > > I am sure there is a simple explanation, and if any of you > have any > clues to lead me in the right direction, I would greatly > appreciate it. > > > Even in the torque environment, the other python imports are > working > properly. It only seems to be choking on the rpy2 import. 
> > > This is the one portion of the script > > > #-----------------------------__------------------------------__--------------------------- > > import os > > from os.path import normpath > > import sys > > import shutil > > import csv > > > import rpy2.objects as robjects > > > # > > r = robjects.r > > path = os.getcwd() > > print "curdir = %s" %path > > > # > > oldPath = os.environ['PATH'].split(os.__pathsep) > > newPath = os.environ['PATH'].split(os.__pathsep) > > newPath = os.pathsep.join(newPath[len(__oldPath):] + > newPath[:len(oldPath)]) > > > # > > # Load the R functions > > # > > print "load the R functions" > > r.load('/zeus/d2/users/hunter/__lbrm/kriging/progs/do_krig___batch.RData') > > > # > > # Force stdout into an unbuffered mode > > # > > sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) > > > > Any advice, clues, hints, moral support, etc.... welcomed > and appreciated. > > > Thanks, > > > Christina > > > > > -- > Christina A. Salls > GLERL Computer Group > help.glerl at noaa.gov > > > Help Desk x2127 > > Christina.Salls at noaa.gov > > > Voice Mail 734-741-2446 > > > > > > -- > Christina A. Salls > GLERL Computer Group > help.glerl at noaa.gov > Help Desk x2127 > Christina.Salls at noaa.gov > Voice Mail 734-741-2446 > > > > > > -- > Christina A. Salls > GLERL Computer Group > help.glerl at noaa.gov > Help Desk x2127 > Christina.Salls at noaa.gov > Voice Mail 734-741-2446 > > From WJEdsall at dow.com Mon May 14 12:37:34 2012 From: WJEdsall at dow.com (Edsall, William (WJ)) Date: Mon, 14 May 2012 18:37:34 +0000 Subject: [torqueusers] NODESET failover possibilities Message-ID: Hello, We've been investigating the use of nodesets to failover between one type of node property and another. In the example below we would like Maui to choose infini2 if the number of resources we request are available. If not, we would like it to choose infini. Example: #PBS -W "NODESET:ONEOF:FEATURE:infini2,infini" Is this syntax acceptable? 
It seems to be choosing infini2 but will not fail over to infini.

We are also forcing a default feature on this queue so that the jobs
always go to an 'eth' tagged node. Could this be conflicting with the
nodeset?

CLASSCFG[workq] QDEF=workq PRIORITY=20000 DEFAULT.FEATURES=eth

Thanks,
Will

From mej at lbl.gov Mon May 14 15:40:08 2012
From: mej at lbl.gov (Michael Jennings)
Date: Mon, 14 May 2012 14:40:08 -0700
Subject: [torqueusers] strategies for bad nodes
In-Reply-To: <242421BFAF465844BE24EB90BB97E221090912FD@ITSDAG1D.its.iastate.edu>
References: <20120503212548.GQ9750@lbl.gov> <4FACBC84.7010604@fysik.dtu.dk> <242421BFAF465844BE24EB90BB97E221090912FD@ITSDAG1D.its.iastate.edu>
Message-ID: <20120514214007.GE5955@lbl.gov>

On Friday, 11 May 2012, at 15:01:33 (+0000),
Coyle, James J [ITACD] wrote:

> In scripts for this kind of situation (possible
> empty list), I check whether the list is empty first,
> and only execute the for when the list is non-empty.
>
> In ksh this is done in the following way:
> replace:
>
> for JOBFILE in /var/spool/torque/mom_priv/jobs/*.JB ; do
>   ...
> done
>
> with
>
> JOBLIST=`/bin/ls /var/spool/torque/mom_priv/jobs/ | grep \.JB`
> if [ -n "${JOBLIST}" ] ; then
>   for JOBFILE in "${JOBLIST} ; do
>     ...
>   done
> fi

While this approach works, one of the goals of NHC is to eliminate as
many forks/subcommands as possible. The technique you suggest will
spawn 3 subprocesses in addition to the bash which is executing the
script.

The following, however, will eliminate the unnecessary subprocesses
and accomplish the same thing.

JOBLIST=( /var/spool/torque/mom_priv/jobs/*.JB )
for JOBFILE in "${JOBLIST[@]}" ; do
    ...
done

Same effect, but much more efficient!
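The array-from-glob idiom discussed here can be tried in isolation. What follows is a minimal bash sketch, not NHC's actual code: the function name list_job_files and its directory argument are hypothetical. The -f test is included because bash, by default, leaves an unmatched pattern in place as a literal string, so an empty directory still yields one array element.

```shell
# list_job_files DIR
# Print every regular file matching DIR/*.JB, one per line.
# Requires bash (arrays and [[ ]]). No subprocesses are spawned.
# Hypothetical helper for illustration only.
list_job_files() {
    local dir=$1 jobfile
    local -a joblist
    # The shell expands the glob itself; no ls or grep is needed.
    joblist=( "$dir"/*.JB )
    for jobfile in "${joblist[@]}" ; do
        # With no matches, the array holds the pattern itself; -f rejects it.
        [[ -f "$jobfile" ]] || continue
        echo "$jobfile"
    done
}
```

Calling "list_job_files /var/spool/torque/mom_priv/jobs" prints one path per line and prints nothing when no *.JB files exist. Setting "shopt -s nullglob" before the assignment would be another way to get an empty array in that case.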
;-)

HTH,
Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From jjc at iastate.edu Mon May 14 17:18:26 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Mon, 14 May 2012 23:18:26 +0000
Subject: [torqueusers] strategies for bad nodes
In-Reply-To: <20120514214007.GE5955@lbl.gov>
References: <20120503212548.GQ9750@lbl.gov> <4FACBC84.7010604@fysik.dtu.dk> <242421BFAF465844BE24EB90BB97E221090912FD@ITSDAG1D.its.iastate.edu> <20120514214007.GE5955@lbl.gov>
Message-ID: <242421BFAF465844BE24EB90BB97E221090915EE@ITSDAG1D.its.iastate.edu>

Michael,

I had not worried about efficiency, but your solution needs to be
modified to be correct.

There are two problems:
1) It does not work (syntax error, easily fixed)
2) It does not work in the case asked about, that is when there are
   no files of the form: /var/spool/torque/mom_priv/jobs/*.JB

A corrected solution is:

JOBLIST=( /var/spool/torque/mom_priv/jobs/*.JB )
if [ "${JOBLIST}" != "/var/spool/torque/mom_priv/jobs/*.JB" ]
then
  JOBLIST=( /var/spool/torque/mom_priv/jobs/*.JB )
  for JOBFILE in "${JOBLIST[@]}"
  do
    ...
  done
fi

I agree that this is more efficient than the method in my email.
Thanks for the improved solution.

- Jim C.

Details:

1) Testing the proposed solution shows:

# JOBLIST=( /var/spool/torque/mom_priv/jobs/*.JB ) for JOBFILE in "${JOBLIST[@]}" ; do echo "a $JOBFILE" ; done
-bash: syntax error near unexpected token `do'

Adding either ; or a newline before the for corrects the syntax error.
2) When there do not exist any files of the form:
     /var/spool/torque/mom_priv/jobs/*.JB
   the command:
     JOBLIST=( /var/spool/torque/mom_priv/jobs/*.JB )
   results in JOBLIST being assigned the string
     "/var/spool/torque/mom_priv/jobs/*.JB"

Example:

# ls /var/spool/torque/mom_priv/jobs/*.JB
ls: cannot access /var/spool/torque/mom_priv/jobs/*.JB: No such file or directory
# JOBLIST=( /var/spool/torque/mom_priv/jobs/*.JB )
# echo $JOBLIST
/var/spool/torque/mom_priv/jobs/*.JB

Modifying your solution to:

JOBLIST=( /var/spool/torque/mom_priv/jobs/*.JB )
if [ "${JOBLIST}" != "/var/spool/torque/mom_priv/jobs/*.JB" ] ; then
  for JOBFILE in "${JOBLIST[@]}"
  do
    ...
  done
fi

results in a correct solution, and I agree that it is more efficient
than my solution.

>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>bounces at supercluster.org] On Behalf Of Michael Jennings
>Sent: Monday, May 14, 2012 4:40 PM
>To: torqueusers at supercluster.org
>Subject: Re: [torqueusers] strategies for bad nodes
>
>On Friday, 11 May 2012, at 15:01:33 (+0000), Coyle, James J [ITACD]
>wrote:
>
>> In scripts for this kind of situation (possible empty list), I
>> check whether the list is empty first, and only execute the for
>when
>> the list is non-empty.
>>
>> In ksh this is done in the following way:
>> replace:
>>
>> for JOBFILE in /var/spool/torque/mom_priv/jobs/*.JB ; do
>>   ...
>> done
>>
>> with
>>
>> JOBLIST=`/bin/ls /var/spool/torque/mom_priv/jobs/ | grep \.JB` if
>[ -n
>> "${JOBLIST}" ] ; then
>>   for JOBFILE in "${JOBLIST} ; do
>>     ...
>>   done
>> fi
>
>While this approach works, one of the goals of NHC is to eliminate
>as many forks/subcommands as possible. The technique you suggest
>will spawn 3 subprocesses in addition to the bash which is executing
>the script.
>
>The following, however, will eliminate the unnecessary subprocesses
>and accomplish the same thing.
>
>JOBLIST=( /var/spool/torque/mom_priv/jobs/*.JB ) for JOBFILE in
>"${JOBLIST[@]}" ; do
> ...
>done
>
>Same effect, but much more efficient! ;-)
>
>HTH,
>Michael
>
>--
>Michael Jennings
>Senior HPC Systems Engineer
>High-Performance Computing Services
>Lawrence Berkeley National Laboratory
>Bldg 50B-3209E        W: 510-495-2687
>MS 050B-3209          F: 510-486-8615
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers

From mej at lbl.gov Mon May 14 18:33:08 2012
From: mej at lbl.gov (Michael Jennings)
Date: Mon, 14 May 2012 17:33:08 -0700
Subject: [torqueusers] strategies for bad nodes
In-Reply-To: <242421BFAF465844BE24EB90BB97E221090915EE@ITSDAG1D.its.iastate.edu>
References: <20120503212548.GQ9750@lbl.gov> <4FACBC84.7010604@fysik.dtu.dk> <242421BFAF465844BE24EB90BB97E221090912FD@ITSDAG1D.its.iastate.edu> <20120514214007.GE5955@lbl.gov> <242421BFAF465844BE24EB90BB97E221090915EE@ITSDAG1D.its.iastate.edu>
Message-ID:

On May 14, 2012 7:19 PM, "Coyle, James J [ITACD]" wrote:

> I had not worried about efficiency, but your solution needs to be modified
> to be correct.
>
> There are two problems:
> 1) It does not work (syntax error, easily fixed)

There is no syntax error. I tested it successfully. :-)

> 2) It does not work in the case asked about, that is when there are
> no files of the form: /var/spool/torque/mom_priv/jobs/*.JB

It does when you note that the NHC code in question (which I replaced
with "...") already checks for the existence of the file as the first
part of the loop. :-)

> 1)Testing the proposed solution shows:
>
> # JOBLIST=( /var/spool/torque/mom_priv/jobs/*.JB ) for JOBFILE in "${JOBLIST[@]}" ; do echo "a $JOBFILE" ; done
> -bash: syntax error near unexpected token `do'

This is not what was in my e-mail. I'm not sure what happened to the
newlines in your mail reader (or perhaps the ML software?), but the
e-mail I sent had newlines after the ), the do, the ellipsis, and the
done.
:-)

> 2) When there do not exist any files of the form:
> /var/spool/torque/mom_priv/jobs/*.JB
> the command:
> JOBLIST=( /var/spool/torque/mom_priv/jobs/*.JB )
> results in JOBLIST being assigned the string
> "/var/spool/torque/mom_priv/jobs/*.JB"

See above. ;-)

Michael

Sent from my Droid

From mej at lbl.gov Tue May 15 08:28:27 2012
From: mej at lbl.gov (Michael Jennings)
Date: Tue, 15 May 2012 07:28:27 -0700
Subject: [torqueusers] strategies for bad nodes
In-Reply-To: <1336754306.51661.YahooMailClassic@web111310.mail.gq1.yahoo.com>
References: <20120511162655.GF9879@lbl.gov> <1336754306.51661.YahooMailClassic@web111310.mail.gq1.yahoo.com>
Message-ID: <20120515142826.GE2331@lbl.gov>

On Friday, 11 May 2012, at 09:38:26 (-0700),
Grigory Shamov wrote:

> I've increased the $TIMEOUT to 9 sec. It seem to work, but perhaps
> interacted weird with Torque; server might be having problems
> interacting when modes when.
>
> What are the typical settings for Torque you are using on the
> clusters running NHC? (And, which settings might be relevant?).
>
> Like, 'server node_check_rate' and 'server tcp_timeout' ?

We use "5,jobstart,jobend" as our value for $node_check_interval. Is
that what you're asking, or did I misunderstand?

One thing we haven't tried but that may be of interest to some sites
is to use a higher interval, or none at all, and rely primarily/solely
on the pre-/post-job check to offline the node at the most vulnerable
timepoints. I'm very interested in feedback from any sites that
attempt this. We are not currently able to test this ourselves at
scale but are considering options for our next major systems upgrade
this summer.
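For readers unsure where that value lives: it belongs in the pbs_mom configuration file, next to the directive naming the health check script. The sketch below only prints such lines. The helper name mom_health_config and the /usr/sbin/nhc path are assumptions for illustration, so adjust both to the local install.

```shell
# mom_health_config SCRIPT INTERVAL
# Print pbs_mom config lines that run a health check script.
#   SCRIPT:   path to the health check script (site-specific assumption)
#   INTERVAL: value for $node_check_interval, e.g. 5,jobstart,jobend
# Hypothetical helper, shown only to illustrate the two directives.
mom_health_config() {
    # Single quotes keep the leading "$" literal, as pbs_mom expects.
    printf '$node_check_script %s\n' "$1"
    printf '$node_check_interval %s\n' "$2"
}

mom_health_config /usr/sbin/nhc 5,jobstart,jobend
```

The printed lines would go into mom_priv/config on each compute node, followed by a pbs_mom restart. As far as I understand the Torque documentation, the number counts MOM polling intervals, while jobstart and jobend add a check before and after each job.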
Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From jnielsen at hudsonalpha.com Tue May 15 09:12:42 2012
From: jnielsen at hudsonalpha.com (Josh Nielsen)
Date: Tue, 15 May 2012 10:12:42 -0500
Subject: [torqueusers] Jobs limited to one per node
Message-ID:

Hello,

I noticed recently on our Torque cluster (3.0.2) that it is only
allowing one job per node and it is only assigning one CPU core for
each job even though there are eight per node (so it is not maxing out
the resources - and is wasting/not utilizing seven cores per node).
After looking around for a while I found a comment elsewhere on this
mailing list about compiling torque with the --enable-cpuset flag. I
read the Torque page about cpusets but am none the wiser about whether
that is a required feature to allow, what I would have thought to be
default functionality of allowing, more than one process/job to run on
a node (and to utilize more than one core per job).

If I specify any npp=* value with qsub, even if only one (like echo
"sleep 60; echo test" | qsub -l nodes=1:npp=1), I get the message
"qsub: Job exceeds queue resource limits MSG=cannot locate feasible
nodes". And during the course of scheduling jobs, once there are more
jobs requested than there are nodes then they are listed as queued and
in the sched_log/ log files I see "Not enough of the right type of
nodes available" for each new request. I also tried adding np=8 to
each of the nodes listed in server_priv/nodes since I had not before,
but it did not change anything.

I will post my Torque config below, but I'm curious to know if
--enable-cpuset is what I need, since it is not made explicit that it
is a required flag to allow more than one job to run per node.
Setting the default and max settings was my attempt to get this
working, although we have another cluster that doesn't specify any of
that and it runs as expected by reserving the amount of cpus per node
that you request with npp.

qmgr -c "print server"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_max.ncpus = 8
set queue batch resources_max.nodes = 2
set queue batch resources_min.ncpus = 1
set queue batch resources_default.ncpus = 1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 32:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = penguin-head01.compute.haib.org
set server managers = root at penguin-head01.compute.haib.org
set server operators = root at penguin-head01.compute.haib.org
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 554
------------------------

qmgr -c "list server"
Server penguin-head01.compute.haib.org
    server_state = Active
    scheduling = True
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
    acl_hosts = penguin-head01.compute.haib.org
    managers = root at penguin-head01.compute.haib.org
    operators = root at penguin-head01.compute.haib.org
    default_queue = batch
    log_events = 511
    mail_from = adm
    resources_assigned.ncpus = 0
    resources_assigned.nodect = 0
    scheduler_iteration = 600
    node_check_rate = 150
    tcp_timeout = 6
    mom_job_sync = True
    pbs_version = 3.0.2
    keep_completed = 300
    next_job_number = 554
    net_counter = 2 0 0

Any suggestions would be appreciated!
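As a reply later in this thread suggests, the per-node processor count in a Torque resource request is spelled ppn, not npp. A minimal sketch of a corrected test submission follows. The function names nodes_spec and submit_test_job are invented for illustration, and nothing is submitted unless submit_test_job is called explicitly on a host where qsub is installed.

```shell
# nodes_spec NODES PPN
# Build a Torque resource string of the form nodes=<NODES>:ppn=<PPN>.
# Hypothetical helper for illustration only.
nodes_spec() {
    printf 'nodes=%s:ppn=%s\n' "$1" "$2"
}

# submit_test_job NODES PPN
# Submit the trivial test job from the message above with the
# corrected keyword. Requires qsub on PATH; not run automatically.
submit_test_job() {
    echo "sleep 60; echo test" | qsub -l "$(nodes_spec "$1" "$2")"
}
```

For example, "submit_test_job 1 8" would request all eight cores of one node, which should only be schedulable if the nodes file lists np=8 for that node.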
Thanks,
Josh

From download at hpc.unm.edu Tue May 15 09:41:03 2012
From: download at hpc.unm.edu (Jim Prewett)
Date: Tue, 15 May 2012 09:41:03 -0600 (MDT)
Subject: [torqueusers] Jobs limited to one per node
In-Reply-To:
References:
Message-ID:

Hello,

What does your server_priv/nodes file look like? My /guess/ is that
your problem lies there.

That file should contain a list of nodes followed by the number of
processors in the node, e.g:

compute001 np=8
compute002 np=8
...

HTH,
Jim

James E. Prewett                 Jim at Prewett.org download at hpc.unm.edu
Systems Team Leader              LoGS: http://www.hpc.unm.edu/~download/LoGS/
Designated Security Officer      OpenPGP key: pub 1024D/31816D93
HPC Systems Engineer III         UNM HPC 505.277.8210

On Tue, 15 May 2012, Josh Nielsen wrote:

> Hello,
>
> I noticed recently on our Torque cluster (3.0.2) that it is only allowing
> one job per node and it is only assigning one CPU core for each job even
> though there are eight per node (so it is not maxing out the resources -
> and is wasting/not utilizing seven cores per node). After looking around
> for a while I found a comment elsewhere on this mailing list about
> compiling torque with the --enable-cpuset flag. I read the Torque page
> about cpusets but am none the wiser about whether that is a required
> feature to allow, what I would have thought to be default functionality of
> allowing, more than one process/job to run on a node (and to utilize more
> than one core per job).
>
> If I specify any npp=* value with qsub, even if only one (like echo "sleep
> 60; echo test" | qsub -l nodes=1:npp=1), I get the message "qsub: Job
> exceeds queue resource limits MSG=cannot locate feasible nodes".
And during > the course of scheduling jobs, once there are more jobs requested than > there are nodes then they are listed as queued and in the sched_log/ log > files I see "Not enough of the right type of nodes available" for each new > request. I also tried adding np=8 to each of the nodes listed in > server_priv/nodes since I had not before, but it did not change anything. > > I will post my Torque config below, but I'm curious to know if > --enable-cpuset is what I need, since it is not made explicit that it is a > required flag to allow more than one job to run per node. Setting the > default and max settings was my attempt to get this working, although we > have another cluster that doesn't specify any of that and it runs as > expected by reserving the amount of cpus per node that you request with npp. > > qmgr -c "print server" > # > # Create queues and set their attributes. > # > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_max.ncpus = 8 > set queue batch resources_max.nodes = 2 > set queue batch resources_min.ncpus = 1 > set queue batch resources_default.ncpus = 1 > set queue batch resources_default.nodect = 1 > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 32:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Set server attributes. 
> # > set server scheduling = True > set server acl_hosts = penguin-head01.compute.haib.org > set server managers = root at penguin-head01.compute.haib.org > set server operators = root at penguin-head01.compute.haib.org > set server default_queue = batch > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server mom_job_sync = True > set server keep_completed = 300 > set server next_job_number = 554 > ------------------------ > > qmgr -c "list server" > Server penguin-head01.compute.haib.org > server_state = Active > scheduling = True > total_jobs = 0 > state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 > acl_hosts = penguin-head01.compute.haib.org > managers = root at penguin-head01.compute.haib.org > operators = root at penguin-head01.compute.haib.org > default_queue = batch > log_events = 511 > mail_from = adm > resources_assigned.ncpus = 0 > resources_assigned.nodect = 0 > scheduler_iteration = 600 > node_check_rate = 150 > tcp_timeout = 6 > mom_job_sync = True > pbs_version = 3.0.2 > keep_completed = 300 > next_job_number = 554 > net_counter = 2 0 0 > > > Any suggestions would be appreciated! > > Thanks, > Josh > From mej at lbl.gov Tue May 15 09:54:00 2012 From: mej at lbl.gov (Michael Jennings) Date: Tue, 15 May 2012 08:54:00 -0700 Subject: [torqueusers] strategies for bad nodes In-Reply-To: <4FAD6E45.60609@fysik.dtu.dk> References: <20120503212548.GQ9750@lbl.gov> <4FACBC84.7010604@fysik.dtu.dk> <20120511162655.GF9879@lbl.gov> <4FAD6E45.60609@fysik.dtu.dk> Message-ID: <20120515155359.GA3770@lbl.gov> On Friday, 11 May 2012, at 21:53:41 (+0200), Ole Holm Nielsen wrote: > Perhaps, but the attachment seems to be missing. Is the patch in the SVN now? Wow, I'm on a roll, aren't I? :-( Hopefully I won't forget to attach the file again this time. > Sounds really good. 
A simple question: Could you specify explicitly
> on the NHC Wiki page exactly the command used to download NHC SVN
> versions like 1.2? I'm not so familiar with Trac, so maybe I'm
> overlooking something obvious.

Sure, it's here:

svn co https://warewulf.lbl.gov/svn/trunk/nhc

HTH,
Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615
-------------- next part --------------
Index: scripts/ww_job.nhc
===================================================================
--- scripts/ww_job.nhc (revision 948)
+++ scripts/ww_job.nhc (working copy)
@@ -15,10 +15,12 @@
     local JOBFILE
     local JOBUSER
     local LINE
+    local -a JOBLIST
 
     JOBUSERS=( root nobody )
-    for JOBFILE in $JOBFILE_PATH/*.JB ; do
-        if [[ ! -f $JOBFILE ]]; then
+    JOBLIST=( $JOBFILE_PATH/*.JB )
+    for JOBFILE in "${JOBLIST[@]}" ; do
+        if [[ ! -f "$JOBFILE" ]]; then
             continue
         fi
         while read LINE ; do
@@ -30,7 +32,7 @@
                 JOBUSERS[${#JOBUSERS[*]}]="$JOBUSER"
             fi
             break
-        done < $JOBFILE
+        done < "$JOBFILE"
     done
     dbg "Authorized users are: ${JOBUSERS[*]}"
 }

From jnielsen at hudsonalpha.com Tue May 15 09:54:08 2012
From: jnielsen at hudsonalpha.com (Josh Nielsen)
Date: Tue, 15 May 2012 10:54:08 -0500
Subject: [torqueusers] Jobs limited to one per node
In-Reply-To:
References:
Message-ID:

Hi Jim,

Thanks for responding. Yeah, as I mentioned I recently added the np=
arguments to that file because I noticed that they were not set
before. My file looks like this now:

mendel01 np=8
mendel02 np=8
mendel03 np=8
mendel04 np=8
mendel05 np=8

And I restarted the pbs_server and pbs_sched daemons after I made that
change. Any other ideas?

-Josh

>Hello,
>What does your server_priv/nodes file look like? My /guess/ is that your
>problem lies there.
>That file should contain a list of nodes followed by the number of
>processors in the node, e.g:
>compute001 np=8
>compute002 np=8
>...
>HTH,
>Jim
>James E.
Prewett Jim at Prewett.org download at hpc.unm.edu >Systems Team Leader LoGS: http://www.hpc.unm.edu/~download/LoGS/ >Designated Security Officer OpenPGP key: pub 1024D/31816D93 >HPC Systems Engineer III UNM HPC 505.277.8210 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120515/1b5c136c/attachment.html From gus at ldeo.columbia.edu Tue May 15 10:04:25 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 15 May 2012 12:04:25 -0400 Subject: [torqueusers] Jobs limited to one per node In-Reply-To: References: Message-ID: <4FB27E89.40902@ldeo.columbia.edu> On 05/15/2012 11:12 AM, Josh Nielsen wrote: > Hello, > > I noticed recently on our Torque cluster (3.0.2) that it is only > allowing one job per node and it is only assigning one CPU core for each > job even though there are eight per node (so it is not maxing out the > resources - and is wasting/not utilizing seven cores per node). After > looking around for a while I found a comment elsewhere on this mailing > list about compiling torque with the --enable-cpuset flag. I read the > Torque page about cpusets but am none the wiser about whether that is a > required feature to allow, what I would have thought to be default > functionality of allowing, more than one process/job to run on a node > (and to utilize more than one core per job). > > If I specify any npp=* value with qsub, even if only one (like echo > "sleep 60; echo test" | qsub -l nodes=1:npp=1), Hi Josh Is the above a typo in your email or in your qsub command? ['npp' instead of 'ppn'] I guess the syntax should be: qsub -l nodes=1:ppn=8 I hope this helps, Gus Correa I get the message "qsub: > Job exceeds queue resource limits MSG=cannot locate feasible nodes". 
And > during the course of scheduling jobs, once there are more jobs requested > than there are nodes then they are listed as queued and in the > sched_log/ log files I see "Not enough of the right type of nodes > available" for each new request. I also tried adding np=8 to each of the > nodes listed in server_priv/nodes since I had not before, but it did not > change anything. > > I will post my Torque config below, but I'm curious to know if > --enable-cpuset is what I need, since it is not made explicit that it is > a required flag to allow more than one job to run per node. Setting the > default and max settings was my attempt to get this working, although we > have another cluster that doesn't specify any of that and it runs as > expected by reserving the amount of cpus per node that you request with npp. > > qmgr -c "print server" > # > # Create queues and set their attributes. > # > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_max.ncpus = 8 > set queue batch resources_max.nodes = 2 > set queue batch resources_min.ncpus = 1 > set queue batch resources_default.ncpus = 1 > set queue batch resources_default.nodect = 1 > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 32:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Set server attributes. 
> # > set server scheduling = True > set server acl_hosts = penguin-head01.compute.haib.org > > set server managers = root at penguin-head01.compute.haib.org > > set server operators = root at penguin-head01.compute.haib.org > > set server default_queue = batch > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server mom_job_sync = True > set server keep_completed = 300 > set server next_job_number = 554 > ------------------------ > > qmgr -c "list server" > Server penguin-head01.compute.haib.org > > server_state = Active > scheduling = True > total_jobs = 0 > state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 > acl_hosts = penguin-head01.compute.haib.org > > managers = root at penguin-head01.compute.haib.org > > operators = root at penguin-head01.compute.haib.org > > default_queue = batch > log_events = 511 > mail_from = adm > resources_assigned.ncpus = 0 > resources_assigned.nodect = 0 > scheduler_iteration = 600 > node_check_rate = 150 > tcp_timeout = 6 > mom_job_sync = True > pbs_version = 3.0.2 > keep_completed = 300 > next_job_number = 554 > net_counter = 2 0 0 > > > Any suggestions would be appreciated! > > Thanks, > Josh > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From jnielsen at hudsonalpha.com Tue May 15 10:16:59 2012 From: jnielsen at hudsonalpha.com (Josh Nielsen) Date: Tue, 15 May 2012 11:16:59 -0500 Subject: [torqueusers] Jobs limited to one per node In-Reply-To: <4FB27E89.40902@ldeo.columbia.edu> References: <4FB27E89.40902@ldeo.columbia.edu> Message-ID: Ah, yes that was a typo on the command line. It works with ppn=1 and ppn=8 (which is confusing because you use 'np' in the server_priv/nodes file but ppn on the command line). 
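For anyone else tripped up by the two spellings, here is a minimal sketch of how they relate, assuming the five eight-core mendel nodes from this thread. The nodes file is written to a temporary path rather than the real server_priv/nodes, and job.sh is a hypothetical script name; the qsub line is only printed, not run.

```shell
#!/bin/sh
# Sketch only: writes to a temp file, not the real server_priv/nodes.
nodes_file="$(mktemp)"

# Server side: "np" tells pbs_server how many processors each node offers.
for n in 01 02 03 04 05; do
    echo "mendel$n np=8" >> "$nodes_file"
done
cat "$nodes_file"

# Submit side: "ppn" is how many of those processors a job asks for per node.
ppn=8
resource_request="nodes=1:ppn=$ppn"
# job.sh is a hypothetical job script; the command is printed, not run.
echo "qsub -l $resource_request job.sh"
```

As noted earlier in the thread, pbs_server was restarted after the real nodes file was edited.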
However we have software that submits our jobs in an automated fashion to the cluster. I just checked it again and it seems to be working now, I have seven jobs actively running across my five nodes. The only thing I changed this morning was to remove the the "max_running = 20" command from the queue. In any case it seems to be working now, although I thought I had tested everything yesterday and it was still not working. Odd how that happens sometimes. Thanks for the suggestions though! -Josh On Tue, May 15, 2012 at 11:04 AM, Gus Correa wrote: > On 05/15/2012 11:12 AM, Josh Nielsen wrote: > > Hello, > > > > I noticed recently on our Torque cluster (3.0.2) that it is only > > allowing one job per node and it is only assigning one CPU core for each > > job even though there are eight per node (so it is not maxing out the > > resources - and is wasting/not utilizing seven cores per node). After > > looking around for a while I found a comment elsewhere on this mailing > > list about compiling torque with the --enable-cpuset flag. I read the > > Torque page about cpusets but am none the wiser about whether that is a > > required feature to allow, what I would have thought to be default > > functionality of allowing, more than one process/job to run on a node > > (and to utilize more than one core per job). > > > > If I specify any npp=* value with qsub, even if only one (like echo > > "sleep 60; echo test" | qsub -l nodes=1:npp=1), > Hi Josh > > Is the above a typo in your email or in your qsub command? > ['npp' instead of 'ppn'] > I guess the syntax should be: > > qsub -l nodes=1:ppn=8 > > I hope this helps, > Gus Correa > > > I get the message "qsub: > > Job exceeds queue resource limits MSG=cannot locate feasible nodes". 
And > > during the course of scheduling jobs, once there are more jobs requested > > than there are nodes then they are listed as queued and in the > > sched_log/ log files I see "Not enough of the right type of nodes > > available" for each new request. I also tried adding np=8 to each of the > > nodes listed in server_priv/nodes since I had not before, but it did not > > change anything. > > > > I will post my Torque config below, but I'm curious to know if > > --enable-cpuset is what I need, since it is not made explicit that it is > > a required flag to allow more than one job to run per node. Setting the > > default and max settings was my attempt to get this working, although we > > have another cluster that doesn't specify any of that and it runs as > > expected by reserving the amount of cpus per node that you request with > npp. > > > > qmgr -c "print server" > > # > > # Create queues and set their attributes. > > # > > # > > # Create and define queue batch > > # > > create queue batch > > set queue batch queue_type = Execution > > set queue batch resources_max.ncpus = 8 > > set queue batch resources_max.nodes = 2 > > set queue batch resources_min.ncpus = 1 > > set queue batch resources_default.ncpus = 1 > > set queue batch resources_default.nodect = 1 > > set queue batch resources_default.nodes = 1 > > set queue batch resources_default.walltime = 32:00:00 > > set queue batch enabled = True > > set queue batch started = True > > # > > # Set server attributes. 
> > # > > set server scheduling = True > > set server acl_hosts = penguin-head01.compute.haib.org > > > > set server managers = root at penguin-head01.compute.haib.org > > > > set server operators = root at penguin-head01.compute.haib.org > > > > set server default_queue = batch > > set server log_events = 511 > > set server mail_from = adm > > set server scheduler_iteration = 600 > > set server node_check_rate = 150 > > set server tcp_timeout = 6 > > set server mom_job_sync = True > > set server keep_completed = 300 > > set server next_job_number = 554 > > ------------------------ > > > > qmgr -c "list server" > > Server penguin-head01.compute.haib.org > > > > server_state = Active > > scheduling = True > > total_jobs = 0 > > state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 > > acl_hosts = penguin-head01.compute.haib.org > > > > managers = root at penguin-head01.compute.haib.org > > > > operators = root at penguin-head01.compute.haib.org > > > > default_queue = batch > > log_events = 511 > > mail_from = adm > > resources_assigned.ncpus = 0 > > resources_assigned.nodect = 0 > > scheduler_iteration = 600 > > node_check_rate = 150 > > tcp_timeout = 6 > > mom_job_sync = True > > pbs_version = 3.0.2 > > keep_completed = 300 > > next_job_number = 554 > > net_counter = 2 0 0 > > > > > > Any suggestions would be appreciated! > > > > Thanks, > > Josh > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120515/12c030ab/attachment-0001.html From jfarran at uci.edu Wed May 16 13:09:22 2012 From: jfarran at uci.edu (Joseph Farran) Date: Wed, 16 May 2012 12:09:22 -0700 Subject: [torqueusers] Schedulers Message-ID: <4FB3FB62.2060907@uci.edu> Hello. Are the only 3 schedulers available that work with Torque resource manager? Maui ( free but not very well supported ) Moab ( paid version ) Torque PBS mom ( free but simple scheduler ) Are there any other free schedulers that work with Torque, or are these the only 3? Joseph From jjc at iastate.edu Wed May 16 13:35:13 2012 From: jjc at iastate.edu (Coyle, James J [ITACD]) Date: Wed, 16 May 2012 19:35:13 +0000 Subject: [torqueusers] Schedulers In-Reply-To: <4FB3FB62.2060907@uci.edu> References: <4FB3FB62.2060907@uci.edu> Message-ID: <242421BFAF465844BE24EB90BB97E22109091993@ITSDAG1D.its.iastate.edu> Clarification: The scheduler you mean to say is Torque pbs_sched not pbs_mom pbs_mom runs on the compute nodes, interacting with pbs_server and whatever scheduler you are running. >-----Original Message----- >From: torqueusers-bounces at supercluster.org [mailto:torqueusers- >bounces at supercluster.org] On Behalf Of Joseph Farran >Sent: Wednesday, May 16, 2012 2:09 PM >To: torqueusers at supercluster.org >Subject: [torqueusers] Schedulers > >Hello. > >Are the only 3 schedulers available that work with Torque resource >manager? > > Maui ( free but not very well supported ) > Moab ( paid version ) > Torque PBS mom ( free but simple scheduler ) > >Are there any other free schedulers that work with Torque, or are >these the only 3? 
> >Joseph > >_______________________________________________ >torqueusers mailing list >torqueusers at supercluster.org >http://www.supercluster.org/mailman/listinfo/torqueusers From jfarran at uci.edu Wed May 16 13:52:27 2012 From: jfarran at uci.edu (Joseph Farran) Date: Wed, 16 May 2012 12:52:27 -0700 Subject: [torqueusers] Schedulers In-Reply-To: <242421BFAF465844BE24EB90BB97E22109091993@ITSDAG1D.its.iastate.edu> References: <4FB3FB62.2060907@uci.edu> <242421BFAF465844BE24EB90BB97E22109091993@ITSDAG1D.its.iastate.edu> Message-ID: <4FB4057B.3000509@uci.edu> Yes - thank you for the correction. The simple PBS scheduler "pbs_sched". So are these the only 3 schedulers that work with Torque? On 05/16/2012 12:35 PM, Coyle, James J [ITACD] wrote: > Clarification: > > The scheduler you mean to say is > > Torque pbs_sched not pbs_mom > > pbs_mom runs on the compute nodes, interacting with > pbs_server and whatever scheduler you are running. > > > >> -----Original Message----- >> From: torqueusers-bounces at supercluster.org [mailto:torqueusers- >> bounces at supercluster.org] On Behalf Of Joseph Farran >> Sent: Wednesday, May 16, 2012 2:09 PM >> To: torqueusers at supercluster.org >> Subject: [torqueusers] Schedulers >> >> Hello. >> >> Are the only 3 schedulers available that work with Torque resource >> manager? >> >> Maui ( free but not very well supported ) >> Moab ( paid version ) >> Torque PBS mom ( free but simple scheduler ) >> >> Are there any other free schedulers that work with Torque, or are >> these the only 3? 
>> >> Joseph >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > From tbaer at utk.edu Wed May 16 13:56:50 2012 From: tbaer at utk.edu (Troy Baer) Date: Wed, 16 May 2012 15:56:50 -0400 Subject: [torqueusers] Schedulers In-Reply-To: <4FB3FB62.2060907@uci.edu> References: <4FB3FB62.2060907@uci.edu> Message-ID: <1337198210.12896.418.camel@browncoat.jics.utk.edu> On Wed, 2012-05-16 at 12:09 -0700, Joseph Farran wrote: > Are the only 3 schedulers available that work with Torque resource > manager? > > Maui ( free but not very well supported ) > Moab ( paid version ) > Torque PBS mom ( free but simple scheduler ) > > Are there any other free schedulers that work with Torque, or are > these the only 3? There are a couple others I know of: * Catalina -- http://www.sdsc.edu/catalina/ * PluS -- http://www.g-lambda.net/plus/ There is also the option to write your own, either in the PBS BASL language or in your language of choice (C, Perl, Python, etc.) However, Moab and Maui are probably the most widely used IMHO. --Troy -- Troy Baer, Senior HPC System Administrator National Institute for Computational Sciences, University of Tennessee http://www.nics.tennessee.edu/ Phone: 865-241-4233 From knielson at adaptivecomputing.com Wed May 16 14:26:24 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Wed, 16 May 2012 14:26:24 -0600 Subject: [torqueusers] nodes file persistent gpus setting In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102DCC742DB@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C620102DCC742DB@exvic-mbx04.nexus.csiro.au> Message-ID: On Sun, Apr 1, 2012 at 7:36 PM, wrote: > Hi, > > Can anyone confirm the following behavior (bug)? 
> > If you give a node gpus like so: > qmgr -c 'set node gpunode01 gpus = 2' > or in the nodes file > gpunode01 np=12 gpus=2 > Then the node has (logical) gpus defined and they can be scheduled as in: > > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.5nodeconfig.php > (though 1.5.3 doesn't mention specifying both np= and gpus= which I > suspect needs fixing). > > This setup works fine for us until we restart the pbs_server at which time > the gpus disappear (you can see this in the output of pbsnodes). The nodes > file gets altered to remove the gpus= setting. > > Note that we are using version 3.0.3-snap.xxx and NOT the integrated > nvidia gpu support. > > Does anyone else see the behavior? You don't need physical gpus to test, > just a system you are prepared to mess with a little including restarting > the pbs_server. > > Regards, > > Gareth > Gareth, Have you entered a ticket in bugzilla for this. Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120516/c14ed855/attachment.html From sean.reilly at ersa.edu.au Wed May 16 20:21:18 2012 From: sean.reilly at ersa.edu.au (Sean Reilly) Date: Thu, 17 May 2012 11:51:18 +0930 Subject: [torqueusers] nodes file persistent gpus setting In-Reply-To: References: <007DECE986B47F4EABF823C1FBB19C620102DCC742DB@exvic-mbx04.nexus.csiro.au> Message-ID: <4FB4609E.8080807@ersa.edu.au> Hi Gareth We saw the same behaviour when we enabled the tdk-1.285 libraries on the GPU backend Nodes in the ld.config path. - It is needed on the CPU (non-gpu) Nodes - But when added to the PATH on the GPU Nodes - the PBS_MOM complains about something missing (*Sorry I cant remember what it is - but it may have been some nvidia or nvc nvq type library*) - Then the PBS_MOM rewrites the nodes file on the server side. 
*removing the gpus= or truncating the line from where 'gpus=' is written* this was fixed by commenting out these libs on the GPU backend Node. /etc/ld.so.conf.d/tdk.conf #This file was made by puppet, do not edit it directly! #/opt/shared/tdk/1.285/lib64 #/opt/shared/tdk/1.285/lib Regards Sean On 17/05/12 05:56, Ken Nielson wrote: > On Sun, Apr 1, 2012 at 7:36 PM, > wrote: > > Hi, > > Can anyone confirm the following behavior (bug)? > > If you give a node gpus like so: > qmgr -c 'set node gpunode01 gpus = 2' > or in the nodes file > gpunode01 np=12 gpus=2 > Then the node has (logical) gpus defined and they can be scheduled > as in: > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.5nodeconfig.php > (though 1.5.3 doesn't mention specifying both np= and gpus= which > I suspect needs fixing). > > This setup works fine for us until we restart the pbs_server at > which time the gpus disappear (you can see this in the output of > pbsnodes). The nodes file gets altered to remove the gpus= setting. > > Note that we are using version 3.0.3-snap.xxx and NOT the > integrated nvidia gpu support. > > Does anyone else see the behavior? You don't need physical gpus > to test, just a system you are prepared to mess with a little > including restarting the pbs_server. > > Regards, > > Gareth > > > Gareth, > > Have you entered a ticket in bugzilla for this. > > Ken > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- *Sean Reilly* Systems Administrator & Applications Support Officer eResearchSA Phone : +61 8 8313 8352 Mobile: +61 450 840 246 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120517/4657cabd/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Email-moved.png Type: image/png Size: 10004 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120517/4657cabd/attachment-0001.png From Gareth.Williams at csiro.au Wed May 16 23:09:50 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Thu, 17 May 2012 15:09:50 +1000 Subject: [torqueusers] nodes file persistent gpus setting In-Reply-To: References: <007DECE986B47F4EABF823C1FBB19C620102DCC742DB@exvic-mbx04.nexus.csiro.au> Message-ID: <007DECE986B47F4EABF823C1FBB19C62010503312B38@exvic-mbx04.nexus.csiro.au> No I did not make a ticket. I've not had an environment to reproduce the issue in a controlled way so was hoping to get some independent confirmation. I couldn't reproduce it on a stand-alone single host system. It is still an a issue for us but not a show-stopper. Gareth Ps. Grr html mail - probably my fault for including link in the first place without making sure to suppress html. From: Ken Nielson [mailto:knielson at adaptivecomputing.com] Sent: Thursday, 17 May 2012 6:26 AM To: Torque Users Mailing List Subject: Re: [torqueusers] nodes file persistent gpus setting On Sun, Apr 1, 2012 at 7:36 PM, > wrote: Hi, Can anyone confirm the following behavior (bug)? If you give a node gpus like so: qmgr -c 'set node gpunode01 gpus = 2' or in the nodes file gpunode01 np=12 gpus=2 Then the node has (logical) gpus defined and they can be scheduled as in: http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.5nodeconfig.php (though 1.5.3 doesn't mention specifying both np= and gpus= which I suspect needs fixing). This setup works fine for us until we restart the pbs_server at which time the gpus disappear (you can see this in the output of pbsnodes). The nodes file gets altered to remove the gpus= setting. Note that we are using version 3.0.3-snap.xxx and NOT the integrated nvidia gpu support. Does anyone else see the behavior? 
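One possible way to check for the behaviour described, sketched under assumptions: the two files below are stand-ins created in a temporary location, whereas on a real system they would be copies of server_priv/nodes taken before and after restarting pbs_server. count_gpu_entries is a hypothetical helper, not a Torque command.

```shell
#!/bin/sh
# Sketch: detect whether gpus= settings vanish from a nodes file.

# Count node entries that carry a gpus= setting (prints 0 if there are none).
count_gpu_entries() {
    grep -c 'gpus=' "$1"
}

# Stand-in files; really these would be copies of server_priv/nodes
# taken before and after the pbs_server restart.
before="$(mktemp)"
after="$(mktemp)"
echo "gpunode01 np=12 gpus=2" > "$before"
echo "gpunode01 np=12" > "$after"   # what the report describes after a restart

if [ "$(count_gpu_entries "$before")" -ne "$(count_gpu_entries "$after")" ]; then
    echo "gpus= entries changed across the restart:"
    diff "$before" "$after" || true   # diff exits non-zero when files differ
fi
```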
You don't need physical gpus to test, just a system you are prepared to mess with a little including restarting the pbs_server. Regards, Gareth Gareth, Have you entered a ticket in bugzilla for this. Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120517/304f66d8/attachment.html From Gareth.Williams at csiro.au Wed May 16 23:50:09 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Thu, 17 May 2012 15:50:09 +1000 Subject: [torqueusers] nodes file persistent gpus setting In-Reply-To: <4FB4609E.8080807@ersa.edu.au> References: <007DECE986B47F4EABF823C1FBB19C620102DCC742DB@exvic-mbx04.nexus.csiro.au> <4FB4609E.8080807@ersa.edu.au> Message-ID: <007DECE986B47F4EABF823C1FBB19C62010503312B3A@exvic-mbx04.nexus.csiro.au> HI Sean, Woah - we are _not_ using the integrated nvidia gpu support (so far anyway). Perhaps that wasn't actually the problem on your system - are you really sure that solved the problem and was not just a coincidence? We have nvidia drivers (on that compute node) but no other nvidia software on this system. Gareth From: Sean Reilly [mailto:sean.reilly at ersa.edu.au] Sent: Thursday, 17 May 2012 12:21 PM To: Torque Users Mailing List Subject: Re: [torqueusers] nodes file persistent gpus setting Hi Gareth We saw the same behaviour when we enabled the tdk-1.285 libraries on the GPU backend Nodes in the ld.config path. - It is needed on the CPU (non-gpu) Nodes - But when added to the PATH on the GPU Nodes - the PBS_MOM complains about something missing (*Sorry I cant remember what it is - but it may have been some nvidia or nvc nvq type library*) - Then the PBS_MOM rewrites the nodes file on the server side. *removing the gpus= or truncating the line from where 'gpus=' is written* this was fixed by commenting out these libs on the GPU backend Node. /etc/ld.so.conf.d/tdk.conf #This file was made by puppet, do not edit it directly! 
#/opt/shared/tdk/1.285/lib64 #/opt/shared/tdk/1.285/lib Regards Sean On 17/05/12 05:56, Ken Nielson wrote: On Sun, Apr 1, 2012 at 7:36 PM, > wrote: Hi, Can anyone confirm the following behavior (bug)? If you give a node gpus like so: qmgr -c 'set node gpunode01 gpus = 2' or in the nodes file gpunode01 np=12 gpus=2 Then the node has (logical) gpus defined and they can be scheduled as in: http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/1.5nodeconfig.php (though 1.5.3 doesn't mention specifying both np= and gpus= which I suspect needs fixing). This setup works fine for us until we restart the pbs_server at which time the gpus disappear (you can see this in the output of pbsnodes). The nodes file gets altered to remove the gpus= setting. Note that we are using version 3.0.3-snap.xxx and NOT the integrated nvidia gpu support. Does anyone else see the behavior? You don't need physical gpus to test, just a system you are prepared to mess with a little including restarting the pbs_server. Regards, Gareth Gareth, Have you entered a ticket in bugzilla for this. Ken _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -- Sean Reilly Systems Administrator & Applications Support Officer eResearchSA Phone : +61 8 8313 8352 Mobile: +61 450 840 246 [cid:image001.png at 01CD343F.A75DAA90] -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120517/5873e1df/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.png Type: image/png Size: 10004 bytes Desc: image001.png Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120517/5873e1df/attachment-0001.png From bdandrus at nps.edu Thu May 17 13:05:17 2012 From: bdandrus at nps.edu (Andrus, Brian Contractor) Date: Thu, 17 May 2012 19:05:17 +0000 Subject: [torqueusers] LDAP users and torque In-Reply-To: <9230853A36E499458AF5EAD38C13DCD5022A1E31@UOS-MSG00039-SI.soton.ac.uk> References: <9230853A36E499458AF5EAD38C13DCD5022A1DF5@UOS-MSG00039-SI.soton.ac.uk> <9230853A36E499458AF5EAD38C13DCD5022A1E31@UOS-MSG00039-SI.soton.ac.uk> Message-ID: Actually here, I think you need to restart pbs_server, not pbs_mom. Ensure you can see the user on the head node (getent passwd testuser) Then restart pbs_server and give it a shot. Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Baker D.J. Sent: Wednesday, May 09, 2012 9:14 AM To: torqueusers at supercluster.org Subject: [torqueusers] FW: LDAP users and torque Hello, On our RedHat Linux 5* cluster we are running torque 2.5.7, and moab 6.1.4. I have recently set up an LDAP server, and I'm starting to populate it with users and groups for a consortium project that we're working on. I'm already using NIS for university users, and LDAP is intended for external people. I set up the LDAP client configuration files on the login nodes, the head node (running torque and moab) and (as a test) on one of the compute nodes. I double checked that I could "see" LDAP users on these client nodes using getent. I then logged in to one of the login nodes as an LDAP user and attempted to submit a job. My initial attempt to submit a job failed - the error message indicated that torque could not find the user in the /etc/passwd on the node running the torque server. 
Given that I had set up ldap.conf and nsswitch.conf on that node, this seems odd. As a work around I placed the LDAP users in the /etc/passwd file on the torque server node and all went well. Could someone please help me to understand what might be wrong here. Do I need to configure torque in a particular way to fully integrate with LDAP, for example? Best regards -- David. _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From simon.brennan at ersa.edu.au Thu May 17 17:05:54 2012 From: simon.brennan at ersa.edu.au (Simon Brennan) Date: Fri, 18 May 2012 08:35:54 +0930 Subject: [torqueusers] LDAP users and torque In-Reply-To: References: <9230853A36E499458AF5EAD38C13DCD5022A1DF5@UOS-MSG00039-SI.soton.ac.uk> <9230853A36E499458AF5EAD38C13DCD5022A1E31@UOS-MSG00039-SI.soton.ac.uk> Message-ID: <4FB58452.8030905@ersa.edu.au> On my old CentOS 5 (5.1) cluster I had some LDAP troubles. Mainly I believe due to bugs in Openldap. Generally when ldap went awry I would invalidate the passwd cache / table. nscd -i passwd This is much safer than restarting the nscd daemon. As Andrus suggested, make sure your system can see the ldap accounts with getent. Worth a try Simon Brennan System Administrator eResearchSA Hedge House 14 Little Queen Street University of Adelaide Thebarton, SA 5031 On 05/18/2012 04:35 AM, Andrus, Brian Contractor wrote: > Actually here, I think you need to restart pbs_server, not pbs_mom. > > Ensure you can see the user on the head node (getent passwd testuser) > Then restart pbs_server and give it a shot. > > > Brian Andrus > ITACS/Research Computing > Naval Postgraduate School > Monterey, California > voice: 831-656-6238 > > > > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Baker D.J. 
> Sent: Wednesday, May 09, 2012 9:14 AM > To: torqueusers at supercluster.org > Subject: [torqueusers] FW: LDAP users and torque > > Hello, > > On our RedHat Linux 5* cluster we are running torque 2.5.7, and moab 6.1.4. I have recently set up an LDAP server, and I'm starting to populate it with users and groups for a consortium project that we're working on. I'm already using NIS for university users, and LDAP is intended for external people. > > I set up the LDAP client configuration files on the login nodes, the head node (running torque and moab) and (as a test) on one of the compute nodes. I double checked that I could "see" LDAP users on these client nodes using getent. I then logged in to one of the login nodes as an LDAP user and attempted to submit a job. > > My initial attempt to submit a job failed - the error message indicated that torque could not find the user in the /etc/passwd on the node running the torque server. Given that I had set up ldap.conf and nsswitch.conf on that node, this seems odd. As a work around I placed the LDAP users in the /etc/passwd file on the torque server node and all went well. > > Could someone please help me to understand what might be wrong here. Do I need to configure torque in a particular way to fully integrate with LDAP, for example? > > Best regards -- David. 
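The checks suggested in this thread can be strung together roughly as follows. This is a sketch, not a confirmed fix: user_visible is a hypothetical helper, CHECK_USER is an assumed environment variable, and root is only a placeholder default so the script runs anywhere. The script prints the follow-up steps rather than running them.

```shell
#!/bin/sh
# Sketch: confirm the head node can resolve a user before involving Torque.

# Succeeds if the name resolves through NSS (files, NIS, LDAP, ...).
user_visible() {
    getent passwd "$1" >/dev/null 2>&1
}

# Set CHECK_USER to the LDAP user to test; root is only a placeholder.
check_user="${CHECK_USER:-root}"

if user_visible "$check_user"; then
    echo "$check_user resolves on this host; if qsub still rejects the user,"
    echo "restarting pbs_server is the next step suggested in this thread"
else
    echo "$check_user does not resolve; check nsswitch.conf and ldap.conf,"
    echo "or try 'nscd -i passwd' as suggested in this thread"
fi
```

Run it on the node that hosts pbs_server, since that is where the lookup was reported to fail.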
> _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From lemgabri at gmail.com Thu May 17 13:58:26 2012 From: lemgabri at gmail.com (John Smith) Date: Thu, 17 May 2012 20:58:26 +0100 Subject: [torqueusers] TORQUE 4.0.1 does not allow submitting jobs as root In-Reply-To: References: Message-ID: Hi, I tried to allow job submission by root by executing: $> qmgr -c 's s acl_roots += root@*' but it doesn't seem to work! -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120517/7b6f9dd9/attachment.html From andrew.m.melo at vanderbilt.edu Fri May 18 16:05:18 2012 From: andrew.m.melo at vanderbilt.edu (Andrew Melo) Date: Fri, 18 May 2012 16:05:18 -0600 Subject: [torqueusers] Stageout parameters for qsub? Message-ID: Hello all, I'm working on some code to integrate my experiment's software with our cluster, and I'm having trouble figuring out the syntax for the -W stagein and -W stageout arguments. The man page claims: stagein=file_list > stageout=file_list > Specifies which files are staged (copied) in before job > start or staged out after the job completes execution. On completion of > the job, > all staged-in and staged-out files are removed from the > execution system. The file_list is in the form > local_file at hostname:remote_file[,...] > regardless of the direction of the copy. 
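For reference, the file_list form quoted from the man page can be sketched as below. Note that this archive rewrites "@" as " at ", so the form is local_file@hostname:remote_file. stage_entry and join_entries are hypothetical helpers, and remote_dir is a placeholder path. The sketch only illustrates the documented form; it makes no claim about what causes the error reported next.

```shell
#!/bin/sh
# Sketch: build a file_list in the form the man page describes,
#   local_file@hostname:remote_file[,...]
# stage_entry and join_entries are hypothetical helpers, not part of Torque.

stage_entry() {
    # $1 = local file on the execution host, $2 = host, $3 = remote path
    printf '%s@%s:%s' "$1" "$2" "$3"
}

join_entries() {
    # Join all arguments with commas.
    list=""
    for e in "$@"; do
        list="${list:+$list,}$e"
    done
    printf '%s' "$list"
}

host="vmps55.vampire"
remote_dir="/path/on/submit/host"   # placeholder, not the poster's real path

file_list="$(join_entries \
    "$(stage_entry stagein1.txt "$host" "$remote_dir/stagein1.txt")" \
    "$(stage_entry stagein2.txt "$host" "$remote_dir/stagein2.txt")")"

echo "#PBS -W stagein=$file_list"
```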
However, with these stanzas:

#PBS -W stagein=stagein1.txt at vmps55.vampire:/gpfs20/home/meloam/susy/shyft-v9-withtau/CMSSW_4_2_4/src/TopQuarkAnalysis/TopPairBSM/test/v1/stagein1.txt,stagein2.txt at vmps55.vampire:/gpfs20/home/meloam/susy/shyft-v9-withtau/CMSSW_4_2_4/src/TopQuarkAnalysis/TopPairBSM/test/v1/stagein2.txt

and

#PBS -W stageout=/tmp/stageout1.txt at vmps55.vampire:/gpfs20/home/meloam/susy/shyft-v9-withtau/CMSSW_4_2_4/src/TopQuarkAnalysis/TopPairBSM/test/v1/stageout1.txt,/tmp/stageout2.txt at vmps55.vampire:/gpfs20/home/meloam/susy/shyft-v9-withtau/CMSSW_4_2_4/src/TopQuarkAnalysis/TopPairBSM/test/v1/stageout2.txt

I get

vmps07:v1] qsub stageout.pbs
qsub: illegal -W value

(this happens with either stanza and also with both of them in the same PBS script)

Am I missing something really obvious? If it helps, my qsub is this version:

vmps07:v1] qsub --version
version: 2.5.9

Thanks a lot!
Andrew

--
Andrew Melo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120518/ec103f36/attachment-0001.html

From andrew.lahiff at stfc.ac.uk Fri May 18 18:02:29 2012
From: andrew.lahiff at stfc.ac.uk (andrew.lahiff at stfc.ac.uk)
Date: Sat, 19 May 2012 00:02:29 +0000
Subject: [torqueusers] problems submitting jobs with "15033 - Batch protocol error"
Message-ID:

Hi,

We are using torque and maui for a cluster with 826 hosts and 7440 job slots in total. The torque version is 2.5.10. We usually have a total of around 10000 running + pending jobs. For a small percentage of jobs, qsub fails with the error "15033 - Batch protocol error".

Does anyone know what could be causing this? Enough job submissions fail in this way for it to be a problem for us. I've done a lot of searching but haven't been able to find anything about this error.

Many Thanks,
Andrew.

--
Scanned by iCritical.

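Not an answer to the root cause of the intermittent submission failures, but a stop-gap sometimes used for flaky commands is a bounded retry. The sketch below assumes nothing about Torque itself: retry is a hypothetical helper and job.sh a hypothetical script. One caveat is that a failed qsub may still have reached the server, so it seems wise to check qstat before resubmitting, to avoid duplicate jobs.

```shell
#!/bin/sh
# Sketch: retry a command a few times with a pause between attempts.
# retry is a hypothetical helper, not part of Torque.
#   usage: retry MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    while :; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        # Give up once the attempt budget is spent.
        [ "$attempt" -ge "$max_attempts" ] && return 1
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (not run here): retry 3 5 qsub job.sh
```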
From dwallor at gmail.com Sat May 19 22:21:10 2012
From: dwallor at gmail.com (David Allor)
Date: Sun, 20 May 2012 00:21:10 -0400
Subject: [torqueusers] Jobs have errors on client node.
Message-ID:

Hi! I've played with torque before and understand a lot of the configuration, but I am guessing I have missed something major in setting up my client node. Every job which requests the server node runs perfectly, but the queue list says error for every process which requests the made-up 'vision' property for the client node.

I keep getting these same lines over and over in syslog, each time with a different filename.

May 19 23:34:01 MediaBox pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/spool/pbs/spool/167144.davidbook.ER david@DavidBook://Camera.e167144' failed with status=1, giving up after 4 attempts
May 19 23:34:01 MediaBox pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/spool/pbs/spool/167144.davidbook.ER to david@DavidBook://Camera.e167144

What gets to me is the "david@DavidBook://Camera.e167144". ??? ??? ??? ???? ?? ?? ???? ?? ???????

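For reference, a minimal shell sketch of how an scp-style target of the form [user@]host:path divides at the first colon, applied to the destination in the log lines above (with the archive's " at " read as "@"). The helper name is hypothetical, and the split is simplified compared with what scp really does.

```shell
#!/bin/sh
# Hypothetical helper: split an scp-style target "[user@]host:path" at the
# first colon and report how the path part would be resolved remotely.
split_scp_target() {
    target=$1
    case $target in
        *:*) ;;                                   # remote target
        *) echo "local path: $target"; return 0 ;; # no colon at all
    esac
    host=${target%%:*}     # everything before the first colon
    path=${target#*:}      # everything after the first colon
    case $path in
        /*) kind="absolute (starts at the filesystem root)" ;;
        *)  kind="relative to the remote user's home directory" ;;
    esac
    echo "host=$host path=$path -> $kind"
}

split_scp_target 'david@DavidBook://Camera.e167144'
split_scp_target 'david@DavidBook:Camera.e167144'
```

Read this way, the path part of the logged destination is //Camera.e167144, which begins with a slash and is therefore absolute. On Linux a leading "//" is normally treated like "/", so the copy appears to be aimed at the top of the filesystem rather than at a job directory.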
Here's some server log: 05/19/2012 23:32:40;0010;PBS_Server;Job;167111.davidbook;Exit_status=-2 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00 05/19/2012 23:32:40;000d;PBS_Server;Job;167111.davidbook;Email 'a' to david at DavidBook failed: Child process 'sendmail -f adm david at DavidBook' returned 127 (errno 10:No child processes) 05/19/2012 23:32:42;000d;PBS_Server;Job;167109.davidbook;Post job file processing error; job 167109.davidbook on host MediaBox/3 05/19/2012 23:32:42;0040;PBS_Server;Svr;davidbook;Scheduler was sent the command term 05/19/2012 23:32:42;000d;PBS_Server;Job;167109.davidbook;Email 'o' to david at DavidBook failed: Child process 'sendmail -f adm david at DavidBook' returned 127 (errno 10:No child processes) 05/19/2012 23:32:42;0010;PBS_Server;Job;167105.davidbook;Exit_status=-2 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:19 05/19/2012 23:32:42;000d;PBS_Server;Job;167105.davidbook;Email 'a' to david at DavidBook failed: Child process 'sendmail -f adm david at DavidBook' returned 127 (errno 10:No child processes) 05/19/2012 23:32:43;0100;PBS_Server;Job;167109.davidbook;dequeuing from batch, state COMPLETE 05/19/2012 23:32:43;0040;PBS_Server;Svr;davidbook;Scheduler was sent the command term 05/19/2012 23:32:43;0100;PBS_Server;Job;167112.davidbook;enqueuing into batch, state 1 hop 1 05/19/2012 23:32:43;0008;PBS_Server;Job;167112.davidbook;Job Queued at request of david at DavidBook, owner = david at DavidBook, job name = VisionProcessManager, queue = batch 05/19/2012 23:32:43;0040;PBS_Server;Svr;davidbook;Scheduler was sent the command new 05/19/2012 23:32:43;0008;PBS_Server;Job;167112.davidbook;Job Modified at request of Scheduler at DavidBook 05/19/2012 23:32:43;0008;PBS_Server;Job;167112.davidbook;Job Run at request of Scheduler at DavidBook 05/19/2012 23:32:43;0040;PBS_Server;Svr;davidbook;Scheduler was sent the command 
recyc 05/19/2012 23:32:43;0010;PBS_Server;Job;167112.davidbook;Exit_status=-2 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:01 05/19/2012 23:32:43;000d;PBS_Server;Job;167112.davidbook;Email 'a' to david at DavidBook failed: Child process 'sendmail -f adm david at DavidBook' returned 127 (errno 10:No child processes) And here's mom log: 05/19/2012 23:32:29;0080; ? pbs_mom;Job;167104.davidbook;obit sent to server 05/19/2012 23:32:31;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 05/19/2012 23:32:31;0001; ? pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry (see syslog for more information) 05/19/2012 23:32:31;0008; ? pbs_mom;Req;send_sisters;sending ABORT to sisters for job 167105.davidbook 05/19/2012 23:32:31;0080; ? pbs_mom;Svr;preobit_reply;top of preobit_reply 05/19/2012 23:32:31;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 05/19/2012 23:32:31;0080; ? pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 05/19/2012 23:32:31;0001; ? pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 1023 in client_to_svr - connection refused 05/19/2012 23:32:33;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 05/19/2012 23:32:33;0001; ? pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry (see syslog for more information) 05/19/2012 23:32:33;0008; ? pbs_mom;Req;send_sisters;sending ABORT to sisters for job 167106.davidbook 05/19/2012 23:32:33;0080; ? pbs_mom;Svr;preobit_reply;top of preobit_reply 05/19/2012 23:32:33;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 05/19/2012 23:32:33;0080; ? pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 05/19/2012 23:32:33;0080; ? 
pbs_mom;Job;167106.davidbook;obit sent to server
05/19/2012 23:32:36;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE
05/19/2012 23:32:36;0001; pbs

From dwallor at gmail.com Sat May 19 22:24:37 2012
From: dwallor at gmail.com (David Allor)
Date: Sun, 20 May 2012 00:24:37 -0400
Subject: [torqueusers] Jobs have errors on client node.
In-Reply-To:
References:
Message-ID:

For some reason part of my last message came out in another alphabet. I meant to say: of "david@DavidBook://Camera.e167144", where does that point? Is it missing the directory?

From jkusznir at gmail.com Mon May 21 11:47:51 2012
From: jkusznir at gmail.com (Jim Kusznir)
Date: Mon, 21 May 2012 10:47:51 -0700
Subject: [torqueusers] Suspending Jobs?
Message-ID:

Hi all:

I recently was made aware that SGE in theory has the ability to 'suspend' low priority jobs in favor of higher priority ones. This would potentially solve the common problem of the low-paying, heavy-using customer who can saturate a cluster and keep it that way, but the paying users who need timely access on short, smaller jobs would like to share.

Can this be done with torque+maui these days?

--Jim

From david at unistra.fr Mon May 21 12:35:05 2012
From: david at unistra.fr (R. David)
Date: Mon, 21 May 2012 20:35:05 +0200
Subject: [torqueusers] Suspending Jobs?
In-Reply-To:
References:
Message-ID: <5D899BA5-4AF3-4B75-BA06-1CF0C8EC63FA@unistra.fr>

On 21 May 2012 at 19:47, Jim Kusznir wrote:

> Can this be done with torque+maui these days?

Yes: by defining QOS in Maui that have either the PREEMPTOR or the PREEMPTEE flag. You would then have to decide whether you want to re-run jobs which are suspended, or just put them to sleep.
Furthermore, it then depends on the ability of the resource manager (Torque) to control processes correctly, in order to be able to stop/restart them.

Regards,
R. David

From knielson at adaptivecomputing.com Mon May 21 15:28:06 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Mon, 21 May 2012 15:28:06 -0600
Subject: [torqueusers] Jobs have errors on client node.
In-Reply-To:
References:
Message-ID:

On Sat, May 19, 2012 at 10:21 PM, David Allor wrote:

> Hi! I've played with torque before and understand a lot of the
> configuration, but I am guessing I have missed something major in
> setting up my client node. Every job which requests the server node
> runs perfectly, but the queue list says error for every process which
> requests the made-up 'vision' property for the client node.
>
> I keep getting this same lines over and over in syslog with a
> different filename.
>
> May 19 23:34:01 MediaBox pbs_mom: LOG_ERROR::sys_copy, command
> '/usr/bin/scp -rpB /var/spool/pbs/spool/167144.davidbook.ER
> david at DavidBook://Camera.e167144' failed with status=1, giving up
> after 4 attempts
> May 19 23:34:01 MediaBox pbs_mom: LOG_ERROR::req_cpyfile, Unable to
> copy file /var/spool/pbs/spool/167144.davidbook.ER to
> david at DavidBook://Camera.e167144

David,

The problem is that the MOM where the job executed does not have copy privileges to the directory from which the job was submitted. You can remedy this in a few different ways. There are several administrators on this list who can answer this better than I can. However, you could start by setting up SSH keys between the MOMs and the submit hosts. Or you might try an NFS share. Again, I will yield to those who set this stuff up for their users.

Regards

Ken Nielson
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120521/e37f30b0/attachment.html

From danielfcoimbra at gmail.com Mon May 21 15:29:00 2012
From: danielfcoimbra at gmail.com (Daniel Fernando Coimbra)
Date: Mon, 21 May 2012 18:29:00 -0300
Subject: [torqueusers] Jobs have errors on client node.
In-Reply-To:
References:
Message-ID: <4FBAB39C.7000505@gmail.com>

I can tell that the notation "://" is not what scp expects. It is supposed to be "somemachine:somename". If somename is of the form "/something", the file is placed starting from the filesystem root; otherwise it is placed relative to the user's home dir.

The cause could be something wrong in the "-e" and "-o" arguments to qsub.

It also looks like there are some problems with your mail setup.

On 20-05-2012 01:24, David Allor wrote:
> For some reason part my last message came out in another alphabet. I
> meant to say, of "david at DavidBook://Camera.e167144", where does that
> point? Is it missing the directory?

--
Daniel Fernando Coimbra
Grupo de Estrutura Eletrônica Molecular
Departamento de Química
Universidade Federal de Santa Catarina

From dwallor at gmail.com Mon May 21 18:10:19 2012
From: dwallor at gmail.com (David Allor)
Date: Mon, 21 May 2012 20:10:19 -0400
Subject: [torqueusers] Jobs have errors on client node.
In-Reply-To:
References:
Message-ID:

I have gotten it to work by adding the -e and -o options to qsub. Using ssh works fine without a password, but the problem is the incorrect syntax for scp, '/usr/bin/scp -rpB /var/spool/pbs/spool/167144.davidbook.ER david@DavidBook://Camera.e167144'. "david@DavidBook://Camera.e167144" is obviously wrong.

When I had this problem I was not using the -e or -o option on qsub or $usecp in mom_priv/config. I tried fixing this with $usecp and then with qsub -e /dir/stderrfile -o /dir/stdoutfile, and the latter method fixed it. So it's a workaround and requires extra parameters with every use of qsub, but I'm glad it works until I figure out the problem.

Thank you both for the responses.

From jujj603 at gmail.com Mon May 21 19:34:57 2012
From: jujj603 at gmail.com (Ju JiaJia)
Date: Tue, 22 May 2012 09:34:57 +0800
Subject: [torqueusers] Stageout parameters for qsub?
In-Reply-To:
References:
Message-ID:

Prefix every stagein/stageout file with stagein= or stageout= respectively. In your case:

> #PBS -W stagein=stagein1.txt@vmps55.vampire:/gpfs20/home/meloam/susy/shyft-v9-withtau/CMSSW_4_2_4/src/TopQuarkAnalysis/TopPairBSM/test/v1/stagein1.txt,stagein=stagein2.txt@vmps55.vampire:/gpfs20/home/meloam/susy/shyft-v9-withtau/CMSSW_4_2_4/src/TopQuarkAnalysis/TopPairBSM/test/v1/stagein2.txt

and

> #PBS -W stageout=/tmp/stageout1.txt@vmps55.vampire:/gpfs20/home/meloam/susy/shyft-v9-withtau/CMSSW_4_2_4/src/TopQuarkAnalysis/TopPairBSM/test/v1/stageout1.txt,stageout=/tmp/stageout2.txt@vmps55.vampire:/gpfs20/home/meloam/susy/shyft-v9-withtau/CMSSW_4_2_4/src/TopQuarkAnalysis/TopPairBSM/test/v1/stageout2.txt

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120522/606096ce/attachment.html

From lemgabri at gmail.com Wed May 23 09:25:31 2012
From: lemgabri at gmail.com (John Smith)
Date: Wed, 23 May 2012 16:25:31 +0100
Subject: [torqueusers] TORQUE 4.0.0 does not allow submitting jobs as
Message-ID:

The problem is still unresolved with 4.0.1 and 4.0.2!

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120523/c8d09d6c/attachment.html From knielson at adaptivecomputing.com Wed May 23 09:29:17 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Wed, 23 May 2012 09:29:17 -0600 Subject: [torqueusers] TORQUE 4.0.0 does not allow submitting jobs as In-Reply-To: References: Message-ID: On Wed, May 23, 2012 at 9:25 AM, John Smith wrote: > The problem is still unresolved with 4.0.1 and 4.0.2 !!!! > > _______________________________________________ > John, Did you remember to load trqauthd? If so please remind us of the details for this problem. Regards Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120523/ce92be9e/attachment.html From lemgabri at gmail.com Wed May 23 11:49:27 2012 From: lemgabri at gmail.com (John Smith) Date: Wed, 23 May 2012 18:49:27 +0100 Subject: [torqueusers] TORQUE 4.0.0 does not allow submitting jobs as Message-ID: Hi Ken, It seems that it's the same problem as here: http://www.clusterresources.com/pipermail/torqueusers/2012-March/014325.html -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120523/2aec5eac/attachment.html From arkaaloke at gmail.com Wed May 23 15:07:01 2012 From: arkaaloke at gmail.com (Arka Aloke Bhattacharya) Date: Wed, 23 May 2012 14:07:01 -0700 Subject: [torqueusers] Is there any statistic on how many HPC clusters throughout the world use Torque ? Message-ID: Hi, I was just making a few slides on some feature I added to Torque. However, I could not find any useful data on how many universities/HPC clusters use Torque as their HPC resource manager. Is there any website which could provide some statistics on this ? Arka. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120523/64920c34/attachment-0001.html

From shantanugadgil at yahoo.com Fri May 25 01:56:50 2012
From: shantanugadgil at yahoo.com (Shantanu Gadgil)
Date: Fri, 25 May 2012 00:56:50 -0700 (PDT)
Subject: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED]
Message-ID: <1337932610.72446.YahooMailClassic@web120602.mail.ne1.yahoo.com>

Hi,

I was trying to build TORQUE 3.0.5 on CentOS x86_64 along with the "gui" parameter, like so:

# rpmbuild --with gui -tb torque-3.0.5.tar.gz

This kept failing with the "Tcl was requested but not found" error.

I knew that I had all the necessary "tcl" and "tk" packages installed. (I had /already/ installed ALL the packages myself.)

# yum -y install "tcl-*" "tk-*"

The error kept appearing in spite of this!

On a hunch I installed the "32 bit" devel packages to see if that helped:

# yum -y install "tcl-devel.i686" "tk-devel.i686"

Once the 32-bit packages and their dependencies were installed, I could get the GUI rpm built properly:

torque-3.0.5-1.cri.x86_64.rpm
torque-client-3.0.5-1.cri.x86_64.rpm
torque-debuginfo-3.0.5-1.cri.x86_64.rpm
torque-devel-3.0.5-1.cri.x86_64.rpm
torque-gui-3.0.5-1.cri.x86_64.rpm
torque-scheduler-3.0.5-1.cri.x86_64.rpm
torque-server-3.0.5-1.cri.x86_64.rpm

If this is not mentioned anywhere, maybe it can be added to the configuration/build notes?

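The sequence described above can be gathered into a small shell sketch. It prints the commands by default rather than running them; the DRY_RUN switch and the function names are assumptions added for illustration, while the yum and rpmbuild invocations are the ones from this message.

```shell
#!/bin/sh
# Sketch of the build sequence from this message. By default it only prints
# the commands; set DRY_RUN=0 to really run them (as root, on CentOS x86_64).
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it unchanged.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_torque_gui() {
    tarball=$1
    # Tcl/Tk packages that were already installed
    run yum -y install "tcl-*" "tk-*"
    # 32-bit Tcl/Tk devel packages, the step after which the build succeeded
    run yum -y install tcl-devel.i686 tk-devel.i686
    # Build binary RPMs from the tarball, including the GUI package
    run rpmbuild --with gui -tb "$tarball"
}

build_torque_gui torque-3.0.5.tar.gz
```

Keeping the dry run as the default means the sketch can be read and checked on a machine without yum or rpmbuild installed.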
Thanks and regards,
Shantanu Gadgil

From knielson at adaptivecomputing.com Fri May 25 08:43:56 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 25 May 2012 08:43:56 -0600
Subject: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED]
In-Reply-To: <1337932610.72446.YahooMailClassic@web120602.mail.ne1.yahoo.com>
References: <1337932610.72446.YahooMailClassic@web120602.mail.ne1.yahoo.com>
Message-ID:

On Fri, May 25, 2012 at 1:56 AM, Shantanu Gadgil wrote:

> On a hunch I installed the "32 bit" devel packages to see if that helped:
>
> # yum -y install "tcl-devel.i686" "tk-devel.i686"

This is code we have not touched since I have been at Adaptive Computing. Is this something that should be updated to use newer 64-bit libraries? Or is just compiling against the 32-bit library enough?

Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120525/d45ec598/attachment.html

From jgbaum at liai.org Fri May 25 09:54:44 2012
From: jgbaum at liai.org (J. Greenbaum)
Date: Fri, 25 May 2012 08:54:44 -0700 (PDT)
Subject: [torqueusers] torque rejecting jobs with codes 15001, 15056, & 15031
In-Reply-To: <075a21a9-1f69-473f-a5f7-eaf52b71b498@zimbra-1>
Message-ID: <5c9a17d0-9c7e-4848-a96a-92f51ba93743@zimbra-1>

Hi,

I have an instance of Galaxy (http://usegalaxy.org) running on our local
cluster and I've been getting some strange errors when it submits jobs to the
queue. I see reject codes 15001 & 15056 in the torque server log and code
15031 in the Galaxy log. It happens only sporadically, but often enough to
keep us from using it as a reliable solution. I've googled around a bit and
noticed that others have had similar issues, but I do not see a solution
anywhere. It has been suggested that the problem could be arising because
Galaxy submits jobs in rapid succession. Any suggestions would be appreciated.
The relevant sections of the torque server log and Galaxy log files are at the
end of this message.

Thanks in advance.

-Jason

--
Jason Greenbaum, Ph.D.
Manager, Bioinformatics Core | jgbaum at liai.org La Jolla Institute for Allergy and Immunology Galaxy log: galaxy.jobs.manager DEBUG 2012-05-25 08:12:52,692 (91) Job assigned to handler 'main' galaxy.jobs.manager DEBUG 2012-05-25 08:12:52,694 (92) Job assigned to handler 'main' galaxy.jobs.manager DEBUG 2012-05-25 08:12:52,695 (93) Job assigned to handler 'main' galaxy.jobs.manager DEBUG 2012-05-25 08:12:52,696 (94) Job assigned to handler 'main' galaxy.jobs DEBUG 2012-05-25 08:12:58,001 (91) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/91 galaxy.jobs.handler DEBUG 2012-05-25 08:12:58,001 dispatching job 91 to pbs runner galaxy.jobs.handler INFO 2012-05-25 08:12:58,254 (91) Job dispatched galaxy.jobs DEBUG 2012-05-25 08:12:58,340 (92) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/92 galaxy.jobs.handler DEBUG 2012-05-25 08:12:58,418 dispatching job 92 to pbs runner galaxy.jobs.handler INFO 2012-05-25 08:12:58,656 (92) Job dispatched galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:12:58,789 (91) submitting file /share/apps/galaxy-dist/database/pbs/91.sh galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:12:58,791 (91) command is: python /share/apps/galaxy-dist/tools/solid_tools/solid_qual_stats.py /share/apps/galaxy-dist/database/files/000/dataset_5.dat /share/apps/galaxy-dist/database/files/000/dataset_152.dat galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:12:58,792 (91) queued in default queue as 1469.herman.liai.org galaxy.jobs DEBUG 2012-05-25 08:12:58,816 (93) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/93 galaxy.jobs.handler DEBUG 2012-05-25 08:12:58,816 dispatching job 93 to pbs runner galaxy.jobs.handler INFO 2012-05-25 08:12:59,064 (93) Job dispatched galaxy.jobs DEBUG 2012-05-25 08:12:59,206 (94) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/94 galaxy.jobs.handler DEBUG 2012-05-25 08:12:59,206 
dispatching job 94 to pbs runner galaxy.jobs.manager DEBUG 2012-05-25 08:12:59,433 (95) Job assigned to handler 'main' galaxy.jobs.manager DEBUG 2012-05-25 08:12:59,434 (96) Job assigned to handler 'main' galaxy.jobs.manager DEBUG 2012-05-25 08:12:59,435 (97) Job assigned to handler 'main' galaxy.jobs.manager DEBUG 2012-05-25 08:12:59,435 (98) Job assigned to handler 'main' galaxy.jobs.manager DEBUG 2012-05-25 08:12:59,436 (99) Job assigned to handler 'main' galaxy.jobs.manager DEBUG 2012-05-25 08:12:59,436 (100) Job assigned to handler 'main' galaxy.jobs.manager DEBUG 2012-05-25 08:12:59,437 (101) Job assigned to handler 'main' galaxy.jobs.manager DEBUG 2012-05-25 08:12:59,445 (102) Job assigned to handler 'main' galaxy.jobs.manager DEBUG 2012-05-25 08:12:59,456 (103) Job assigned to handler 'main' galaxy.jobs.manager DEBUG 2012-05-25 08:12:59,463 (104) Job assigned to handler 'main' galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:05,454 (91/1469.herman.liai.org) PBS job state changed from N to R galaxy.jobs.handler INFO 2012-05-25 08:13:05,491 (94) Job dispatched galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:05,561 (92) submitting file /share/apps/galaxy-dist/database/pbs/92.sh galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:05,562 (92) command is: python /share/apps/galaxy-dist/tools/next_gen_conversion/solid2fastq.py --fr=/share/apps/galaxy-dist/database/files/000/dataset_4.dat --fq=/share/apps/galaxy-dist/database/files/000/dataset_5.dat --fout=/share/apps/galaxy-dist/database/files/000/dataset_153.dat -q 0 -t galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:05,584 (92) pbs_submit failed, PBS error 15031: Protocol (ASN.1) error galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:06,093 (93) submitting file /share/apps/galaxy-dist/database/pbs/93.sh galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:06,093 (93) command is: python /share/apps/galaxy-dist/tools/next_gen_conversion/solid2fastq.py --fr=/share/apps/galaxy-dist/database/files/000/dataset_2.dat 
--fq=/share/apps/galaxy-dist/database/files/000/dataset_66.dat --fout=/share/apps/galaxy-dist/database/files/000/dataset_154.dat -q 0 -t galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:06,103 (93) queued in default queue as 1470.herman.liai.org galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:06,287 (94) submitting file /share/apps/galaxy-dist/database/pbs/94.sh galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:06,289 (94) command is: python /share/apps/galaxy-dist/tools/solid_tools/solid_qual_stats.py /share/apps/galaxy-dist/database/files/000/dataset_66.dat /share/apps/galaxy-dist/database/files/000/dataset_155.dat galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:06,291 (94) queued in default queue as 1471.herman.liai.org galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:06,911 (91/1469.herman.liai.org) PBS job has left queue galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:06,912 (93/1470.herman.liai.org) PBS job has left queue galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:06,915 (94/1471.herman.liai.org) PBS job has left queue galaxy.jobs DEBUG 2012-05-25 08:13:07,716 job 94 ended galaxy.jobs DEBUG 2012-05-25 08:13:07,717 job 93 ended galaxy.jobs DEBUG 2012-05-25 08:13:07,726 job 91 ended 10.0.6.167 - - [25/May/2012:08:12:50 -0700] "POST /workflow/run?id=f2db41e1fa331b3e HTTP/1.0" 200 - "http://herman:8081/workflow/run?id=f2db41e1fa331b3e" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0" 10.0.6.167 - - [25/May/2012:08:13:10 -0700] "GET /history HTTP/1.0" 200 - "http://herman:8081/workflow/run?id=f2db41e1fa331b3e" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0" 10.0.6.167 - - [25/May/2012:08:13:11 -0700] "POST /root/user_get_usage HTTP/1.0" 200 - "http://herman:8081/history" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0" galaxy.jobs DEBUG 2012-05-25 08:13:11,158 (95) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/95 galaxy.jobs.handler DEBUG 
2012-05-25 08:13:11,158 dispatching job 95 to pbs runner galaxy.jobs.handler INFO 2012-05-25 08:13:11,225 (95) Job dispatched galaxy.jobs DEBUG 2012-05-25 08:13:11,304 (96) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/96 galaxy.jobs.handler DEBUG 2012-05-25 08:13:11,305 dispatching job 96 to pbs runner galaxy.jobs.handler INFO 2012-05-25 08:13:11,415 (96) Job dispatched galaxy.jobs DEBUG 2012-05-25 08:13:11,567 (97) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/97 galaxy.jobs.handler DEBUG 2012-05-25 08:13:11,582 dispatching job 97 to pbs runner galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:11,729 (95) submitting file /share/apps/galaxy-dist/database/pbs/95.sh galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:11,729 (95) command is: python /share/apps/galaxy-dist/tools/solid_tools/solid_qual_stats.py /share/apps/galaxy-dist/database/files/000/dataset_7.dat /share/apps/galaxy-dist/database/files/000/dataset_156.dat galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:11,730 (95) queued in default queue as 1472.herman.liai.org galaxy.jobs.handler INFO 2012-05-25 08:13:11,817 (97) Job dispatched galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:17,898 (95/1472.herman.liai.org) PBS job state changed from N to Q galaxy.jobs DEBUG 2012-05-25 08:13:17,937 (98) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/98 galaxy.jobs.handler DEBUG 2012-05-25 08:13:17,937 dispatching job 98 to pbs runner galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:17,956 (96) submitting file /share/apps/galaxy-dist/database/pbs/96.sh galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:17,957 (96) command is: python /share/apps/galaxy-dist/tools/next_gen_conversion/solid2fastq.py --fr=/share/apps/galaxy-dist/database/files/000/dataset_6.dat --fq=/share/apps/galaxy-dist/database/files/000/dataset_7.dat --fout=/share/apps/galaxy-dist/database/files/000/dataset_157.dat -q 0 -t 
galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:17,957 (96) pbs_submit failed, PBS error 15031: Protocol (ASN.1) error galaxy.jobs.manager DEBUG 2012-05-25 08:13:17,999 (105) Job assigned to handler 'main' galaxy.jobs.handler INFO 2012-05-25 08:13:18,188 (98) Job dispatched galaxy.jobs DEBUG 2012-05-25 08:13:18,505 (99) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/99 galaxy.jobs.handler DEBUG 2012-05-25 08:13:18,505 dispatching job 99 to pbs runner galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:18,764 (97) submitting file /share/apps/galaxy-dist/database/pbs/97.sh galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:18,764 (97) command is: python /share/apps/galaxy-dist/tools/next_gen_conversion/solid2fastq.py --fr=/share/apps/galaxy-dist/database/files/000/dataset_65.dat --fq=/share/apps/galaxy-dist/database/files/000/dataset_66.dat --fout=/share/apps/galaxy-dist/database/files/000/dataset_158.dat -q 0 -t galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:18,766 (97) queued in default queue as 1473.herman.liai.org galaxy.jobs.handler INFO 2012-05-25 08:13:18,789 (99) Job dispatched 10.0.6.167 - - [25/May/2012:08:13:17 -0700] "POST /root/history_item_updates HTTP/1.0" 200 - "http://herman:8081/history" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0" galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:18,873 (98) submitting file /share/apps/galaxy-dist/database/pbs/98.sh galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:18,873 (98) command is: python /share/apps/galaxy-dist/tools/solid_tools/solid_qual_stats.py /share/apps/galaxy-dist/database/files/000/dataset_66.dat /share/apps/galaxy-dist/database/files/000/dataset_159.dat galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:18,875 (98) queued in default queue as 1474.herman.liai.org galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:18,901 (95/1472.herman.liai.org) PBS job has left queue galaxy.jobs DEBUG 2012-05-25 08:13:18,984 (100) Working directory for job is: 
/share/apps/galaxy-dist/database/job_working_directory/000/100 galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:19,238 (99) submitting file /share/apps/galaxy-dist/database/pbs/99.sh galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:19,238 (99) command is: bash /share/apps/galaxy-dist/tools/solid_tools/qualsolid_boxplot_graph.sh -t 's1_F3_quality_stats.txt' -i /share/apps/galaxy-dist/database/files/000/dataset_152.dat -o /share/apps/galaxy-dist/database/files/000/dataset_160.dat galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:19,239 (99) queued in default queue as 1475.herman.liai.org galaxy.jobs DEBUG 2012-05-25 08:13:19,407 job 95 ended galaxy.jobs.handler INFO 2012-05-25 08:13:19,422 (100) Job unable to run: one or more inputs in error state galaxy.jobs DEBUG 2012-05-25 08:13:19,453 (101) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/101 galaxy.jobs.handler DEBUG 2012-05-25 08:13:19,453 dispatching job 101 to pbs runner galaxy.jobs.handler INFO 2012-05-25 08:13:19,554 (101) Job dispatched galaxy.jobs DEBUG 2012-05-25 08:13:19,647 (102) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/102 galaxy.jobs.handler DEBUG 2012-05-25 08:13:19,647 dispatching job 102 to pbs runner galaxy.jobs.handler INFO 2012-05-25 08:13:19,738 (102) Job dispatched galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:25,816 (97/1473.herman.liai.org) PBS job has left queue galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:25,821 (98/1474.herman.liai.org) PBS job has left queue galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:25,822 (99/1475.herman.liai.org) PBS job has left queue galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:25,861 (101) submitting file /share/apps/galaxy-dist/database/pbs/101.sh galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:25,861 (101) command is: bash /share/apps/galaxy-dist/tools/solid_tools/qualsolid_boxplot_graph.sh -t 's1_F5_quality_stats.txt' -i 
/share/apps/galaxy-dist/database/files/000/dataset_155.dat -o /share/apps/galaxy-dist/database/files/000/dataset_163.dat galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:25,862 (101) pbs_submit failed, PBS error 15031: Protocol (ASN.1) error galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:26,137 (102) submitting file /share/apps/galaxy-dist/database/pbs/102.sh galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:26,137 (102) command is: bash /share/apps/galaxy-dist/tools/solid_tools/qualsolid_boxplot_graph.sh -t 's2_F3_quality_stats.txt' -i /share/apps/galaxy-dist/database/files/000/dataset_156.dat -o /share/apps/galaxy-dist/database/files/000/dataset_164.dat galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:26,141 (102) queued in default queue as 1476.herman.liai.org galaxy.jobs DEBUG 2012-05-25 08:13:26,528 job 97 ended galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:26,830 (102/1476.herman.liai.org) PBS job state changed from N to Q galaxy.jobs DEBUG 2012-05-25 08:13:26,852 job 99 ended galaxy.jobs DEBUG 2012-05-25 08:13:26,881 job 98 ended galaxy.jobs DEBUG 2012-05-25 08:13:27,094 (103) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/103 10.0.6.167 - - [25/May/2012:08:13:25 -0700] "POST /root/history_item_updates HTTP/1.0" 200 - "http://herman:8081/history" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0" galaxy.jobs.handler INFO 2012-05-25 08:13:27,274 (103) Job unable to run: one or more inputs in error state galaxy.jobs DEBUG 2012-05-25 08:13:27,294 (104) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/104 galaxy.jobs.handler DEBUG 2012-05-25 08:13:27,294 dispatching job 104 to pbs runner galaxy.jobs.handler INFO 2012-05-25 08:13:27,352 (104) Job dispatched galaxy.jobs DEBUG 2012-05-25 08:13:27,440 (105) Working directory for job is: /share/apps/galaxy-dist/database/job_working_directory/000/105 10.0.6.167 - - [25/May/2012:08:13:27 -0700] "POST 
/root/history_get_disk_size HTTP/1.0" 200 - "http://herman:8081/history" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0" galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:27,816 (104) submitting file /share/apps/galaxy-dist/database/pbs/104.sh galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:27,817 (104) command is: bash /share/apps/galaxy-dist/tools/solid_tools/qualsolid_boxplot_graph.sh -t 's2_F5_quality_stats.txt' -i /share/apps/galaxy-dist/database/files/000/dataset_159.dat -o /share/apps/galaxy-dist/database/files/000/dataset_167.dat galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:27,818 (104) queued in default queue as 1477.herman.liai.org galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:27,835 (102/1476.herman.liai.org) PBS job has left queue galaxy.jobs DEBUG 2012-05-25 08:13:28,153 job 102 ended galaxy.jobs.handler INFO 2012-05-25 08:13:28,653 (105) Job unable to run: one or more inputs in error state galaxy.jobs.runners.pbs DEBUG 2012-05-25 08:13:28,837 (104/1477.herman.liai.org) PBS job has left queue galaxy.jobs DEBUG 2012-05-25 08:13:28,972 job 104 ended Torque server log: 05/25/2012 08:12:58;0100;PBS_Server;Job;1469.herman.liai.org;enqueuing into default, state 1 hop 1 05/25/2012 08:12:58;0008;PBS_Server;Job;1469.herman.liai.org;Job Queued at request of galaxy at herman.liai.org, owner = galaxy at herman.liai.org, job name = 91 _solid_qual_stats_jgbaum at liai.org, queue = default 05/25/2012 08:12:58;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command new 05/25/2012 08:12:59;0008;PBS_Server;Job;1469.herman.liai.org;Job Modified at request of maui at herman.liai.org 05/25/2012 08:12:59;0008;PBS_Server;Job;1469.herman.liai.org;Job Run at request of maui at herman.liai.org 05/25/2012 08:12:59;000d;PBS_Server;Job;1469.herman.liai.org;Not sending email: Default mailpoint does not include this type. 
05/25/2012 08:13:05;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 7 (Premature end of message), type=Connect 05/25/2012 08:13:05;0080;PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @ 05/25/2012 08:13:05;000d;PBS_Server;Job;1469.herman.liai.org;Not sending email: Default mailpoint does not include this type. 05/25/2012 08:13:05;0010;PBS_Server;Job;1469.herman.liai.org;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb reso urces_used.walltime=00:00:06 05/25/2012 08:13:05;0100;PBS_Server;Job;1469.herman.liai.org;dequeuing from default, state COMPLETE 05/25/2012 08:13:05;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command term 05/25/2012 08:13:06;0100;PBS_Server;Job;1470.herman.liai.org;enqueuing into default, state 1 hop 1 05/25/2012 08:13:06;0008;PBS_Server;Job;1470.herman.liai.org;Job Queued at request of galaxy at herman.liai.org, owner = galaxy at herman.liai.org, job name = 93 _solid2fastq_jgbaum at liai.org, queue = default 05/25/2012 08:13:06;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command new 05/25/2012 08:13:06;0008;PBS_Server;Job;1470.herman.liai.org;Job Modified at request of maui at herman.liai.org 05/25/2012 08:13:06;0008;PBS_Server;Job;1470.herman.liai.org;Job Run at request of maui at herman.liai.org 05/25/2012 08:13:06;000d;PBS_Server;Job;1470.herman.liai.org;Not sending email: Default mailpoint does not include this type. 
05/25/2012 08:13:06;0100;PBS_Server;Job;1471.herman.liai.org;enqueuing into default, state 1 hop 1 05/25/2012 08:13:06;0008;PBS_Server;Job;1471.herman.liai.org;Job Queued at request of galaxy at herman.liai.org, owner = galaxy at herman.liai.org, job name = 94 _solid_qual_stats_jgbaum at liai.org, queue = default 05/25/2012 08:13:06;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command new 05/25/2012 08:13:06;0008;PBS_Server;Job;1471.herman.liai.org;Job Modified at request of maui at herman.liai.org 05/25/2012 08:13:06;0008;PBS_Server;Job;1471.herman.liai.org;Job Run at request of maui at herman.liai.org 05/25/2012 08:13:06;000d;PBS_Server;Job;1470.herman.liai.org;Not sending email: Default mailpoint does not include this type. 05/25/2012 08:13:06;0010;PBS_Server;Job;1470.herman.liai.org;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb reso urces_used.walltime=00:00:00 05/25/2012 08:13:06;0100;PBS_Server;Job;1470.herman.liai.org;dequeuing from default, state COMPLETE 05/25/2012 08:13:06;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command term 05/25/2012 08:13:06;000d;PBS_Server;Job;1471.herman.liai.org;Not sending email: Default mailpoint does not include this type. 05/25/2012 08:13:06;000d;PBS_Server;Job;1471.herman.liai.org;Not sending email: Default mailpoint does not include this type. 
05/25/2012 08:13:06;0010;PBS_Server;Job;1471.herman.liai.org;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb reso urces_used.walltime=00:00:00 05/25/2012 08:13:06;0100;PBS_Server;Job;1471.herman.liai.org;dequeuing from default, state COMPLETE 05/25/2012 08:13:06;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command term 05/25/2012 08:13:06;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from galaxy at herman.liai.org 05/25/2012 08:13:06;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from galaxy at herman.liai.org 05/25/2012 08:13:06;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from galaxy at herman.liai.org 05/25/2012 08:13:11;0100;PBS_Server;Job;1472.herman.liai.org;enqueuing into default, state 1 hop 1 05/25/2012 08:13:11;0008;PBS_Server;Job;1472.herman.liai.org;Job Queued at request of galaxy at herman.liai.org, owner = galaxy at herman.liai.org, job name = 95 _solid_qual_stats_jgbaum at liai.org, queue = default 05/25/2012 08:13:11;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command new 05/25/2012 08:13:17;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 7 (Premature end of message), type=Connect 05/25/2012 08:13:17;0080;PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @ 05/25/2012 08:13:17;0008;PBS_Server;Job;1472.herman.liai.org;Job Modified at request of maui at herman.liai.org 05/25/2012 08:13:17;0008;PBS_Server;Job;1472.herman.liai.org;Job Run at request of maui at herman.liai.org 05/25/2012 08:13:17;000d;PBS_Server;Job;1472.herman.liai.org;Not sending email: Default mailpoint does not include this type. 05/25/2012 08:13:18;000d;PBS_Server;Job;1472.herman.liai.org;Not sending email: Default mailpoint does not include this type. 
05/25/2012 08:13:18;0010;PBS_Server;Job;1472.herman.liai.org;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb reso urces_used.walltime=00:00:01 05/25/2012 08:13:18;0100;PBS_Server;Job;1472.herman.liai.org;dequeuing from default, state COMPLETE 05/25/2012 08:13:18;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command term 05/25/2012 08:13:18;0100;PBS_Server;Job;1473.herman.liai.org;enqueuing into default, state 1 hop 1 05/25/2012 08:13:18;0008;PBS_Server;Job;1473.herman.liai.org;Job Queued at request of galaxy at herman.liai.org, owner = galaxy at herman.liai.org, job name = 97 _solid2fastq_jgbaum at liai.org, queue = default 05/25/2012 08:13:18;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command new 05/25/2012 08:13:18;0100;PBS_Server;Job;1474.herman.liai.org;enqueuing into default, state 1 hop 1 05/25/2012 08:13:18;0008;PBS_Server;Job;1474.herman.liai.org;Job Queued at request of galaxy at herman.liai.org, owner = galaxy at herman.liai.org, job name = 98 _solid_qual_stats_jgbaum at liai.org, queue = default 05/25/2012 08:13:18;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command new 05/25/2012 08:13:18;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from galaxy at herman.liai.org 05/25/2012 08:13:18;0008;PBS_Server;Job;1473.herman.liai.org;Job Modified at request of maui at herman.liai.org 05/25/2012 08:13:18;0008;PBS_Server;Job;1473.herman.liai.org;Job Run at request of maui at herman.liai.org 05/25/2012 08:13:18;0008;PBS_Server;Job;1474.herman.liai.org;Job Modified at request of maui at herman.liai.org 05/25/2012 08:13:18;0008;PBS_Server;Job;1474.herman.liai.org;Job Run at request of maui at herman.liai.org 05/25/2012 08:13:18;000d;PBS_Server;Job;1473.herman.liai.org;Not sending email: Default mailpoint does not include this type. 
05/25/2012 08:13:18;000d;PBS_Server;Job;1474.herman.liai.org;Not sending email: Default mailpoint does not include this type. 05/25/2012 08:13:19;000d;PBS_Server;Job;1474.herman.liai.org;Not sending email: Default mailpoint does not include this type. 05/25/2012 08:13:19;0010;PBS_Server;Job;1474.herman.liai.org;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb reso urces_used.walltime=00:00:01 05/25/2012 08:13:19;0100;PBS_Server;Job;1474.herman.liai.org;dequeuing from default, state COMPLETE 05/25/2012 08:13:19;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command term 05/25/2012 08:13:19;000d;PBS_Server;Job;1473.herman.liai.org;Not sending email: Default mailpoint does not include this type. 05/25/2012 08:13:19;0010;PBS_Server;Job;1473.herman.liai.org;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb reso urces_used.walltime=00:00:01 05/25/2012 08:13:19;0100;PBS_Server;Job;1473.herman.liai.org;dequeuing from default, state COMPLETE 05/25/2012 08:13:19;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command term 05/25/2012 08:13:19;0100;PBS_Server;Job;1475.herman.liai.org;enqueuing into default, state 1 hop 1 05/25/2012 08:13:19;0008;PBS_Server;Job;1475.herman.liai.org;Job Queued at request of galaxy at herman.liai.org, owner = galaxy at herman.liai.org, job name = 99 _solid_qual_boxplot_jgbaum at liai.org, queue = default 05/25/2012 08:13:19;0008;PBS_Server;Job;1475.herman.liai.org;Job Modified at request of maui at herman.liai.org 05/25/2012 08:13:19;0008;PBS_Server;Job;1475.herman.liai.org;Job Run at request of maui at herman.liai.org 05/25/2012 08:13:19;000d;PBS_Server;Job;1475.herman.liai.org;Not sending email: Default mailpoint does not include this type. 05/25/2012 08:13:19;000d;PBS_Server;Job;1475.herman.liai.org;Not sending email: Default mailpoint does not include this type. 
05/25/2012 08:13:19;0010;PBS_Server;Job;1475.herman.liai.org;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb reso urces_used.walltime=00:00:00 05/25/2012 08:13:19;0100;PBS_Server;Job;1475.herman.liai.org;dequeuing from default, state COMPLETE 05/25/2012 08:13:19;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command term 05/25/2012 08:13:25;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 7 (Premature end of message), type=Connect 05/25/2012 08:13:25;0080;PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @ 05/25/2012 08:13:25;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from galaxy at herman.liai.org 05/25/2012 08:13:25;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from galaxy at herman.liai.org 05/25/2012 08:13:25;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from galaxy at herman.liai.org 05/25/2012 08:13:26;0100;PBS_Server;Job;1476.herman.liai.org;enqueuing into default, state 1 hop 1 05/25/2012 08:13:26;0008;PBS_Server;Job;1476.herman.liai.org;Job Queued at request of galaxy at herman.liai.org, owner = galaxy at herman.liai.org, job name = 10 2_solid_qual_boxplot_jgbaum at liai.org, queue = default 05/25/2012 08:13:26;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command new 05/25/2012 08:13:27;0008;PBS_Server;Job;1476.herman.liai.org;Job Modified at request of maui at herman.liai.org 05/25/2012 08:13:27;0008;PBS_Server;Job;1476.herman.liai.org;Job Run at request of maui at herman.liai.org 05/25/2012 08:13:27;000d;PBS_Server;Job;1476.herman.liai.org;Not sending email: Default mailpoint does not include this type. 05/25/2012 08:13:27;000d;PBS_Server;Job;1476.herman.liai.org;Not sending email: Default mailpoint does not include this type. 
05/25/2012 08:13:27;0010;PBS_Server;Job;1476.herman.liai.org;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb reso urces_used.walltime=00:00:00 05/25/2012 08:13:27;0100;PBS_Server;Job;1476.herman.liai.org;dequeuing from default, state COMPLETE 05/25/2012 08:13:27;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command term 05/25/2012 08:13:27;0100;PBS_Server;Job;1477.herman.liai.org;enqueuing into default, state 1 hop 1 05/25/2012 08:13:27;0008;PBS_Server;Job;1477.herman.liai.org;Job Queued at request of galaxy at herman.liai.org, owner = galaxy at herman.liai.org, job name = 10 4_solid_qual_boxplot_jgbaum at liai.org, queue = default 05/25/2012 08:13:27;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command new 05/25/2012 08:13:27;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from galaxy at herman.liai.org 05/25/2012 08:13:27;0008;PBS_Server;Job;1477.herman.liai.org;Job Modified at request of maui at herman.liai.org 05/25/2012 08:13:27;0008;PBS_Server;Job;1477.herman.liai.org;Job Run at request of maui at herman.liai.org 05/25/2012 08:13:28;000d;PBS_Server;Job;1477.herman.liai.org;Not sending email: Default mailpoint does not include this type. 05/25/2012 08:13:28;000d;PBS_Server;Job;1477.herman.liai.org;Not sending email: Default mailpoint does not include this type. 
05/25/2012 08:13:28;0010;PBS_Server;Job;1477.herman.liai.org;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00
05/25/2012 08:13:28;0100;PBS_Server;Job;1477.herman.liai.org;dequeuing from default, state COMPLETE
05/25/2012 08:13:28;0040;PBS_Server;Svr;herman.liai.org;Scheduler was sent the command term
05/25/2012 08:13:28;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from galaxy at herman.liai.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120525/17c562d6/attachment-0001.html

From danielfcoimbra at gmail.com Sat May 26 23:00:41 2012
From: danielfcoimbra at gmail.com (Daniel Fernando Coimbra)
Date: Sun, 27 May 2012 02:00:41 -0300
Subject: [torqueusers] NOT run in a node
Message-ID: <4FC1B4F9.60802@gmail.com>

We have a node here with less disk space than the others. This is usually not
a matter of concern; however, I'd like to be able to request that this node
NOT be used upon submission of a specific job. Is there any way to accomplish
that, other than giving all other nodes a feature and requesting it for all
jobs except these?

It would really be nice if one could use "-l nodes=1:!nodeX" or something
like that.
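For reference, the feature-based workaround mentioned above could look roughly
like the dry-run sketch below. It only prints the commands instead of running
them. The node names (node01, node02, nodeX), the feature name "bigdisk", the
script name job.sh and the helper feature_cmds are all hypothetical, and the
qmgr "properties +=" form should be checked against the documentation of the
installed TORQUE version before use.

```shell
# Usage: feature_cmds FEATURE EXCLUDED_NODE NODE...
# Prints one qmgr command per node that should receive the feature
# (every listed node except EXCLUDED_NODE), then a matching qsub line.
feature_cmds() {
    feature=$1
    excluded=$2
    shift 2
    for node in "$@"; do
        # Skip the node that must not run the job.
        [ "$node" = "$excluded" ] && continue
        echo "qmgr -c 'set node $node properties += $feature'"
    done
    echo "qsub -l nodes=1:$feature job.sh"
}

feature_cmds bigdisk nodeX node01 node02 nodeX
```

A job that requests the feature can then only land on the tagged nodes, so it
stays off the small-disk node; the cost is that every node apart from that one
has to be tagged.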
--
Daniel Fernando Coimbra
Grupo de Estrutura Eletrônica Molecular
Departamento de Química
Universidade Federal de Santa Catarina

From shantanugadgil at yahoo.com Mon May 28 02:10:19 2012
From: shantanugadgil at yahoo.com (Shantanu Gadgil)
Date: Mon, 28 May 2012 01:10:19 -0700 (PDT)
Subject: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED]
In-Reply-To:
Message-ID: <1338192619.41868.YahooMailClassic@web120601.mail.ne1.yahoo.com>

Hello Ken,

I was just presenting "what worked for me" :)

What I think should be happening is that the "configure" script should
correctly detect that the 64-bit "devel" packages ARE installed and look up
'tclConfig.sh', etc. from "/usr/lib64" rather than bailing with a "not found"
message.

Although I was able to dig into the build process a little bit, I am not sure
exactly what changes would be required to make the thing work!

From my side it was more of a "symptom and 'workaround'" kind of solution.

After "feeding" the system the 32-bit devel packages, I was thinking that
there would be a mix of 64-bit and 32-bit binaries somewhere (I have not
checked for these).

The correct thing would be that there would be only 64-bit binaries when
building on a 64-bit machine.

Regards,
Shantanu

--- On Fri, 5/25/12, Ken Nielson wrote:

From: Ken Nielson
Subject: Re: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED]
To: "Torque Users Mailing List"
Date: Friday, May 25, 2012, 8:13 PM

On Fri, May 25, 2012 at 1:56 AM, Shantanu Gadgil wrote:

Hi,

I was trying to build TORQUE 3.0.5 on CentOS x86_64 along with the "gui"
parameter; like so:

# rpmbuild --with gui -tb torque-3.0.5.tar.gz

this kept failing with the "Tcl was requested but not found" error.

I knew that I had all the necessary "tcl" and "tk" packages installed.
(I had /already/ installed ALL the packages myself)

# yum -y install "tcl-*" "tk-*"

The error was appearing in spite of this!!!

On a hunch I installed the "32 bit" devel packages to see if that helped:

# yum -y install "tcl-devel.i686" "tk-devel.i686"

Once the 32-bit packages and their dependencies were installed, I could get the GUI rpm built properly!!!

torque-3.0.5-1.cri.x86_64.rpm
torque-client-3.0.5-1.cri.x86_64.rpm
torque-debuginfo-3.0.5-1.cri.x86_64.rpm
torque-devel-3.0.5-1.cri.x86_64.rpm
torque-gui-3.0.5-1.cri.x86_64.rpm
torque-scheduler-3.0.5-1.cri.x86_64.rpm
torque-server-3.0.5-1.cri.x86_64.rpm

If this is not mentioned anywhere, maybe it can be added to the configuration/build notes ?!?

Thanks and regards,
Shantanu Gadgil

_______________________________________________

This is code we have not touched since I have been at Adaptive Computing. Is this something that should be updated to use newer 64 bit libraries? Or is just compiling against the 32 bit library enough?

Ken

-----Inline Attachment Follows-----

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120528/8afde8c9/attachment.html

From siegert at sfu.ca Mon May 28 13:29:54 2012
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 28 May 2012 12:29:54 -0700
Subject: [torqueusers] torque server: timing out connections
Message-ID: <20120528192954.GC20824@stikine.sfu.ca>

Hi,

I am having the following problem right now: the torque server became unresponsive (qstat and qsub time out) and I needed to restart it.
Now the server appears to go through every and each connection that was established before the restart and times it out: strace shows: connect(1092, {sa_family=AF_INET, sin_port=htons(15001), sin_addr=inet_addr("172.18.1.0")}, 16) = -1 EINPROGRESS (Operation now in progress) select(1093, NULL, [1092], NULL, {5, 0}) = 0 (Timeout) select(1093, NULL, [1092], NULL, {5, 0}) = 0 (Timeout) nanosleep({0, 1000000}, {0, 1000000}) = 0 close(1092) = 0 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 1092 fcntl(1092, F_GETFL) = 0x2 (flags O_RDWR) fcntl(1092, F_SETFL, O_RDWR|O_NONBLOCK) = 0 setsockopt(1092, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(1092, {sa_family=AF_INET, sin_port=htons(619), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 connect(1092, {sa_family=AF_INET, sin_port=htons(15001), sin_addr=inet_addr("172.18.1.0")}, 16) = -1 EINPROGRESS (Operation now in progress) select(1093, NULL, [1092], NULL, {5, 0}) = 0 (Timeout) select(1093, NULL, [1092], NULL, {5, 0}) = 0 (Timeout) nanosleep({0, 1000000}, {0, 1000000}) = 0 close(1092) = 0 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 1092 fcntl(1092, F_GETFL) = 0x2 (flags O_RDWR) fcntl(1092, F_SETFL, O_RDWR|O_NONBLOCK) = 0 setsockopt(1092, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(1092, {sa_family=AF_INET, sin_port=htons(620), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 connect(1092, {sa_family=AF_INET, sin_port=htons(15001), sin_addr=inet_addr("172.18.1.0")}, 16) = -1 EINPROGRESS (Operation now in progress) select(1093, NULL, [1092], NULL, {5, 0}) = 0 (Timeout) select(1093, NULL, [1092], NULL, {5, 0}) = 0 (Timeout) nanosleep({0, 1000000}, {0, 1000000}) = 0 close(1092) = 0 etc. This goes on for several hours already. Is there a way to speed this up? 

Cheers,
Martin

--
Martin Siegert
Simon Fraser University
Burnaby, British Columbia

From roy.dragseth at cc.uit.no Tue May 29 01:41:00 2012
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Tue, 29 May 2012 09:41 +0200
Subject: [torqueusers] torque server: timing out connections
In-Reply-To: <20120528192954.GC20824@stikine.sfu.ca>
References: <20120528192954.GC20824@stikine.sfu.ca>
Message-ID: <7197651.kavNGebP1U@newton.cc.uit.no>

On Monday, May 28, 2012 12:29:54 Martin Siegert wrote:
> 172.18.1.0

This is most likely the culprit; try rebooting this node. We see the same behaviour occasionally when nodes are partly up or there are problems with the connection to the node, so that you start to lose TCP packets.

r.

--

The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no

From dbeer at adaptivecomputing.com Tue May 29 09:37:59 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 29 May 2012 09:37:59 -0600
Subject: [torqueusers] NOT run in a node
In-Reply-To: <4FC1B4F9.60802@gmail.com>
References: <4FC1B4F9.60802@gmail.com>
Message-ID:

Daniel,

What scheduler are you using? Moab has a feature that works almost exactly like that (I don't know if it works in Maui or not).

David

On Sat, May 26, 2012 at 11:00 PM, Daniel Fernando Coimbra <danielfcoimbra at gmail.com> wrote:

> We have a node here with less disk space than others. This is usually
> not a matter of concern, however I'd like to be able to request that a
> this node NOT be used upon submission of a specific job, is there any
> way to accomplish that, other than giving all other nodes a feature and
> requesting it for all jobs except these?
> It would really be nice if one could use "-l nodes=1:!nodeX" os
> something like that.
> -- > Daniel Fernando Coimbra > Grupo de Estrutura Eletr?nica Molecular > Departamento de Qu?mica > Universidade Federal de Santa Catarina > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120529/8109e3f5/attachment.html From dbeer at adaptivecomputing.com Tue May 29 09:59:41 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Tue, 29 May 2012 09:59:41 -0600 Subject: [torqueusers] torque server: timing out connections In-Reply-To: <7197651.kavNGebP1U@newton.cc.uit.no> References: <20120528192954.GC20824@stikine.sfu.ca> <7197651.kavNGebP1U@newton.cc.uit.no> Message-ID: Martin, Is this a 4.0.* version of TORQUE? It is possible that you have found a case of deadlock. If you attach to pbs_server you should be able to determine this - gdb attach `pgrep pbs_server` Once the prompt comes up in gdb: info threads and you can see if you have a thread blocked on a mutex. There is a little bit more you can do to be 100% sure that it won't recover, but this can give a pretty good idea of things. David On Tue, May 29, 2012 at 1:41 AM, Roy Dragseth wrote: > On Monday, May 28, 2012 12:29:54 Martin Siegert wrote: > > 172.18.1.0 > > This is most likely the culprit, try to reboot this node. We see the same > behaviour occasionally when nodes are partly up or there are some problems > with the connection to the node so you start to loose tcp-packages. > > r. > > -- > > The Computer Center, University of Troms?, N-9037 TROMS? Norway. > phone:+47 77 64 41 07, fax:+47 77 64 41 00 > Roy Dragseth, Team Leader, High Performance Computing > Direct call: +47 77 64 62 56. 
email: roy.dragseth at uit.no > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120529/e1eb90d8/attachment-0001.html From danielfcoimbra at gmail.com Tue May 29 10:01:33 2012 From: danielfcoimbra at gmail.com (Daniel Fernando Coimbra) Date: Tue, 29 May 2012 13:01:33 -0300 Subject: [torqueusers] NOT run in a node In-Reply-To: References: <4FC1B4F9.60802@gmail.com> Message-ID: <4FC4F2DD.8050207@gmail.com> I use Maui. I could not find anything like that in the docs, but I will give it a second look at the Moab docs to try this out then. Thank you. On 29-05-2012 12:37, David Beer wrote: > Daniel, > > What scheduler are you using? Moab has a feature that works almost > exactly like that (I don't know if it works in Maui or not). > > David > > On Sat, May 26, 2012 at 11:00 PM, Daniel Fernando Coimbra > > wrote: > > We have a node here with less disk space than others. This is usually > not a matter of concern, however I'd like to be able to request that a > this node NOT be used upon submission of a specific job, is there any > way to accomplish that, other than giving all other nodes a feature and > requesting it for all jobs except these? > It would really be nice if one could use "-l nodes=1:!nodeX" os > something like that. 
> --
> Daniel Fernando Coimbra
> Grupo de Estrutura Eletrônica Molecular
> Departamento de Química
> Universidade Federal de Santa Catarina
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> --
> David Beer | Software Engineer
> Adaptive Computing
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

--
Daniel Fernando Coimbra
Grupo de Estrutura Eletrônica Molecular
Departamento de Química
Universidade Federal de Santa Catarina

From siegert at sfu.ca Tue May 29 11:30:50 2012
From: siegert at sfu.ca (Martin Siegert)
Date: Tue, 29 May 2012 10:30:50 -0700
Subject: [torqueusers] torque server: timing out connections
In-Reply-To: <7197651.kavNGebP1U@newton.cc.uit.no>
References: <20120528192954.GC20824@stikine.sfu.ca> <7197651.kavNGebP1U@newton.cc.uit.no>
Message-ID: <20120529173050.GA25034@stikine.sfu.ca>

Hi Roy,

sorry, I should have mentioned that 172.18.1.0 is the torque server itself. Rebooting it was the first thing we tried: absolutely no effect.
By now our system is down completely, with the server busy timing out connections to itself ...

Cheers,
Martin

On Tue, May 29, 2012 at 09:41:00AM +0200, Roy Dragseth wrote:
> On Monday, May 28, 2012 12:29:54 Martin Siegert wrote:
> > 172.18.1.0
>
> This is most likely the culprit, try to reboot this node. We see the same
> behaviour occasionally when nodes are partly up or there are some problems
> with the connection to the node so you start to loose tcp-packages.
>
> r.
>
> --
>
> The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
> phone:+47 77 64 41 07, fax:+47 77 64 41 00
> Roy Dragseth, Team Leader, High Performance Computing
> Direct call: +47 77 64 62 56.
email: roy.dragseth at uit.no
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From siegert at sfu.ca Tue May 29 11:50:55 2012
From: siegert at sfu.ca (Martin Siegert)
Date: Tue, 29 May 2012 10:50:55 -0700
Subject: [torqueusers] torque server: timing out connections
In-Reply-To:
References: <20120528192954.GC20824@stikine.sfu.ca> <7197651.kavNGebP1U@newton.cc.uit.no>
Message-ID: <20120529175055.GB25034@stikine.sfu.ca>

Hi David,

no, this is 2.5.11 (I am running 4.0.2 on our test system).
Unfortunately, this is the production system, and it is completely unresponsive by now. (I just created ticket #13100.)

Cheers,
Martin

On Tue, May 29, 2012 at 09:59:41AM -0600, David Beer wrote:
>
> Martin,
>
> Is this a 4.0.* version of TORQUE? It is possible that you have found a
> case of deadlock. If you attach to pbs_server you should be able to
> determine this -
>
> gdb attach `pgrep pbs_server`
>
> Once the prompt comes up in gdb:
>
> info threads
>
> and you can see if you have a thread blocked on a mutex. There is a
> little bit more you can do to be 100% sure that it won't recover, but
> this can give a pretty good idea of things.
>
> David

From dbeer at adaptivecomputing.com Tue May 29 14:09:57 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 29 May 2012 14:09:57 -0600
Subject: [torqueusers] torque server: timing out connections
In-Reply-To: <20120529175055.GB25034@stikine.sfu.ca>
References: <20120528192954.GC20824@stikine.sfu.ca> <7197651.kavNGebP1U@newton.cc.uit.no> <20120529175055.GB25034@stikine.sfu.ca>
Message-ID:

You should continue working with support on that, but I have an idea. We found an issue around 2.5.7 or so where there's a potential 4+ hour hang in pbs_server. Essentially, there exists a situation where pbs_server can retry about 880 ports when trying to connect to a host, each one with potentially over 18 seconds of wait time, resulting in over 4 hours of hang. This is a long-standing bug in TORQUE that is technically working as designed, so the fix is a configure option:

--with-tcp-retry-limit=5 (or some other value if you like). This tells pbs_server to just move on if it can't connect after trying 5 ports. You should be able to just replace the pbs_server binary for this one to see if it resolves it. I am suggesting this solution because at least 2 other sites had similar symptoms and this ended up being the problem.
On a side note, if you're wondering why this isn't the default, I suggested that we make this the default but I didn't get support from the user's list and therefore the change wasn't made. David On Tue, May 29, 2012 at 11:50 AM, Martin Siegert wrote: > Hi David, > > no this is 2.5.11 (I am running 4.0.2 on our test system). > Unfortunately this is the production system. And it is completely > unresponsive by now. (I just created a ticket #13100). > > Cheers, > Martin > > On Tue, May 29, 2012 at 09:59:41AM -0600, David Beer wrote: > > > > Martin, > > > > Is this a 4.0.* version of TORQUE? It is possible that you have found > a > > case of deadlock. If you attach to pbs_server you should be able to > > determine this - > > > > gdb attach `pgrep pbs_server` > > > > Once the prompt comes up in gdb: > > > > info threads > > > > and you can see if you have a thread blocked on a mutex. There is a > > little bit more you can do to be 100% sure that it won't recover, but > > this can give a pretty good idea of things. > > > > David > > On Tue, May 29, 2012 at 1:41 AM, Roy Dragseth > > <[1]roy.dragseth at cc.uit.no> wrote: > > > > On Monday, May 28, 2012 12:29:54 Martin Siegert wrote: > > > 172.18.1.0 > > This is most likely the culprit, try to reboot this node. We see > > the same > > behaviour occasionally when nodes are partly up or there are some > > problems > > with the connection to the node so you start to loose tcp-packages. > > r. > > -- > > The Computer Center, University of Troms?, N-9037 TROMS? Norway. > > phone:[2]+47 77 64 41 07, fax:[3]+47 77 64 41 00 > > Roy Dragseth, Team Leader, High Performance Computing > > Direct call: [4]+47 77 64 62 56. 
email: > > [5]roy.dragseth at uit.no > > > > _______________________________________________ > > torqueusers mailing list > > [6]torqueusers at supercluster.org > > [7]http://www.supercluster.org/mailman/listinfo/torqueusers > > > > -- > > David Beer | Software Engineer > > Adaptive Computing > > > > References > > > > 1. mailto:roy.dragseth at cc.uit.no > > 2. tel:%2B47%2077%2064%2041%2007 > > 3. tel:%2B47%2077%2064%2041%2000 > > 4. tel:%2B47%2077%2064%2062%2056 > > 5. mailto:roy.dragseth at uit.no > > 6. mailto:torqueusers at supercluster.org > > 7. http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120529/d95a7dc3/attachment.html From siegert at sfu.ca Tue May 29 14:36:15 2012 From: siegert at sfu.ca (Martin Siegert) Date: Tue, 29 May 2012 13:36:15 -0700 Subject: [torqueusers] torque server: timing out connections In-Reply-To: References: <20120528192954.GC20824@stikine.sfu.ca> <7197651.kavNGebP1U@newton.cc.uit.no> <20120529175055.GB25034@stikine.sfu.ca> Message-ID: <20120529203615.GA31910@stikine.sfu.ca> Hi David, I will definitely add --with-tcp-retry-limit=5 to my configure options, since we did run into exactly that situation. 

However, the current situation is due to an IP mismatch between the private and public IP addresses of the torque server:

svr_connect.c, line 172

  if ((hostaddr == pbs_server_addr) && (port == pbs_server_port_dis))
    {
    return(PBS_LOCAL_CONNECTION); /* special value for local */
    }

In our case: hostaddr = 172.18.1.0 and pbs_server_addr = 206.12.24.2.
The former is the (correct) IP address on the internal cluster network; the latter is the public IP address and should not be used by torque anywhere.

We have in /etc/hosts
172.18.1.0 b0
and then set the server name in 4 (!!) different places:
1) in qmgr we have
   set server server_name = b0
2) /var/spool/torque/server_name contains b0
3) /var/spool/torque/torque.cfg contains
   SERVERHOST b0
4) we configure with --with-default-server=b0

I always thought that it should be sufficient to set this once. Obviously I am wrong ... I am missing at least a fifth spot where I need to set this: how do I get the torque server to set pbs_server_addr in svr_connect to 172.18.1.0?

For now we used the following workaround:
1) in /etc/hosts set
   172.18.1.0 hostname.domain.ca hostname b0
2) restart torque server and wait a few seconds until qstat, etc. responds.
3) change /etc/hosts back to
   172.18.1.0 b0

This does "solve" the problem for now. I am still looking for a more permanent solution.

Cheers,
Martin

On Tue, May 29, 2012 at 02:09:57PM -0600, David Beer wrote:
>
> You should continue working with support on that, but I have an idea.
> We found an issue around 2.5.7 or so where there's a potential 4+ hour
> hang in pbs_server. Essentially, there exists a situation where
> pbs_server can retry about 880 ports when trying to connect to a host,
> each one with potentially over 18 seconds of wait time, resulting in
> over 4 hours of hang. This is a long standing bug in TORQUE that is
> technically working as designed, so the fix is a configure option:
>
> --with-tcp-retry-limit=5 (or some other value if you like). This says
> that if it can't connect after trying 5 ports then to just move on. You
> should be able to just replace the pbs_server binary for this one to
> see if it resolves it. I am suggesting this solution because at least 2
> other sites had similar symptoms and this ended up being the problem.
>
> On a side note, if you're wondering why this isn't the default, I
> suggested that we make this the default but I didn't get support from
> the user's list and therefore the change wasn't made.
>
> David

From mej at lbl.gov Tue May 29 17:06:32 2012
From: mej at lbl.gov (Michael Jennings)
Date: Tue, 29 May 2012 16:06:32 -0700
Subject: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED]
In-Reply-To: <1337932610.72446.YahooMailClassic@web120602.mail.ne1.yahoo.com>
References: <1337932610.72446.YahooMailClassic@web120602.mail.ne1.yahoo.com>
Message-ID: <20120529230631.GL26901@lbl.gov>

On Friday, 25 May 2012, at 00:56:50 (-0700),
Shantanu Gadgil wrote:

> I was trying to build TORQUE 3.0.5 on CentOS x86_64 along with the "gui" parameter; like so:
> # rpmbuild --with gui -tb torque-3.0.5.tar.gz
>
> this kept failing with the "Tcl was requested but not found" error.
>
> I knew that I had all the necessary "tcl" and "tk" packages installed.
> (I had /already/ installed ALL the packages myself)
>
> # yum -y install "tcl-*" "tk-*"
>
> The error was appearing in spite of this!!!
>
> On a hunch I installed the "32 bit" devel packages to see if that helped:
>
> # yum -y install "tcl-devel.i686" "tk-devel.i686"
>
> Once the 32bitse and their dependencies were installed, I could get the GUI rpm built properly!!!
>
> torque-3.0.5-1.cri.x86_64.rpm
> torque-client-3.0.5-1.cri.x86_64.rpm
> torque-debuginfo-3.0.5-1.cri.x86_64.rpm
> torque-devel-3.0.5-1.cri.x86_64.rpm
> torque-gui-3.0.5-1.cri.x86_64.rpm
> torque-scheduler-3.0.5-1.cri.x86_64.rpm
> torque-server-3.0.5-1.cri.x86_64.rpm
>
> If this is not mentioned anywhere, maybe can be added to the configuration/build notes ?!?

I cannot reproduce this with either 2.5.11 or 4.0.2. Both build fine with no 32-bit tcl or tk packages installed on both RHEL5 and RHEL6.

Have you tried building directly from the tarball instead of building RPMs?

Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From shantanugadgil at yahoo.com Tue May 29 22:07:35 2012
From: shantanugadgil at yahoo.com (Shantanu Gadgil)
Date: Tue, 29 May 2012 21:07:35 -0700 (PDT)
Subject: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED]
In-Reply-To: <20120529230631.GL26901@lbl.gov>
Message-ID: <1338350855.72099.YahooMailClassic@web120604.mail.ne1.yahoo.com>

Hello Michael,

Extracting the tar and then building ".sh" install packages works fine!

The problem with using the .sh based installers is that they overwrite the config file(s): 'server_priv/nodes', 'mom_priv/config', etc.
Not /really/ a big problem; one just has to be careful while "upgrading" any system!

* With the RPM, a ".rpmnew" file is created rather than overwriting the previous one.

* Also, if I untar, configure, make and then 'make rpm', I hit the same 'Tcl not found ...' error.

Regards,
Shantanu

--- On Wed, 5/30/12, Michael Jennings wrote:

> From: Michael Jennings
> Subject: Re: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED]
> To: torqueusers at supercluster.org
> Date: Wednesday, May 30, 2012, 4:36 AM
>
> I cannot reproduce this with either 2.5.11 or 4.0.2. Both build fine
> with no 32-bit tcl or tk packages installed on both RHEL5 and RHEL6.
>
> Have you tried building directly from the tarball instead of
> building RPMs?
>
> Michael

From mej at lbl.gov Wed May 30 15:48:48 2012
From: mej at lbl.gov (Michael Jennings)
Date: Wed, 30 May 2012 14:48:48 -0700
Subject: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED]
In-Reply-To: <1338350855.72099.YahooMailClassic@web120604.mail.ne1.yahoo.com>
References: <20120529230631.GL26901@lbl.gov> <1338350855.72099.YahooMailClassic@web120604.mail.ne1.yahoo.com>
Message-ID: <20120530214847.GU26901@lbl.gov>

On Tuesday, 29 May 2012, at 21:07:35 (-0700),
Shantanu Gadgil wrote:

> Extracting the tar and then building ".sh" install packages works fine!

Even if you specify --enable-gui --with-tcl when invoking ./configure?

> The problem with using the .sh based installers is that they
> overwrite the config file(s); 'server_priv/nodes',
> 'mom_priv/config', etc.
Trust me, I was not intending to argue the merits of RPM
installations. I install RPMs too. :-)

Just trying to figure out why it works fine for me but doesn't work
for you.

> * Also, if I untar, configure, make and then 'make rpm', I hit the
> same 'TCl not found ...' error.

That's odd. The GUI options are not passed to "make rpm" AFAIK, so it
shouldn't be trying to build the torque-gui RPM at all in that case.

Michael

-- 
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From shantanugadgil at yahoo.com Wed May 30 23:58:08 2012
From: shantanugadgil at yahoo.com (Shantanu Gadgil)
Date: Wed, 30 May 2012 22:58:08 -0700 (PDT)
Subject: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED]
In-Reply-To: <20120530214847.GU26901@lbl.gov>
Message-ID: <1338443888.67579.YahooMailClassic@web120603.mail.ne1.yahoo.com>

Hello,

--- On Thu, 5/31/12, Michael Jennings wrote:

> From: Michael Jennings
> Subject: Re: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED]

> Even if you specify --enable-gui --with-tcl when invoking
> ./configure?

Shantanu: Yes, I use "./configure --prefix=/usr --enable-gui --with-tcl --with-tk"
('/usr' is just my preference, that's all)

> > * Also, if I untar, configure, make and then 'make rpm', I hit the
> > same 'TCl not found ...' error.
>
> That's odd. The GUI options are not passed to "make rpm" AFAIK, so it
> shouldn't be trying to build the torque-gui RPM at all in that case.

Sorry for the confusion; when creating the .sh based installers, I do give the params as mentioned above.
So the params on my end would be:

# ./configure --prefix=/usr --enable-gui --with-tcl --with-tk && make && make rpm

Regards,
Shantanu

From shantanugadgil at yahoo.com Thu May 31 00:03:56 2012
From: shantanugadgil at yahoo.com (Shantanu Gadgil)
Date: Wed, 30 May 2012 23:03:56 -0700 (PDT)
Subject: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED]
In-Reply-To: <20120530214847.GU26901@lbl.gov>
Message-ID: <1338444236.91765.YahooMailClassic@web120602.mail.ne1.yahoo.com>

Sorry for being click happy in the last posting ...

When I execute the command:

# ./configure --prefix=/usr --enable-gui --with-tcl --with-tk && make

... and then ...

# make rpm

I expected that the gui RPM would get created, but it was never created!!!
(It makes sense now, from your explanation, that the GUI options are not passed to "make rpm".)

From walid.shaari at gmail.com Thu May 31 07:11:29 2012
From: walid.shaari at gmail.com (Walid)
Date: Thu, 31 May 2012 16:11:29 +0300
Subject: [torqueusers] Is there any statistic on how many HPC clusters throughout the world use Torque ?
In-Reply-To:
References:
Message-ID:

We have recently moved from Torque/Maui to UNIVA. You will find lots of
academic centers using either Torque (mostly), SGE, or Condor; enterprises
tend to use either Moab, LSF, or UGE.

On 24 May 2012 00:07, Arka Aloke Bhattacharya wrote:

> Hi,
>
> I was just making a few slides on some feature I added to Torque.
> However, I could not find any useful data on how many universities/HPC
> clusters use Torque as their HPC resource manager.
>
> Is there any website which could provide some statistics on this ?
>
> Arka.

From kunalgrao at gmail.com Thu May 31 08:53:04 2012
From: kunalgrao at gmail.com (Kunal Rao)
Date: Thu, 31 May 2012 10:53:04 -0400
Subject: [torqueusers] Fwd: Multi-req job not starting
In-Reply-To:
References:
Message-ID:

Hello,

Please see the message below. I had posted it on the maui users mailing list,
but did not get any response, so I thought of posting it here on the torque
users mailing list (in case someone would know).
Kindly let me know if you have any comments / ideas / suggestions.

Thanks,
Kunal

---------- Forwarded message ----------
From: Kunal Rao
Date: Wed, May 23, 2012 at 2:30 PM
Subject: Re: Multi-req job not starting
To: mauiusers at supercluster.org

There was a similar post earlier :
http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html

But did not find any response to it. Can anyone please provide some ideas
/ suggestion on this issue.

Thanks,
Kunal

On Wed, May 23, 2012 at 2:26 PM, Kunal Rao wrote:

> Hello,
>
> I have a 10 node cluster. There are 3 jobs. 1 which needs 2 nodes ( with 1
> task per node ), another which needs 4 nodes (with 1 task per node) and the
> third one which needs 4 nodes ( with 2 task on 1 node and 1 task each on
> the other 3 nodes ).
>
> Additional configuration in maui.cfg is :
>
> BACKFILLPOLICY FIRSTFIT
> RESERVATIONPOLICY CURRENTHIGHEST
>
> ENABLEMULTIREQJOBS TRUE
> NODEALLOCATIONPOLICY MINRESOURCE
> NODEACCESSPOLICY SINGLEJOB
> JOBNODEMATCHPOLICY EXACTNODE
>
> I am observing that if the first 2 jobs are running, the third one does
> not start ( even though 4 nodes are available ) until 1 of the jobs
> complete. With checkjob -v it shows the following output :
>
> ------------------
>
> checking job 5791 (RM job '5791.fire16.csa.local')
>
> State: Idle
> Creds: user:kunal group:kunal class:batch qos:DEFAULT
> WallTime: 00:00:00 of 00:04:51
> SubmitTime: Wed May 23 11:52:04
> (Time Queued Total: 00:48:52 Eligible: 00:48:52)
>
> StartDate: 00:00:01 Wed May 23 12:40:57
> Total Tasks: 2
>
> Req[0] TaskCount: 2 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> Exec: '' ExecSize: 0 ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1
> NodeAccess: SINGLEJOB
> TasksPerNode: 2 NodeCount: 1
>
> Req[1] TaskCount: 3 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> Exec: '' ExecSize: 0 ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1
> NodeAccess: SINGLEJOB
> NodeCount: 3
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 5 StartCount: 0
> PartitionMask: [ALL]
> Flags: RESTARTABLE
>
> Reservation '5791' (00:00:01 -> 00:04:52 Duration: 00:04:51)
> PE: 5.00 StartPriority: 48
> cannot select job 5791 for partition DEFAULT (startdate in '00:00:01')
>
> ------------
>
> What could be the reason for not starting this job ? How do I resolve this
> ?
>
> Thanks,
> Kunal

From gas5x at yahoo.com Thu May 31 14:28:01 2012
From: gas5x at yahoo.com (Grigory Shamov)
Date: Thu, 31 May 2012 13:28:01 -0700 (PDT)
Subject: [torqueusers] checkpointable jobs lose environment variables?
Message-ID: <1338496081.8869.YahooMailClassic@web111313.mail.gq1.yahoo.com>

Hi All,

I have tried to install the BLCR checkpoint/restart (0.8.4) -enabled Torque (2.5.11), on a few old CentOS 5 machines we have (kernels 2.6.18.308, 2.6.18.194).
I have built Torque with --enable-blcr switch, and the BLCR was installed as a system RPM (to /usr/bin etc.). The simple seconds-counting test seem to work. However, an user application test failed, the reason being unaccessible environment modules. I've checked with 'env' command and found, that while normal 'qsub' passes all the environment, 'qsub -c' does not. The job script was really minimal. #!/bin/bash #PBS -N test #PBS -l procs=2,walltime=21:10,mem=2mb #PBS -r y #PBS -S /bin/bash # env cd $PBS_O_WORKDIR ./test.x # done Results of 'env' differ, that for 'qsub -c' almost only $PBS_* things are passed, while for 'qsub' there would be everything. Could you please tell whether it is a desired behaviour or a bug, or is there a way to pass environment explicitly for 'qsub -c'? Thank you very much! -- Grigory Shamov HPC Analyst, University of Manitoba From jujj603 at gmail.com Thu May 31 18:31:45 2012 From: jujj603 at gmail.com (Ju JiaJia) Date: Fri, 1 Jun 2012 08:31:45 +0800 Subject: [torqueusers] Fwd: Multi-req job not starting In-Reply-To: References: Message-ID: Please give your queue/server configuration and your job's resources need, cpu/memory etc. And Does all the 10 nodes accessable? You can use pbsnodes to check this. On Thu, May 31, 2012 at 10:53 PM, Kunal Rao wrote: > Hello, > > Please see the below message. I had posted it on maui users mailing list, > but did not get any response, so thought of posting it here on torque users > mailing list (incase someone would know). Kindly let me know if you have > any comments / ideas / suggestions. > > Thanks, > Kunal > > ---------- Forwarded message ---------- > From: Kunal Rao > Date: Wed, May 23, 2012 at 2:30 PM > Subject: Re: Multi-req job not starting > To: mauiusers at supercluster.org > > > There was a similar post earlier : > http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html > > But did not find any response to it. Can anyone please provide some ideas > / suggestion on this issue. 
> > Thanks, > Kunal > > > On Wed, May 23, 2012 at 2:26 PM, Kunal Rao wrote: > >> Hello, >> >> I have a 10 node cluster. There are 3 jobs. 1 which needs 2 nodes ( with >> 1 task per node ), another which needs 4 nodes (with 1 task per node) and >> the third one which needs 4 nodes ( with 2 task on 1 node and 1 task each >> on the other 3 nodes ). >> >> Additional configuration in maui.cfg is : >> >> BACKFILLPOLICY FIRSTFIT >> RESERVATIONPOLICY CURRENTHIGHEST >> >> ENABLEMULTIREQJOBS TRUE >> NODEALLOCATIONPOLICY MINRESOURCE >> NODEACCESSPOLICY SINGLEJOB >> JOBNODEMATCHPOLICY EXACTNODE >> >> I am observing that if the first 2 jobs are running, the third one does >> not start ( even though 4 nodes are available ) until 1 of the jobs >> complete. With checkjob -v it shows the following output : >> >> ------------------ >> >> checking job 5791 (RM job '5791.fire16.csa.local') >> >> State: Idle >> Creds: user:kunal group:kunal class:batch qos:DEFAULT >> WallTime: 00:00:00 of 00:04:51 >> SubmitTime: Wed May 23 11:52:04 >> (Time Queued Total: 00:48:52 Eligible: 00:48:52) >> >> StartDate: 00:00:01 Wed May 23 12:40:57 >> Total Tasks: 2 >> >> Req[0] TaskCount: 2 Partition: ALL >> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 >> Opsys: [NONE] Arch: [NONE] Features: [NONE] >> Exec: '' ExecSize: 0 ImageSize: 0 >> Dedicated Resources Per Task: PROCS: 1 >> NodeAccess: SINGLEJOB >> TasksPerNode: 2 NodeCount: 1 >> >> Req[1] TaskCount: 3 Partition: ALL >> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 >> Opsys: [NONE] Arch: [NONE] Features: [NONE] >> Exec: '' ExecSize: 0 ImageSize: 0 >> Dedicated Resources Per Task: PROCS: 1 >> NodeAccess: SINGLEJOB >> NodeCount: 3 >> >> >> IWD: [NONE] Executable: [NONE] >> Bypass: 5 StartCount: 0 >> PartitionMask: [ALL] >> Flags: RESTARTABLE >> >> Reservation '5791' (00:00:01 -> 00:04:52 Duration: 00:04:51) >> PE: 5.00 StartPriority: 48 >> cannot select job 5791 for partition DEFAULT (startdate in '00:00:01') >> >> ------------ >> >> What could be 
the reason for not starting this job ? How do I resolve >> this ? >> >> Thanks, >> Kunal >> > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120601/a6e2e44b/attachment.html From mej at lbl.gov Thu May 31 18:38:57 2012 From: mej at lbl.gov (Michael Jennings) Date: Thu, 31 May 2012 17:38:57 -0700 Subject: [torqueusers] rpmbuild TORQUE 3.0.5 with GUI problem on x86_64 [SOLVED] In-Reply-To: <1338443888.67579.YahooMailClassic@web120603.mail.ne1.yahoo.com> References: <20120530214847.GU26901@lbl.gov> <1338443888.67579.YahooMailClassic@web120603.mail.ne1.yahoo.com> Message-ID: <20120601003856.GE26901@lbl.gov> On Wednesday, 30 May 2012, at 22:58:08 (-0700), Shantanu Gadgil wrote: > Yes, I use "./configure --prefix=/usr --enable-gui --with-tcl --with-tk" > > ('/usr' is just my preference, that's all) Can you send the output of the "rpmbuild" that fails? Or pastebin it somewhere and send the URL? Thanks, Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From kunalgrao at gmail.com Thu May 31 18:54:53 2012 From: kunalgrao at gmail.com (Kunal Rao) Date: Thu, 31 May 2012 20:54:53 -0400 Subject: [torqueusers] Fwd: Multi-req job not starting In-Reply-To: References: Message-ID: Queue / Server configuration : --------------- qmgr -c 'p s' # # Create queues and set their attributes. # # # Create and define queue batch # create queue batch set queue batch queue_type = Execution set queue batch resources_default.nodes = 1 set queue batch resources_default.walltime = 01:00:00 set queue batch enabled = True set queue batch started = True # # Set server attributes. 
# set server scheduling = True set server acl_hosts = fire16 set server acl_roots = root at fire16.csa.local set server managers = root at fire16.csa.local set server operators = root at fire16.csa.local set server default_queue = batch set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 20 set server node_check_rate = 150 set server tcp_timeout = 6 set server mom_job_sync = True set server keep_completed = 300 set server allow_node_submit = True set server next_job_number = 6331 --------------- Job resource requirement : --------- #PBS -l nodes=1:ppn=2+3,walltime=0:05:00 --------- "pbsnodes -a" shows all the 10 nodes in "free" state. So, they are all accessible. Thanks, Kunal On 5/31/12, Ju JiaJia wrote: > Please give your queue/server configuration and your job's resources need, > cpu/memory etc. And Does all the 10 nodes accessable? You can use pbsnodes > to check this. > > On Thu, May 31, 2012 at 10:53 PM, Kunal Rao wrote: > >> Hello, >> >> Please see the below message. I had posted it on maui users mailing list, >> but did not get any response, so thought of posting it here on torque >> users >> mailing list (incase someone would know). Kindly let me know if you have >> any comments / ideas / suggestions. >> >> Thanks, >> Kunal >> >> ---------- Forwarded message ---------- >> From: Kunal Rao >> Date: Wed, May 23, 2012 at 2:30 PM >> Subject: Re: Multi-req job not starting >> To: mauiusers at supercluster.org >> >> >> There was a similar post earlier : >> http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html >> >> But did not find any response to it. Can anyone please provide some ideas >> / suggestion on this issue. >> >> Thanks, >> Kunal >> >> >> On Wed, May 23, 2012 at 2:26 PM, Kunal Rao wrote: >> >>> Hello, >>> >>> I have a 10 node cluster. There are 3 jobs. 
1 which needs 2 nodes ( with >>> 1 task per node ), another which needs 4 nodes (with 1 task per node) >>> and >>> the third one which needs 4 nodes ( with 2 task on 1 node and 1 task >>> each >>> on the other 3 nodes ). >>> >>> Additional configuration in maui.cfg is : >>> >>> BACKFILLPOLICY FIRSTFIT >>> RESERVATIONPOLICY CURRENTHIGHEST >>> >>> ENABLEMULTIREQJOBS TRUE >>> NODEALLOCATIONPOLICY MINRESOURCE >>> NODEACCESSPOLICY SINGLEJOB >>> JOBNODEMATCHPOLICY EXACTNODE >>> >>> I am observing that if the first 2 jobs are running, the third one does >>> not start ( even though 4 nodes are available ) until 1 of the jobs >>> complete. With checkjob -v it shows the following output : >>> >>> ------------------ >>> >>> checking job 5791 (RM job '5791.fire16.csa.local') >>> >>> State: Idle >>> Creds: user:kunal group:kunal class:batch qos:DEFAULT >>> WallTime: 00:00:00 of 00:04:51 >>> SubmitTime: Wed May 23 11:52:04 >>> (Time Queued Total: 00:48:52 Eligible: 00:48:52) >>> >>> StartDate: 00:00:01 Wed May 23 12:40:57 >>> Total Tasks: 2 >>> >>> Req[0] TaskCount: 2 Partition: ALL >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 >>> Opsys: [NONE] Arch: [NONE] Features: [NONE] >>> Exec: '' ExecSize: 0 ImageSize: 0 >>> Dedicated Resources Per Task: PROCS: 1 >>> NodeAccess: SINGLEJOB >>> TasksPerNode: 2 NodeCount: 1 >>> >>> Req[1] TaskCount: 3 Partition: ALL >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 >>> Opsys: [NONE] Arch: [NONE] Features: [NONE] >>> Exec: '' ExecSize: 0 ImageSize: 0 >>> Dedicated Resources Per Task: PROCS: 1 >>> NodeAccess: SINGLEJOB >>> NodeCount: 3 >>> >>> >>> IWD: [NONE] Executable: [NONE] >>> Bypass: 5 StartCount: 0 >>> PartitionMask: [ALL] >>> Flags: RESTARTABLE >>> >>> Reservation '5791' (00:00:01 -> 00:04:52 Duration: 00:04:51) >>> PE: 5.00 StartPriority: 48 >>> cannot select job 5791 for partition DEFAULT (startdate in '00:00:01') >>> >>> ------------ >>> >>> What could be the reason for not starting this job ? How do I resolve >>> this ? 
>>> >>> Thanks, >>> Kunal >>> >> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> > From jujj603 at gmail.com Thu May 31 19:54:59 2012 From: jujj603 at gmail.com (Ju JiaJia) Date: Fri, 1 Jun 2012 09:54:59 +0800 Subject: [torqueusers] Fwd: Multi-req job not starting In-Reply-To: References: Message-ID: How many cores on each of the 10 nodes ? I mean you are trying to allocate 2 processors on one node. And how did you configure TORQUE_HOME/server_priv/nodes ? On Fri, Jun 1, 2012 at 8:54 AM, Kunal Rao wrote: > Queue / Server configuration : > > --------------- > > qmgr -c 'p s' > # > # Create queues and set their attributes. > # > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 01:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Set server attributes. > # > set server scheduling = True > set server acl_hosts = fire16 > set server acl_roots = root at fire16.csa.local > set server managers = root at fire16.csa.local > set server operators = root at fire16.csa.local > set server default_queue = batch > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 20 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server mom_job_sync = True > set server keep_completed = 300 > set server allow_node_submit = True > set server next_job_number = 6331 > > --------------- > > Job resource requirement : > > --------- > > #PBS -l nodes=1:ppn=2+3,walltime=0:05:00 > > --------- > > "pbsnodes -a" shows all the 10 nodes in "free" state. So, they are all > accessible. > > Thanks, > Kunal > > > On 5/31/12, Ju JiaJia wrote: > > Please give your queue/server configuration and your job's resources > need, > > cpu/memory etc. 
And Does all the 10 nodes accessable? You can use > pbsnodes > > to check this. > > > > On Thu, May 31, 2012 at 10:53 PM, Kunal Rao wrote: > > > >> Hello, > >> > >> Please see the below message. I had posted it on maui users mailing > list, > >> but did not get any response, so thought of posting it here on torque > >> users > >> mailing list (incase someone would know). Kindly let me know if you have > >> any comments / ideas / suggestions. > >> > >> Thanks, > >> Kunal > >> > >> ---------- Forwarded message ---------- > >> From: Kunal Rao > >> Date: Wed, May 23, 2012 at 2:30 PM > >> Subject: Re: Multi-req job not starting > >> To: mauiusers at supercluster.org > >> > >> > >> There was a similar post earlier : > >> > http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html > >> > >> But did not find any response to it. Can anyone please provide some > ideas > >> / suggestion on this issue. > >> > >> Thanks, > >> Kunal > >> > >> > >> On Wed, May 23, 2012 at 2:26 PM, Kunal Rao wrote: > >> > >>> Hello, > >>> > >>> I have a 10 node cluster. There are 3 jobs. 1 which needs 2 nodes ( > with > >>> 1 task per node ), another which needs 4 nodes (with 1 task per node) > >>> and > >>> the third one which needs 4 nodes ( with 2 task on 1 node and 1 task > >>> each > >>> on the other 3 nodes ). > >>> > >>> Additional configuration in maui.cfg is : > >>> > >>> BACKFILLPOLICY FIRSTFIT > >>> RESERVATIONPOLICY CURRENTHIGHEST > >>> > >>> ENABLEMULTIREQJOBS TRUE > >>> NODEALLOCATIONPOLICY MINRESOURCE > >>> NODEACCESSPOLICY SINGLEJOB > >>> JOBNODEMATCHPOLICY EXACTNODE > >>> > >>> I am observing that if the first 2 jobs are running, the third one does > >>> not start ( even though 4 nodes are available ) until 1 of the jobs > >>> complete. 
With checkjob -v it shows the following output : > >>> > >>> ------------------ > >>> > >>> checking job 5791 (RM job '5791.fire16.csa.local') > >>> > >>> State: Idle > >>> Creds: user:kunal group:kunal class:batch qos:DEFAULT > >>> WallTime: 00:00:00 of 00:04:51 > >>> SubmitTime: Wed May 23 11:52:04 > >>> (Time Queued Total: 00:48:52 Eligible: 00:48:52) > >>> > >>> StartDate: 00:00:01 Wed May 23 12:40:57 > >>> Total Tasks: 2 > >>> > >>> Req[0] TaskCount: 2 Partition: ALL > >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 > >>> Opsys: [NONE] Arch: [NONE] Features: [NONE] > >>> Exec: '' ExecSize: 0 ImageSize: 0 > >>> Dedicated Resources Per Task: PROCS: 1 > >>> NodeAccess: SINGLEJOB > >>> TasksPerNode: 2 NodeCount: 1 > >>> > >>> Req[1] TaskCount: 3 Partition: ALL > >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 > >>> Opsys: [NONE] Arch: [NONE] Features: [NONE] > >>> Exec: '' ExecSize: 0 ImageSize: 0 > >>> Dedicated Resources Per Task: PROCS: 1 > >>> NodeAccess: SINGLEJOB > >>> NodeCount: 3 > >>> > >>> > >>> IWD: [NONE] Executable: [NONE] > >>> Bypass: 5 StartCount: 0 > >>> PartitionMask: [ALL] > >>> Flags: RESTARTABLE > >>> > >>> Reservation '5791' (00:00:01 -> 00:04:52 Duration: 00:04:51) > >>> PE: 5.00 StartPriority: 48 > >>> cannot select job 5791 for partition DEFAULT (startdate in '00:00:01') > >>> > >>> ------------ > >>> > >>> What could be the reason for not starting this job ? How do I resolve > >>> this ? > >>> > >>> Thanks, > >>> Kunal > >>> > >> > >> > >> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > >> > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120601/fbeab934/attachment.html From kunalgrao at gmail.com Thu May 31 19:59:33 2012 From: kunalgrao at gmail.com (Kunal Rao) Date: Thu, 31 May 2012 21:59:33 -0400 Subject: [torqueusers] Fwd: Multi-req job not starting In-Reply-To: References: Message-ID: Each node has 16 cores. TORQUE_HOME/sever_priv/nodes file has for each of the 10 nodes : np=16 gpus=1 Thanks, Kunal On Thu, May 31, 2012 at 9:54 PM, Ju JiaJia wrote: > How many cores on each of the 10 nodes ? I mean you are trying to allocate > 2 processors on one node. And how did you > configure TORQUE_HOME/server_priv/nodes ? > > > On Fri, Jun 1, 2012 at 8:54 AM, Kunal Rao wrote: > >> Queue / Server configuration : >> >> --------------- >> >> qmgr -c 'p s' >> # >> # Create queues and set their attributes. >> # >> # >> # Create and define queue batch >> # >> create queue batch >> set queue batch queue_type = Execution >> set queue batch resources_default.nodes = 1 >> set queue batch resources_default.walltime = 01:00:00 >> set queue batch enabled = True >> set queue batch started = True >> # >> # Set server attributes. >> # >> set server scheduling = True >> set server acl_hosts = fire16 >> set server acl_roots = root at fire16.csa.local >> set server managers = root at fire16.csa.local >> set server operators = root at fire16.csa.local >> set server default_queue = batch >> set server log_events = 511 >> set server mail_from = adm >> set server scheduler_iteration = 20 >> set server node_check_rate = 150 >> set server tcp_timeout = 6 >> set server mom_job_sync = True >> set server keep_completed = 300 >> set server allow_node_submit = True >> set server next_job_number = 6331 >> >> --------------- >> >> Job resource requirement : >> >> --------- >> >> #PBS -l nodes=1:ppn=2+3,walltime=0:05:00 >> >> --------- >> >> "pbsnodes -a" shows all the 10 nodes in "free" state. So, they are all >> accessible. 
>> >> Thanks, >> Kunal >> >> >> On 5/31/12, Ju JiaJia wrote: >> > Please give your queue/server configuration and your job's resources >> need, >> > cpu/memory etc. And Does all the 10 nodes accessable? You can use >> pbsnodes >> > to check this. >> > >> > On Thu, May 31, 2012 at 10:53 PM, Kunal Rao >> wrote: >> > >> >> Hello, >> >> >> >> Please see the below message. I had posted it on maui users mailing >> list, >> >> but did not get any response, so thought of posting it here on torque >> >> users >> >> mailing list (incase someone would know). Kindly let me know if you >> have >> >> any comments / ideas / suggestions. >> >> >> >> Thanks, >> >> Kunal >> >> >> >> ---------- Forwarded message ---------- >> >> From: Kunal Rao >> >> Date: Wed, May 23, 2012 at 2:30 PM >> >> Subject: Re: Multi-req job not starting >> >> To: mauiusers at supercluster.org >> >> >> >> >> >> There was a similar post earlier : >> >> >> http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html >> >> >> >> But did not find any response to it. Can anyone please provide some >> ideas >> >> / suggestion on this issue. >> >> >> >> Thanks, >> >> Kunal >> >> >> >> >> >> On Wed, May 23, 2012 at 2:26 PM, Kunal Rao >> wrote: >> >> >> >>> Hello, >> >>> >> >>> I have a 10 node cluster. There are 3 jobs. 1 which needs 2 nodes ( >> with >> >>> 1 task per node ), another which needs 4 nodes (with 1 task per node) >> >>> and >> >>> the third one which needs 4 nodes ( with 2 task on 1 node and 1 task >> >>> each >> >>> on the other 3 nodes ). 
>> >>> >> >>> Additional configuration in maui.cfg is : >> >>> >> >>> BACKFILLPOLICY FIRSTFIT >> >>> RESERVATIONPOLICY CURRENTHIGHEST >> >>> >> >>> ENABLEMULTIREQJOBS TRUE >> >>> NODEALLOCATIONPOLICY MINRESOURCE >> >>> NODEACCESSPOLICY SINGLEJOB >> >>> JOBNODEMATCHPOLICY EXACTNODE >> >>> >> >>> I am observing that if the first 2 jobs are running, the third one >> does >> >>> not start ( even though 4 nodes are available ) until 1 of the jobs >> >>> complete. With checkjob -v it shows the following output : >> >>> >> >>> ------------------ >> >>> >> >>> checking job 5791 (RM job '5791.fire16.csa.local') >> >>> >> >>> State: Idle >> >>> Creds: user:kunal group:kunal class:batch qos:DEFAULT >> >>> WallTime: 00:00:00 of 00:04:51 >> >>> SubmitTime: Wed May 23 11:52:04 >> >>> (Time Queued Total: 00:48:52 Eligible: 00:48:52) >> >>> >> >>> StartDate: 00:00:01 Wed May 23 12:40:57 >> >>> Total Tasks: 2 >> >>> >> >>> Req[0] TaskCount: 2 Partition: ALL >> >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 >> >>> Opsys: [NONE] Arch: [NONE] Features: [NONE] >> >>> Exec: '' ExecSize: 0 ImageSize: 0 >> >>> Dedicated Resources Per Task: PROCS: 1 >> >>> NodeAccess: SINGLEJOB >> >>> TasksPerNode: 2 NodeCount: 1 >> >>> >> >>> Req[1] TaskCount: 3 Partition: ALL >> >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 >> >>> Opsys: [NONE] Arch: [NONE] Features: [NONE] >> >>> Exec: '' ExecSize: 0 ImageSize: 0 >> >>> Dedicated Resources Per Task: PROCS: 1 >> >>> NodeAccess: SINGLEJOB >> >>> NodeCount: 3 >> >>> >> >>> >> >>> IWD: [NONE] Executable: [NONE] >> >>> Bypass: 5 StartCount: 0 >> >>> PartitionMask: [ALL] >> >>> Flags: RESTARTABLE >> >>> >> >>> Reservation '5791' (00:00:01 -> 00:04:52 Duration: 00:04:51) >> >>> PE: 5.00 StartPriority: 48 >> >>> cannot select job 5791 for partition DEFAULT (startdate in '00:00:01') >> >>> >> >>> ------------ >> >>> >> >>> What could be the reason for not starting this job ? How do I resolve >> >>> this ? 
>> >>> >> >>> Thanks, >> >>> Kunal >> >>> >> >> >> >> >> >> >> >> _______________________________________________ >> >> torqueusers mailing list >> >> torqueusers at supercluster.org >> >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> >> >> > >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120531/57bcf8dc/attachment-0001.html From jujj603 at gmail.com Thu May 31 20:23:55 2012 From: jujj603 at gmail.com (Ju JiaJia) Date: Fri, 1 Jun 2012 10:23:55 +0800 Subject: [torqueusers] Fwd: Multi-req job not starting In-Reply-To: References: Message-ID: Seems all be ok. I think you could try to delete the additional configuration in maui.cfg. like NODEALLOCATIONPOLICY, NODEACCESSPOLICY, or use default or other options. On Fri, Jun 1, 2012 at 9:59 AM, Kunal Rao wrote: > Each node has 16 cores. TORQUE_HOME/sever_priv/nodes file has for each of > the 10 nodes : > > np=16 gpus=1 > > Thanks, > Kunal > > > On Thu, May 31, 2012 at 9:54 PM, Ju JiaJia wrote: > >> How many cores on each of the 10 nodes ? I mean you are trying to >> allocate 2 processors on one node. And how did you >> configure TORQUE_HOME/server_priv/nodes ? >> >> >> On Fri, Jun 1, 2012 at 8:54 AM, Kunal Rao wrote: >> >>> Queue / Server configuration : >>> >>> --------------- >>> >>> qmgr -c 'p s' >>> # >>> # Create queues and set their attributes. 
>>> # >>> # >>> # Create and define queue batch >>> # >>> create queue batch >>> set queue batch queue_type = Execution >>> set queue batch resources_default.nodes = 1 >>> set queue batch resources_default.walltime = 01:00:00 >>> set queue batch enabled = True >>> set queue batch started = True >>> # >>> # Set server attributes. >>> # >>> set server scheduling = True >>> set server acl_hosts = fire16 >>> set server acl_roots = root at fire16.csa.local >>> set server managers = root at fire16.csa.local >>> set server operators = root at fire16.csa.local >>> set server default_queue = batch >>> set server log_events = 511 >>> set server mail_from = adm >>> set server scheduler_iteration = 20 >>> set server node_check_rate = 150 >>> set server tcp_timeout = 6 >>> set server mom_job_sync = True >>> set server keep_completed = 300 >>> set server allow_node_submit = True >>> set server next_job_number = 6331 >>> >>> --------------- >>> >>> Job resource requirement : >>> >>> --------- >>> >>> #PBS -l nodes=1:ppn=2+3,walltime=0:05:00 >>> >>> --------- >>> >>> "pbsnodes -a" shows all the 10 nodes in "free" state. So, they are all >>> accessible. >>> >>> Thanks, >>> Kunal >>> >>> >>> On 5/31/12, Ju JiaJia wrote: >>> > Please give your queue/server configuration and your job's resources >>> need, >>> > cpu/memory etc. And Does all the 10 nodes accessable? You can use >>> pbsnodes >>> > to check this. >>> > >>> > On Thu, May 31, 2012 at 10:53 PM, Kunal Rao >>> wrote: >>> > >>> >> Hello, >>> >> >>> >> Please see the below message. I had posted it on maui users mailing >>> list, >>> >> but did not get any response, so thought of posting it here on torque >>> >> users >>> >> mailing list (incase someone would know). Kindly let me know if you >>> have >>> >> any comments / ideas / suggestions. 
>>> >> >>> >> Thanks, >>> >> Kunal >>> >> >>> >> ---------- Forwarded message ---------- >>> >> From: Kunal Rao >>> >> Date: Wed, May 23, 2012 at 2:30 PM >>> >> Subject: Re: Multi-req job not starting >>> >> To: mauiusers at supercluster.org >>> >> >>> >> >>> >> There was a similar post earlier : >>> >> >>> http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html >>> >> >>> >> But did not find any response to it. Can anyone please provide some >>> ideas >>> >> / suggestion on this issue. >>> >> >>> >> Thanks, >>> >> Kunal >>> >> >>> >> >>> >> On Wed, May 23, 2012 at 2:26 PM, Kunal Rao >>> wrote: >>> >> >>> >>> Hello, >>> >>> >>> >>> I have a 10 node cluster. There are 3 jobs. 1 which needs 2 nodes ( >>> with >>> >>> 1 task per node ), another which needs 4 nodes (with 1 task per node) >>> >>> and >>> >>> the third one which needs 4 nodes ( with 2 task on 1 node and 1 task >>> >>> each >>> >>> on the other 3 nodes ). >>> >>> >>> >>> Additional configuration in maui.cfg is : >>> >>> >>> >>> BACKFILLPOLICY FIRSTFIT >>> >>> RESERVATIONPOLICY CURRENTHIGHEST >>> >>> >>> >>> ENABLEMULTIREQJOBS TRUE >>> >>> NODEALLOCATIONPOLICY MINRESOURCE >>> >>> NODEACCESSPOLICY SINGLEJOB >>> >>> JOBNODEMATCHPOLICY EXACTNODE >>> >>> >>> >>> I am observing that if the first 2 jobs are running, the third one >>> does >>> >>> not start ( even though 4 nodes are available ) until 1 of the jobs >>> >>> complete. 
With checkjob -v it shows the following output : >>> >>> >>> >>> ------------------ >>> >>> >>> >>> checking job 5791 (RM job '5791.fire16.csa.local') >>> >>> >>> >>> State: Idle >>> >>> Creds: user:kunal group:kunal class:batch qos:DEFAULT >>> >>> WallTime: 00:00:00 of 00:04:51 >>> >>> SubmitTime: Wed May 23 11:52:04 >>> >>> (Time Queued Total: 00:48:52 Eligible: 00:48:52) >>> >>> >>> >>> StartDate: 00:00:01 Wed May 23 12:40:57 >>> >>> Total Tasks: 2 >>> >>> >>> >>> Req[0] TaskCount: 2 Partition: ALL >>> >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 >>> >>> Opsys: [NONE] Arch: [NONE] Features: [NONE] >>> >>> Exec: '' ExecSize: 0 ImageSize: 0 >>> >>> Dedicated Resources Per Task: PROCS: 1 >>> >>> NodeAccess: SINGLEJOB >>> >>> TasksPerNode: 2 NodeCount: 1 >>> >>> >>> >>> Req[1] TaskCount: 3 Partition: ALL >>> >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0 >>> >>> Opsys: [NONE] Arch: [NONE] Features: [NONE] >>> >>> Exec: '' ExecSize: 0 ImageSize: 0 >>> >>> Dedicated Resources Per Task: PROCS: 1 >>> >>> NodeAccess: SINGLEJOB >>> >>> NodeCount: 3 >>> >>> >>> >>> >>> >>> IWD: [NONE] Executable: [NONE] >>> >>> Bypass: 5 StartCount: 0 >>> >>> PartitionMask: [ALL] >>> >>> Flags: RESTARTABLE >>> >>> >>> >>> Reservation '5791' (00:00:01 -> 00:04:52 Duration: 00:04:51) >>> >>> PE: 5.00 StartPriority: 48 >>> >>> cannot select job 5791 for partition DEFAULT (startdate in >>> '00:00:01') >>> >>> >>> >>> ------------ >>> >>> >>> >>> What could be the reason for not starting this job ? How do I resolve >>> >>> this ? 
>>> >>>
>>> >>> Thanks,
>>> >>> Kunal

From kunalgrao at gmail.com  Thu May 31 20:26:30 2012
From: kunalgrao at gmail.com (Kunal Rao)
Date: Thu, 31 May 2012 22:26:30 -0400
Subject: [torqueusers] Fwd: Multi-req job not starting
In-Reply-To:
References:
Message-ID:

I need NODEACCESSPOLICY; maybe I'll remove NODEALLOCATIONPOLICY and check
tomorrow.

Thanks,
Kunal

On Thu, May 31, 2012 at 10:23 PM, Ju JiaJia wrote:

> Everything seems to be OK. I think you could try deleting the additional
> configuration in maui.cfg, such as NODEALLOCATIONPOLICY and
> NODEACCESSPOLICY, or using the default or other options.
>
> On Fri, Jun 1, 2012 at 9:59 AM, Kunal Rao wrote:
>
>> Each node has 16 cores. The TORQUE_HOME/server_priv/nodes file has, for
>> each of the 10 nodes:
>>
>> np=16 gpus=1
>>
>> Thanks,
>> Kunal
>>
>> On Thu, May 31, 2012 at 9:54 PM, Ju JiaJia wrote:
>>
>>> How many cores are on each of the 10 nodes? I mean, you are trying to
>>> allocate 2 processors on one node. And how did you configure
>>> TORQUE_HOME/server_priv/nodes?
>>>
>>> On Fri, Jun 1, 2012 at 8:54 AM, Kunal Rao wrote:
>>>
>>>> Queue / Server configuration:
>>>>
>>>> ---------------
>>>>
>>>> qmgr -c 'p s'
>>>> #
>>>> # Create queues and set their attributes.
>>>> #
>>>> #
>>>> # Create and define queue batch
>>>> #
>>>> create queue batch
>>>> set queue batch queue_type = Execution
>>>> set queue batch resources_default.nodes = 1
>>>> set queue batch resources_default.walltime = 01:00:00
>>>> set queue batch enabled = True
>>>> set queue batch started = True
>>>> #
>>>> # Set server attributes.
>>>> #
>>>> set server scheduling = True
>>>> set server acl_hosts = fire16
>>>> set server acl_roots = root at fire16.csa.local
>>>> set server managers = root at fire16.csa.local
>>>> set server operators = root at fire16.csa.local
>>>> set server default_queue = batch
>>>> set server log_events = 511
>>>> set server mail_from = adm
>>>> set server scheduler_iteration = 20
>>>> set server node_check_rate = 150
>>>> set server tcp_timeout = 6
>>>> set server mom_job_sync = True
>>>> set server keep_completed = 300
>>>> set server allow_node_submit = True
>>>> set server next_job_number = 6331
>>>>
>>>> ---------------
>>>>
>>>> Job resource requirement:
>>>>
>>>> ---------
>>>>
>>>> #PBS -l nodes=1:ppn=2+3,walltime=0:05:00
>>>>
>>>> ---------
>>>>
>>>> "pbsnodes -a" shows all 10 nodes in the "free" state, so they are all
>>>> accessible.
>>>>
>>>> Thanks,
>>>> Kunal
>>>>
>>>> On 5/31/12, Ju JiaJia wrote:
>>>> > Please give your queue/server configuration and your job's resource
>>>> > needs (CPU, memory, etc.). Are all 10 nodes accessible? You can use
>>>> > pbsnodes to check this.
>>>> >
>>>> > On Thu, May 31, 2012 at 10:53 PM, Kunal Rao wrote:
>>>> >
>>>> >> Hello,
>>>> >>
>>>> >> Please see the message below. I had posted it on the maui users
>>>> >> mailing list but did not get any response, so I thought of posting it
>>>> >> here on the torque users mailing list (in case someone would know).
>>>> >> Kindly let me know if you have any comments / ideas / suggestions.
>>>> >>
>>>> >> Thanks,
>>>> >> Kunal
>>>> >>
>>>> >> ---------- Forwarded message ----------
>>>> >> From: Kunal Rao
>>>> >> Date: Wed, May 23, 2012 at 2:30 PM
>>>> >> Subject: Re: Multi-req job not starting
>>>> >> To: mauiusers at supercluster.org
>>>> >>
>>>> >> There was a similar post earlier:
>>>> >> http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html
>>>> >>
>>>> >> But I did not find any response to it. Can anyone please provide some
>>>> >> ideas / suggestions on this issue?
>>>> >>
>>>> >> Thanks,
>>>> >> Kunal
>>>> >>
>>>> >> On Wed, May 23, 2012 at 2:26 PM, Kunal Rao wrote:
>>>> >>
>>>> >>> Hello,
>>>> >>>
>>>> >>> I have a 10-node cluster. There are 3 jobs: one which needs 2 nodes
>>>> >>> (with 1 task per node), another which needs 4 nodes (with 1 task per
>>>> >>> node), and a third which needs 4 nodes (with 2 tasks on 1 node and
>>>> >>> 1 task each on the other 3 nodes).
>>>> >>>
>>>> >>> Additional configuration in maui.cfg is:
>>>> >>>
>>>> >>> BACKFILLPOLICY FIRSTFIT
>>>> >>> RESERVATIONPOLICY CURRENTHIGHEST
>>>> >>>
>>>> >>> ENABLEMULTIREQJOBS TRUE
>>>> >>> NODEALLOCATIONPOLICY MINRESOURCE
>>>> >>> NODEACCESSPOLICY SINGLEJOB
>>>> >>> JOBNODEMATCHPOLICY EXACTNODE
>>>> >>>
>>>> >>> I am observing that if the first 2 jobs are running, the third one
>>>> >>> does not start (even though 4 nodes are available) until one of the
>>>> >>> jobs completes. With checkjob -v it shows the following output:
>>>> >>>
>>>> >>> ------------------
>>>> >>>
>>>> >>> checking job 5791 (RM job '5791.fire16.csa.local')
>>>> >>>
>>>> >>> State: Idle
>>>> >>> Creds: user:kunal group:kunal class:batch qos:DEFAULT
>>>> >>> WallTime: 00:00:00 of 00:04:51
>>>> >>> SubmitTime: Wed May 23 11:52:04
>>>> >>> (Time Queued Total: 00:48:52 Eligible: 00:48:52)
>>>> >>>
>>>> >>> StartDate: 00:00:01 Wed May 23 12:40:57
>>>> >>> Total Tasks: 2
>>>> >>>
>>>> >>> Req[0] TaskCount: 2 Partition: ALL
>>>> >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
>>>> >>> Opsys: [NONE] Arch: [NONE] Features: [NONE]
>>>> >>> Exec: '' ExecSize: 0 ImageSize: 0
>>>> >>> Dedicated Resources Per Task: PROCS: 1
>>>> >>> NodeAccess: SINGLEJOB
>>>> >>> TasksPerNode: 2 NodeCount: 1
>>>> >>>
>>>> >>> Req[1] TaskCount: 3 Partition: ALL
>>>> >>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
>>>> >>> Opsys: [NONE] Arch: [NONE] Features: [NONE]
>>>> >>> Exec: '' ExecSize: 0 ImageSize: 0
>>>> >>> Dedicated Resources Per Task: PROCS: 1
>>>> >>> NodeAccess: SINGLEJOB
>>>> >>> NodeCount: 3
>>>> >>>
>>>> >>> IWD: [NONE] Executable: [NONE]
>>>> >>> Bypass: 5 StartCount: 0
>>>> >>> PartitionMask: [ALL]
>>>> >>> Flags: RESTARTABLE
>>>> >>>
>>>> >>> Reservation '5791' (00:00:01 -> 00:04:52 Duration: 00:04:51)
>>>> >>> PE: 5.00 StartPriority: 48
>>>> >>> cannot select job 5791 for partition DEFAULT (startdate in '00:00:01')
>>>> >>>
>>>> >>> ------------
>>>> >>>
>>>> >>> What could be the reason for this job not starting? How do I resolve
>>>> >>> this?
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Kunal

From damianmontaldo at gmail.com  Thu May 31 18:30:55 2012
From: damianmontaldo at gmail.com (Damian Montaldo)
Date: Thu, 31 May 2012 21:30:55 -0300
Subject: [torqueusers] Segmentation fault when using OpenMPI -pernode option
Message-ID:

Hi,

I need some help with Torque and a specific OpenMPI option.
I have nodes with 4 processors each, and I want to launch only one
process on each node using the -pernode option, because I need to make
sure that Torque does not queue other jobs on that node.

As the manual says:
  On each node, launch one process (equivalent to -npernode 1)

This is the error I got.
I tried to google it, but a segmentation fault is a very common error,
and most reports relate it to the binary executed by mpiexec. I think
this is a Torque-specific error, because running mpirun with a host file
and -pernode seems to work.

$ cat TEST.e37495
[n52:04352] *** Process received signal ***
[n52:04352] Signal: Segmentation fault (11)
[n52:04352] Signal code: Address not mapped (1)
[n52:04352] Failing at address: 0x50
[n52:04352] [ 0] /lib/libpthread.so.0(+0xeff0) [0x2aca79ff4ff0]
[n52:04352] [ 1] /usr/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xbc) [0x2aca792c334c]
[n52:04352] [ 2] /usr/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x2d4) [0x2aca792d1ea4]
[n52:04352] [ 3] /usr/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x11e) [0x2aca792d596e]
[n52:04352] [ 4] /usr/lib/openmpi/lib/openmpi/mca_plm_tm.so(+0x1d4a) [0x2aca7b382d4a]
[n52:04352] [ 5] mpiexec() [0x403aaf]
[n52:04352] [ 6] mpiexec() [0x402f74]
[n52:04352] [ 7] /lib/libc.so.6(__libc_start_main+0xfd) [0x2aca7a220c8d]
[n52:04352] [ 8] mpiexec() [0x402e99]
[n52:04352] *** End of error message ***
/var/spool/torque/mom_priv/jobs/37495....SC: line 107: 4352 Segmentation fault mpiexec -verbose -pernode -np $NP python ..args...
[n48:15977] [[10692,0],2] routed:binomial: Connection to lifeline [[10692,0],0] lost
[n49:15992] [[10692,0],1] routed:binomial: Connection to lifeline [[10692,0],0] lost
[n42:16290] [[10692,0],3] routed:binomial: Connection to lifeline [[10692,0],0] lost

$ qstat -f 37495
Job Id: 37495
    Job_Name = TEST
    resources_used.cput = 00:00:00
    resources_used.mem = 532kb
    resources_used.vmem = 9056kb
    resources_used.walltime = 00:00:01
    job_state = C
    queue = batch
    server = n0
    Checkpoint = u
    ctime = Thu May 31 20:42:47 2012
    exec_host = n52/3+n52/2+n52/1+n52/0+n49/3+n49/2+n49/1+n49/0+n48/3+n48/2+n4
        8/1+n48/0+n42/3+n42/2+n42/1+n42/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = eo
    Mail_Points = abe
    mtime = Thu May 31 20:43:21 2012
    Priority = 0
    qtime = Thu May 31 20:42:47 2012
    Rerunable = True
    Resource_List.nodect = 4
    Resource_List.nodes = 4:ppn=4
    Resource_List.walltime = 01:00:00
    session_id = 4342
    Variable_List = PBS_O_LANG=es_AR.UTF-8,
        PBS_O_LOGNAME=dfslezak,
        PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games,
        PBS_O_SHELL=/bin/bash,PBS_SERVER=n0,
        PBS_O_QUEUE=batch,
        PBS_O_HOST=n0
    comment = Job started on Thu May 31 at 20:43
    etime = Thu May 31 20:42:47 2012
    exit_status = 0
    submit_args = -l walltime=1:00:00
    start_time = Thu May 31 20:43:21 2012
    Walltime.Remaining = 360
    start_count = 1
    fault_tolerant = False
    comp_time = Thu May 31 20:43:21 2012

$ mpiexec --version
mpiexec (OpenRTE) 1.4.2

It doesn't seem to be related to Python, but this is the version:
$ python --version
Python 2.6.6

It is a Debian Linux system (squeeze, up to date) with this Torque version:

$ dpkg -l | grep torque
ii  libtorque2        2.4.8+dfsg-9squeeze1  shared library for Torque client and server
ii  torque-client     2.4.8+dfsg-9squeeze1  command line interface to Torque server
ii  torque-common     2.4.8+dfsg-9squeeze1  Torque Queueing System shared files
ii  torque-mom        2.4.8+dfsg-9squeeze1  job execution engine for Torque batch system
ii  torque-scheduler  2.4.8+dfsg-9squeeze1  scheduler part of Torque
ii  torque-server     2.4.8+dfsg-9squeeze1  PBS-derived batch processing server

If you need more specific info (perhaps a qmgr print server?), just ask.
And of course, any help would be much appreciated!
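
To illustrate what one process per node means for the job above: the exec_host value reported by qstat -f lists four slots on each of four nodes, so a one-per-node launch would start four processes. The following is only a minimal Python sketch of that counting. The helper name hosts_from_exec_host is made up for illustration and is not part of Torque or Open MPI, and it assumes the "host/slot+host/slot" layout seen in the qstat output, including the line wrap that qstat inserts into long values.

```python
def hosts_from_exec_host(exec_host):
    """Return the distinct hosts of an exec_host string, in first-seen order.

    Hypothetical helper: assumes entries look like "host/slot" joined by "+",
    as in the qstat -f output above.
    """
    # qstat -f wraps long values across lines; drop all whitespace to rejoin them.
    joined = "".join(exec_host.split())
    hosts = []
    for slot in joined.split("+"):
        host = slot.split("/", 1)[0]
        if host and host not in hosts:
            hosts.append(host)
    return hosts


# Value copied from the qstat output above, wrap included.
exec_host = ("n52/3+n52/2+n52/1+n52/0+n49/3+n49/2+n49/1+n49/0+n48/3+n48/2+n4\n"
             "        8/1+n48/0+n42/3+n42/2+n42/1+n42/0")

hosts = hosts_from_exec_host(exec_host)
print(len(hosts), hosts)
```

The number of distinct hosts is the number of processes a one-per-node launch would start for this allocation, which makes it easy to compare against whatever value $NP holds in the job script.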