From Gareth.Williams at csiro.au Tue Nov 1 17:17:16 2011 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Wed, 2 Nov 2011 10:17:16 +1100 Subject: [torqueusers] job dependencies, requeuing and routing Message-ID: <007DECE986B47F4EABF823C1FBB19C620102BE58CAF3@exvic-mbx04.nexus.csiro.au> Hi All, We recently had a few events as follows: - job 'A' was queued followed by job 'B' depending on 'A' - when the scheduler decided to start job 'A' the pbs_mom failed to start the job within some timelimit and 'A' went back to the queue (the node was apparently particularly busy) - some time later job 'A' ran successfully, but the dependency for job 'B' on "A' was not deemed to be satisfied and job 'B' was left stranded. Many other jobs with similar dependencies have been working fine. I suspect the problem is related to the jobs being submitted to a routing queue. The routing setup is simple in that it (mostly) just puts jobs in a single execution queue. The requeued job goes back to the routing queue and the dependent job stays in the execution queue. I'm not sure how to reproduce the job start failure - which makes it difficult to diagnose the problem further. We are running 3.0.3-snap.201108261653 and moab client 6.0.2 (revision 3, changeset b88217da5915d0a5ec6480f06677cb36e5fa7305) Has anyone seen similar/related problems? Or does anyone know enough about job dependencies to know if the implementation might have a bug when combined with routing and job start failure in this way? Do you think the problem is in torque or moab? Regards, Gareth From carlos.camara at wimasis.com Wed Nov 2 02:31:22 2011 From: carlos.camara at wimasis.com (=?ISO-8859-1?Q?=22Carlos_M=2E_C=E1mara=22?=) Date: Wed, 02 Nov 2011 09:31:22 +0100 Subject: [torqueusers] using non-privileged ports In-Reply-To: References: Message-ID: <4EB0FFDA.10709@wimasis.com> Hi Martin!! I use Torque without privports and Munge and it works fine. 
The first explanation that comes to mind is that you compiled the client with --disable-privports but somehow not the server, or that your server is still running the privport build. Another likely cause is a firewall rule blocking the communication.

All the best.

On 10/28/2011 10:57 PM, torqueusers-request at supercluster.org wrote:
> Date: Fri, 28 Oct 2011 10:39:56 -0700
> From: Martin Siegert
> Subject: [torqueusers] using non-privileged ports
> To: torqueusers at supercluster.org
> Message-ID: <20111028173956.GB5953 at stikine.sfu.ca>
> Content-Type: text/plain; charset=us-ascii
>
> Hi,
>
> we just recompiled torque with
>
> --disable-privports
>
> (since we constantly ran out of ports). Now we have a different
> problem which is just as bad:
>
> # qstat -an1
> Connection timed out
> qstat: cannot connect to server b0 (errno=110) Connection timed out
>
> This does not appear right away after starting the server, but after
> a few hours of running. As far as I can tell the only way to get the
> server out of this state is to restart it.
>
> But there must be many sites that run torque with --disable-privports.
> Thus: what am I missing?
>
> Cheers,
> Martin
>
> --
> Martin Siegert
> Head, Research Computing
> Simon Fraser University

From Eirikur.Hjartarson at decode.is Wed Nov 2 04:29:42 2011
From: Eirikur.Hjartarson at decode.is (Eiríkur Hjartarson)
Date: Wed, 2 Nov 2011 10:29:42 +0000
Subject: [torqueusers] job dependencies, requeuing and routing
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102BE58CAF3@exvic-mbx04.nexus.csiro.au>
References: <007DECE986B47F4EABF823C1FBB19C620102BE58CAF3@exvic-mbx04.nexus.csiro.au>
Message-ID: <4C3D9F9382FE07458AF8E79B234B4BD8AF325B@smbx.decode.is>

Hi,

I described similar problems in http://www.clusterresources.com/pipermail/torqueusers/2011-September/013404.html that are still unresolved. We are using torque 2.5.8 (and maui 3.3.1).

Regards,
--
Eiríkur Hjartarson        E-mail: Eirikur.Hjartarson at decode.is
Íslensk Erfðagreining     Mobile: +3546641898
Sturlugötu 7
IS-101 Reykjavík
________________________________________
From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] on behalf of Gareth.Williams at csiro.au [Gareth.Williams at csiro.au]
Sent: Tuesday, November 01, 2011 11:17 PM
To: torqueusers at supercluster.org; moabusers at supercluster.org
Subject: [torqueusers] job dependencies, requeuing and routing

Hi All,

We recently had a few events as follows:
- job 'A' was queued followed by job 'B' depending on 'A'
- when the scheduler decided to start job 'A' the pbs_mom failed to start the job within some timelimit and 'A' went back to the queue (the node was apparently particularly busy)
- some time later job 'A' ran successfully, but the dependency for job 'B' on 'A' was not deemed to be satisfied and job 'B' was left stranded.

Many other jobs with similar dependencies have been working fine.

I suspect the problem is related to the jobs being submitted to a routing queue. The routing setup is simple in that it (mostly) just puts jobs in a single execution queue.
The requeued job goes back to the routing queue and the dependent job stays in the execution queue. I'm not sure how to reproduce the job start failure - which makes it difficult to diagnose the problem further. We are running 3.0.3-snap.201108261653 and moab client 6.0.2 (revision 3, changeset b88217da5915d0a5ec6480f06677cb36e5fa7305) Has anyone seen similar/related problems? Or does anyone know enough about job dependencies to know if the implementation might have a bug when combined with routing and job start failure in this way? Do you think the problem is in torque or moab? Regards, Gareth _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From roy.dragseth at cc.uit.no Thu Nov 3 02:58:56 2011 From: roy.dragseth at cc.uit.no (Roy Dragseth) Date: Thu, 3 Nov 2011 09:58:56 +0100 Subject: [torqueusers] Disabling cpusets. Message-ID: <201111030958.56644.roy.dragseth@cc.uit.no> Hi. Is it possible to have cpusets compiled in, but disable it at runtime? I did some code spelunking and couldn't find any way to turn it off. Did I overlook anything? r. From roy.dragseth at cc.uit.no Thu Nov 3 03:19:44 2011 From: roy.dragseth at cc.uit.no (Roy Dragseth) Date: Thu, 3 Nov 2011 10:19:44 +0100 Subject: [torqueusers] memory_pressure and oom-kill. Message-ID: <201111031019.44274.roy.dragseth@cc.uit.no> I've been playing around with cpusets in torque and stumbled across the memory_pressure thingies you can configure to make pbs_mom take action against jobs growing out of their memory limits. I got very excited to say the least, as this might make it possible for us to allow jobs from multiple users run on the same compute node. (We currently run a SINGLEUSER policy in maui and thus takes a hit of around 5-10% on the utilization. This is likely to worsen as we move towards more cores per node.) 
However, some testing revealed a very serious issue: If a job passes its memory_pressure limits it will be killed no matter if it is overallocation its memory or not. So if you allow multiple jobs from multiple users to run on a compute node you can get into scenarios where a well-behaving job from userA is killed because userB did something stupid. Not a good situation. It should be fairly simple to check the memory consumption of the job before deciding to take action. One could of course rely on oom-kill to take action, but that doesn't come into play until all the swap is consumed. That is too late for HPC uses, right? r. From dbeer at adaptivecomputing.com Thu Nov 3 09:59:48 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Thu, 03 Nov 2011 09:59:48 -0600 (MDT) Subject: [torqueusers] Disabling cpusets. In-Reply-To: <201111030958.56644.roy.dragseth@cc.uit.no> Message-ID: <542ef575-2763-462e-ad45-ed60b1830202@mail> ----- Original Message ----- > Hi. > > Is it possible to have cpusets compiled in, but disable it at > runtime? > > I did some code spelunking and couldn't find any way to turn it off. > Did I > overlook anything? > > r. > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > Roy, We don't have any way to turn cpusets off at runtime, although there is a way to make it happen in TORQUE: Look at the geometry requests feature: http://www.adaptivecomputing.com/resources/docs/torque/3.6schedulingcores.php Note that if geometry requests are configured by themselves, then you only get a cpuset if you explicitly request one. In order to implement this feature, you'd need to write a submit filter that adds the geometry request to the job unless some parameter is passed in (you get to make up whatever parameter you want this to be, and just don't pass it through to qsub from your filter). 
Essentially, you can take the submit filter here: http://www.adaptivecomputing.com/resources/docs/torque/a.jqsubwrapper.php and add logic to save a parameter request if it is there (your special parameter) and if it isn't there, then you also write the appropriate geometry request to STDOUT. Note that geometry requests make nodes exclusive-use only, so based on your other email it may or may not be a viable solution. -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From dbeer at adaptivecomputing.com Thu Nov 3 10:01:56 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Thu, 03 Nov 2011 10:01:56 -0600 (MDT) Subject: [torqueusers] memory_pressure and oom-kill. In-Reply-To: <201111031019.44274.roy.dragseth@cc.uit.no> Message-ID: <350c2bb0-c802-4659-b91e-3cf453221c5e@mail> ----- Original Message ----- > I've been playing around with cpusets in torque and stumbled across > the > memory_pressure thingies you can configure to make pbs_mom take > action against > jobs growing out of their memory limits. I got very excited to say > the least, > as this might make it possible for us to allow jobs from multiple > users run on > the same compute node. (We currently run a SINGLEUSER policy in maui > and thus > takes a hit of around 5-10% on the utilization. This is likely to > worsen as > we move towards more cores per node.) > However, some testing revealed a very serious issue: If a job > passes its > memory_pressure limits it will be killed no matter if it is > overallocation its > memory or not. So if you allow multiple jobs from multiple users to > run on a > compute node you can get into scenarios where a well-behaving job > from userA > is killed because userB did something stupid. Not a good situation. > It > should be fairly simple to check the memory consumption of the job > before > deciding to take action. > I'm not certain if a well-behaving job would get killed or not. 
I think it'd be good to run some tests and see how it happens in practice, although I certainly see the possibility of a well-behaving job also getting killed. > One could of course rely on oom-kill to take action, but that doesn't > come > into play until all the swap is consumed. That is too late for HPC > uses, > right? Other users/customers have found the OOM to come into play too late to help them. > > r. > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From djohnson at osc.edu Thu Nov 3 10:06:24 2011 From: djohnson at osc.edu (Doug Johnson) Date: Thu, 03 Nov 2011 12:06:24 -0400 Subject: [torqueusers] Disabling cpusets. In-Reply-To: <542ef575-2763-462e-ad45-ed60b1830202@mail> References: <201111030958.56644.roy.dragseth@cc.uit.no> <542ef575-2763-462e-ad45-ed60b1830202@mail> Message-ID: At Thu, 3 Nov 2011 09:59:48 -0600, David Beer wrote: > > > > ----- Original Message ----- > > Hi. > > > > Is it possible to have cpusets compiled in, but disable it at > > runtime? > > > > I did some code spelunking and couldn't find any way to turn it off. > > Did I > > overlook anything? > > > > r. > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > Roy, > > We don't have any way to turn cpusets off at runtime, although there is a way to make it happen in TORQUE: > > Look at the geometry requests feature: http://www.adaptivecomputing.com/resources/docs/torque/3.6schedulingcores.php > > Note that if geometry requests are configured by themselves, then you only get a cpuset if you explicitly request one. 
> In order to implement this feature, you'd need to write a submit filter that adds the geometry request to the job unless some parameter is passed in (you get to make up whatever parameter you want this to be, and just don't pass it through to qsub from your filter). Essentially, you can take the submit filter here: http://www.adaptivecomputing.com/resources/docs/torque/a.jqsubwrapper.php and add logic to save a parameter request if it is there (your special parameter) and if it isn't there, then you also write the appropriate geometry request to STDOUT. Note that geometry requests make nodes exclusive-use only, so based on your other email it may or may not be a viable solution.
>

How graceful is the cpusets code if the cpuset/cgroup pseudo file system is not mounted?

Doug

From david at unistra.fr Thu Nov 3 10:24:57 2011
From: david at unistra.fr (R. David)
Date: Thu, 3 Nov 2011 17:24:57 +0100
Subject: [torqueusers] Disabling cpusets.
In-Reply-To:
References: <201111030958.56644.roy.dragseth@cc.uit.no> <542ef575-2763-462e-ad45-ed60b1830202@mail>
Message-ID:

On 3 Nov 2011, at 17:06, Doug Johnson wrote:

> At Thu, 3 Nov 2011 09:59:48 -0600,
> David Beer wrote:
>>
>> ----- Original Message -----
>>> Hi.
>>>
>>> Is it possible to have cpusets compiled in, but disable it at
>>> runtime?
>>>
>>> I did some code spelunking and couldn't find any way to turn it off.
>>> Did I overlook anything?
>>>
>>> r.
>>
>> Roy,
>>
>> We don't have any way to turn cpusets off at runtime, although there is a way to make it happen in TORQUE:
>>
>> Look at the geometry requests feature: http://www.adaptivecomputing.com/resources/docs/torque/3.6schedulingcores.php
>>
>> Note that if geometry requests are configured by themselves, then you only get a cpuset if you explicitly request one.
>> In order to implement this feature, you'd need to write a submit filter that adds the geometry request to the job unless some parameter is passed in (you get to make up whatever parameter you want this to be, and just don't pass it through to qsub from your filter). Essentially, you can take the submit filter here: http://www.adaptivecomputing.com/resources/docs/torque/a.jqsubwrapper.php and add logic to save a parameter request if it is there (your special parameter) and if it isn't there, then you also write the appropriate geometry request to STDOUT. Note that geometry requests make nodes exclusive-use only, so based on your other email it may or may not be a viable solution.
>>
>
> How graceful is the cpusets code if the cpuset/cgroup pseudo file
> system is not mounted?

Not graceful at all: in our experience, it cannot start the job.

Regards,
R. David

From gus at ldeo.columbia.edu Thu Nov 3 11:54:25 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Thu, 3 Nov 2011 13:54:25 -0400
Subject: [torqueusers] Disabling cpusets.
In-Reply-To: <201111030958.56644.roy.dragseth@cc.uit.no>
References: <201111030958.56644.roy.dragseth@cc.uit.no>
Message-ID:

Hi Roy

In my experience, if you do not create /dev/cpuset and do not mount it, then the cpuset feature is disabled at runtime, even if Torque was compiled with cpuset enabled.

I posted a quick and dirty way to enable it on the Torque list some time ago. Can't find it now. Basically, add these lines to the pbs_mom init script:

    if [ ! -e /dev/cpuset ]; then
        mkdir /dev/cpuset
    fi
    if [ "`mount -t cpuset`x" = "x" ]; then
        mount -t cpuset none /dev/cpuset
    fi

If this is not done by pbs_mom, or by another system script, I presume cpuset is disabled.

I use Torque 2.4.11 and 2.5.4 in two different clusters. Not sure if the later versions of Torque do this differently.

I hope this helps,
Gus Correa

On Nov 3, 2011, at 4:58 AM, Roy Dragseth wrote:

> Hi.
> > Is it possible to have cpusets compiled in, but disable it at runtime? > > I did some code spelunking and couldn't find any way to turn it off. Did I > overlook anything? > > r. > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From Gareth.Williams at csiro.au Thu Nov 3 16:09:28 2011 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Fri, 4 Nov 2011 09:09:28 +1100 Subject: [torqueusers] Disabling cpusets. In-Reply-To: <201111030958.56644.roy.dragseth@cc.uit.no> References: <201111030958.56644.roy.dragseth@cc.uit.no> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102BE58CB04@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Roy Dragseth [mailto:roy.dragseth at cc.uit.no] > Sent: Thursday, 3 November 2011 7:59 PM > To: torqueusers at supercluster.org > Subject: [torqueusers] Disabling cpusets. > > Hi. > > Is it possible to have cpusets compiled in, but disable it at runtime? > > I did some code spelunking and couldn't find any way to turn it off. > Did I > overlook anything? > > r. Hi Roy, You could (probably) manipulate the cpuset in the (privileged) job prologue. This is relatively easy in the case where a job uses a single node but would be trickier for multi-node jobs. Gareth From Gareth.Williams at csiro.au Thu Nov 3 17:42:21 2011 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Fri, 4 Nov 2011 10:42:21 +1100 Subject: [torqueusers] Two problems with a routing queue Message-ID: <007DECE986B47F4EABF823C1FBB19C620102BE58CB05@exvic-mbx04.nexus.csiro.au> Getting back to this thread, Gus said: > What makes 'afterok' to be true? > Is it an empty stderr? > Something else? > Often times programs dump warning messages [not errors] in stderr, > the job ends 'OK' but stderr is not empty. > I prefer to use 'afterany' because of this doubt. 
> Thank you,
> Gus Correa

The exit code that the job returns determines whether afterok is true. That is the exit code of the last command in the script (assuming it doesn't get eaten by the shell logout, if there is one). It's available in the server and accounting logs as Exit_status and in qstat info for complete jobs as exit_status.

Sometimes you want the exit code of another command (not the last one), so do the following to make the script exit with a saved value (in bash). Note that there must be no spaces around the '=' in the assignment:

    Command_I_care_about some args
    myexitcode=$?
    More_commands_eg_to_clean_up
    exit $myexitcode

Gareth

From Gareth.Williams at csiro.au Thu Nov 3 22:22:09 2011
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Fri, 4 Nov 2011 15:22:09 +1100
Subject: [torqueusers] job dependencies, requeuing and routing
In-Reply-To: <4C3D9F9382FE07458AF8E79B234B4BD8AF325B@smbx.decode.is>
References: <007DECE986B47F4EABF823C1FBB19C620102BE58CAF3@exvic-mbx04.nexus.csiro.au> <4C3D9F9382FE07458AF8E79B234B4BD8AF325B@smbx.decode.is>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102BE58CB06@exvic-mbx04.nexus.csiro.au>

> -----Original Message-----
> From: Eiríkur Hjartarson [mailto:Eirikur.Hjartarson at decode.is]
> Sent: Wednesday, 2 November 2011 9:30 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] job dependencies, requeuing and routing
>
> Hi,
>
> I described similar problems in
> http://www.clusterresources.com/pipermail/torqueusers/2011-September/013404.html
> that are still unresolved. We are using torque
> 2.5.8 (and maui 3.3.1).
>
> Regards,
> --
> Eiríkur Hjartarson        E-mail: Eirikur.Hjartarson at decode.is
> Íslensk Erfðagreining     Mobile: +3546641898
> Sturlugötu 7
> IS-101 Reykjavík

Thanks Eiríkur,

I think the problems are indeed related and yours (the first one) should be easier to reproduce than mine. Your second problem is, I think, a separate problem which does not affect our site as we don't have the max_user_queuable limit.
We have been thinking of adding such a limit (to lower memory use in the scheduler) but this problem is a good reason for us to hold off doing so. Fixing the first problem may be needed to enable jobs with prerequisites to be held in the routing queue and still get released properly.

Ken, the job dependencies are being set up fine. The problem is that the associated holds are not being released in some cases as the prerequisite jobs complete. You can probably reproduce this using qrun as the scheduler.
- Set up queues qr (route) and qe (exec) with a small max_user_queuable limit (maybe 1).
- Keep complete jobs (for a small time, say 2 minutes).
- Submit to qr jobs ja, some intervening jobs (jc, jd, je) and jb with dependency on ja, holding ja until jb is submitted, then release ja
- ja should route to the execution queue and jc, jd, je and jb stay in qr
- qrun ja (on an available node)
- when it's done, jc should route to qe and can be qrun (then jd and je)
- jb should then route to qe but (I anticipate) will not qrun unless it's within the 2 minutes...

If the max_user_queuable limit is higher and the jobs all go straight to qe, I think the problem will not occur.

Later... Well, that is what I thought would happen. I tested and the situation is worse than I'd imagined.

I set max_user_queuable to 1 and that seems fine until you add in holds. If I submit a job with a hold (to make sure it's still there when I submit a dependent job) to the routing queue, then by default it does not route (route_held_jobs = false). Then I submit a dependent job and it does route, so I have one job that can't run in the execution queue. The job that needs to run first is still held in the routing queue. I release it with qrls -h u but it can't route, presumably because the max_user_queuable limit is already satisfied! This is Eiríkur's second problem.

If I just submit a single job with a hold to the routing queue and then release it, it routes to the execution queue. (Moab failed to start the job in my specific case, it seems because I have a mapping between classes/queues and nodes and it got confused... It was OK for another queue. Maui might have the same issue.)

If I submit a job with a time-dependent hold (qsub -a `date -d 'now + 5 minutes' +'%Y%m%d%H%M'` test.q) it stays in the routing queue until the time limit is met, then gets routed (I saw the same moab issue here too).

Trying harder and avoiding moab issues with a simpler queue and no max_user_queuable...

If I submit a job with a hold and a dependent job with a hold (qsub -h -W depend...) I can then release the first hold (qrls -h u) and the first job runs. If I release the second job soon enough (while the first job is running or during the keep_completed period), it routes to the execution queue, and when the first job is finished, it starts. However, if I don't release the second job until after the first is finished and the keep_completed period is over, the released job routes to the execution queue but remains held. Releasing the job a second time once it's in the execution queue (qrls -h u) allows it to be started.

I'm not sure where to go from here. What are the chances of these problems being fixed?

Gareth (who is at a conference next week so worked hard to get testing done today)

From roy.dragseth at cc.uit.no Fri Nov 4 02:37:12 2011
From: roy.dragseth at cc.uit.no (Roy Dragseth)
Date: Fri, 4 Nov 2011 09:37:12 +0100
Subject: [torqueusers] Disabling cpusets.
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102BE58CB04@exvic-mbx04.nexus.csiro.au> References: <201111030958.56644.roy.dragseth@cc.uit.no> <007DECE986B47F4EABF823C1FBB19C620102BE58CB04@exvic-mbx04.nexus.csiro.au> Message-ID: <201111040937.12448.roy.dragseth@cc.uit.no> On Thursday, November 03, 2011 23:09:28 Gareth.Williams at csiro.au wrote: > > -----Original Message----- > > From: Roy Dragseth [mailto:roy.dragseth at cc.uit.no] > > Sent: Thursday, 3 November 2011 7:59 PM > > To: torqueusers at supercluster.org > > Subject: [torqueusers] Disabling cpusets. > > > > Hi. > > > > Is it possible to have cpusets compiled in, but disable it at runtime? > > > > I did some code spelunking and couldn't find any way to turn it off. > > Did I > > overlook anything? > > > > r. > > Hi Roy, > > You could (probably) manipulate the cpuset in the (privileged) job > prologue. This is relatively easy in the case where a job uses a single > node but would be trickier for multi-node jobs. > Thanks for all the replies. My testing shows that if cpusets is enabled in torque and /dev/cpuset is not mounted on the node the job will go into an endless restart loop. This is for torque 3.0.2. Looking at the code it should be fairly easy to test if /dev/cpuset is mounted and disable the cpuset stuff in torque if it isn't. The reason I'm asking is that I'm the maintainer of the torque-roll for the Rocks Cluster Distro and some of the users have requested if I could enable the cpusets in torque. This is easy to do, but I'm worried that this could suddenly break a lot of stuff for other users not interested in this. So having cpusets enabled in torque with the possibility to turn it off would be a nice thing to have. Would anyone object to a patch for this? r. From roy.dragseth at cc.uit.no Fri Nov 4 02:58:50 2011 From: roy.dragseth at cc.uit.no (Roy Dragseth) Date: Fri, 4 Nov 2011 09:58:50 +0100 Subject: [torqueusers] memory_pressure and oom-kill. 
In-Reply-To: <350c2bb0-c802-4659-b91e-3cf453221c5e@mail> References: <350c2bb0-c802-4659-b91e-3cf453221c5e@mail> Message-ID: <201111040958.50820.roy.dragseth@cc.uit.no> On Thursday, November 03, 2011 17:01:56 David Beer wrote: > ----- Original Message ----- > > > I've been playing around with cpusets in torque and stumbled across > > the > > memory_pressure thingies you can configure to make pbs_mom take > > action against > > jobs growing out of their memory limits. I got very excited to say > > the least, > > as this might make it possible for us to allow jobs from multiple > > users run on > > the same compute node. (We currently run a SINGLEUSER policy in maui > > and thus > > takes a hit of around 5-10% on the utilization. This is likely to > > worsen as > > we move towards more cores per node.) > > > > However, some testing revealed a very serious issue: If a job > > passes its > > > > memory_pressure limits it will be killed no matter if it is > > overallocation its > > memory or not. So if you allow multiple jobs from multiple users to > > run on a > > compute node you can get into scenarios where a well-behaving job > > from userA > > is killed because userB did something stupid. Not a good situation. > > > > It > > > > should be fairly simple to check the memory consumption of the job > > before > > deciding to take action. > > I'm not certain if a well-behaving job would get killed or not. I think > it'd be good to run some tests and see how it happens in practice, > although I certainly see the possibility of a well-behaving job also > getting killed. > > > One could of course rely on oom-kill to take action, but that doesn't > > come > > into play until all the swap is consumed. That is too late for HPC > > uses, > > right? > > Other users/customers have found the OOM to come into play too late to help > them. I have indeed tested the memory_pressure functionality and it behaves as I described. 
I ran two jobs using the stress application from http://weather.ou.edu/~apw/projects/stress/ (we use it a lot to test for faulty DIMMs). The compute node had 2GB RAM and 1GB swap.

first job:
    stress -m 1 --vm-bytes 1000M
second job:
    stress -m 1 --vm-bytes 1500M

When the second job started, both jobs passed their memory_pressure limit and were killed by their respective moms after the prescribed grace period.

I can see two immediate solutions to this problem:

1. Check the job's RSS against the prescribed pmem or mem value and kill it only if it has violated the limit.
2. Trigger a user-definable script and leave it to the script to take the appropriate action.

Both have pros and cons, and there might be better solutions. Any thoughts?

r.

From cedric.trivino at c-s.fr Fri Nov 4 01:59:11 2011
From: cedric.trivino at c-s.fr (TRIVINO Cédric)
Date: Fri, 04 Nov 2011 08:59:11 +0100
Subject: [torqueusers] Exit_status=271
Message-ID: <20111104085911.87851ysv9x3hj4nj@messagerie.si.c-s.fr>

Hi all,

We currently use torque 2.4.12 and maui 3.2.6p21 on a cluster.

A job that had been running for several hours (20 nodes and ppn=4) was deleted at the request of maui. Here is part of the torque log:

10/25/2011 11:08:46;0008;PBS_Server;Job;269580.;Job deleted at request of maui@
10/25/2011 11:08:46;0008;PBS_Server;Job;269580.;Job sent signal SIGTERM on delete
10/25/2011 11:08:47;0009;PBS_Server;Job;269580.;job exit status 271 handled
10/25/2011 11:08:48;0010;PBS_Server;Job;269580.;Exit_status=271 resources_used.cput=371:25:04 resources_used.mem=589796kb resources_used.vmem=12901480kb resources_used.walltime=114:25:11

I'm wondering what could be the cause of this exit status 271. The only explanation I found was "More RAM than asked for or over allocated CPU time are the usual reasons". This doesn't seem to be the reason here.

Any idea?

Regards,
Cédric.
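For readers who want to pull the status out of records like the ones quoted above, here is a minimal sketch in shell. It assumes the record layout is exactly as shown in this message; the function name extract_exit_status is hypothetical and is not part of TORQUE.

```shell
#!/bin/bash
# Sketch only: extract_exit_status is a hypothetical helper, not a TORQUE tool.
# It reads log records on stdin and prints the number that follows
# "Exit_status=" for every record that carries one; other records print nothing.
extract_exit_status() {
    sed -n 's/.*Exit_status=\(-\{0,1\}[0-9][0-9]*\).*/\1/p'
}

# Sample record copied from the log excerpt in this message.
record='10/25/2011 11:08:48;0010;PBS_Server;Job;269580.;Exit_status=271 resources_used.cput=371:25:04 resources_used.mem=589796kb resources_used.vmem=12901480kb resources_used.walltime=114:25:11'

printf '%s\n' "$record" | extract_exit_status    # prints: 271
```

The same function could be fed a whole log file on stdin; which log files carry these records depends on the site's setup.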
---------------------------------------------------------------- This message was sent using IMP, the Internet Messaging Program. From dbeer at adaptivecomputing.com Fri Nov 4 09:18:23 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Fri, 04 Nov 2011 09:18:23 -0600 (MDT) Subject: [torqueusers] Exit_status=271 In-Reply-To: <20111104085911.87851ysv9x3hj4nj@messagerie.si.c-s.fr> Message-ID: <6d9ab215-19cf-4f5b-98dc-89725f0b1445@mail> ----- Original Message ----- > Hi at all, > > We currently use torque 2.4.12 and maui 3.2.6p21 on a cluster. > > A job that was running for several hours has been deleted at the > request of maui (20 nodes and ppn=4). Here is a part of the torque's > log : > > 10/25/2011 11:08:46;0008;PBS_Server;Job;269580.;Job deleted at > request > of maui@ > 10/25/2011 11:08:46;0008;PBS_Server;Job;269580.;Job sent signal > SIGTERM on delete > 10/25/2011 11:08:47;0009;PBS_Server;Job;269580.;job exit status 271 > handled > 10/25/2011 11:08:48;0010;PBS_Server;Job;269580.;Exit_status=271 > resources_used.cput=371:25:04 resources_used.mem=589796kb > resources_used.vmem=12901480kb resources_used.walltime=114:25:11 > > > I'm wondering what could be the cause of this exit status 271. > The only causes i found were "More RAM than asked for or over > allocated CPU time are the usual reasons". > This doesn't seem to be the reason here. > > Any idea? > > Regards, > C?dric. This is because the job is being killed by signal 15, its an oddity in linux. -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From lloyd_brown at byu.edu Fri Nov 4 09:34:59 2011 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Fri, 04 Nov 2011 09:34:59 -0600 Subject: [torqueusers] Exit_status=271 In-Reply-To: <6d9ab215-19cf-4f5b-98dc-89725f0b1445@mail> References: <6d9ab215-19cf-4f5b-98dc-89725f0b1445@mail> Message-ID: <4EB40623.7020706@byu.edu> I know that exit status gets offset by some number (128? 
256?), but it's not clear to me whether there is a correlation between the signal number (SIGTERM, or signal 15), and the program's exit status. If a program that is killed by signal 15 returns an exit code of 15, and if the offset is 256, that would explain the exit code you see of 271 (256+15). From the snippet of logs, it looks like Maui decided somehow to delete the job. SIGTERM (15) is the first signal that Torque sends to the job's process; if it fails to exit in a short period, it then sends SIGKILL (9), which can't be caught/ignored. We sometimes have users catch TERM in their job script, and do some cleanup. I'd look into why Maui decided to delete it, if I were you. That's likely the root of the problem. Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 11/04/2011 09:18 AM, David Beer wrote: > > > ----- Original Message ----- >> Hi at all, >> >> We currently use torque 2.4.12 and maui 3.2.6p21 on a cluster. >> >> A job that was running for several hours has been deleted at the >> request of maui (20 nodes and ppn=4). Here is a part of the torque's >> log : >> >> 10/25/2011 11:08:46;0008;PBS_Server;Job;269580.;Job deleted at >> request >> of maui@ >> 10/25/2011 11:08:46;0008;PBS_Server;Job;269580.;Job sent signal >> SIGTERM on delete >> 10/25/2011 11:08:47;0009;PBS_Server;Job;269580.;job exit status 271 >> handled >> 10/25/2011 11:08:48;0010;PBS_Server;Job;269580.;Exit_status=271 >> resources_used.cput=371:25:04 resources_used.mem=589796kb >> resources_used.vmem=12901480kb resources_used.walltime=114:25:11 >> >> >> I'm wondering what could be the cause of this exit status 271. >> The only causes i found were "More RAM than asked for or over >> allocated CPU time are the usual reasons". >> This doesn't seem to be the reason here. >> >> Any idea? >> >> Regards, >> Cédric. > > This is because the job is being killed by signal 15, its an oddity in linux. 
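Lloyd's arithmetic is easy to check with a few lines of shell. This is only a sketch: it assumes, as he guesses above, that a job killed by a signal is reported as 256 plus the signal number, and decode_exit_status is a made-up helper name, not a Torque command.

```shell
#!/bin/sh
# Sketch: split a Torque Exit_status into "signal N" or "exit code N".
# ASSUMPTION: values of 256 and above mean "killed by signal (value - 256)".
# That is Lloyd's guess in this thread, not something confirmed here.
# decode_exit_status is a hypothetical helper, not part of Torque.
decode_exit_status() {
    status=$1
    if [ "$status" -ge 256 ]; then
        echo "signal $((status - 256))"
    else
        echo "exit code $status"
    fi
}

decode_exit_status 271   # prints "signal 15", i.e. SIGTERM
decode_exit_status 0     # prints "exit code 0"
```

Read that way, the 271 in the log is just the SIGTERM that pbs_server reports sending on delete, which leaves the question of why Maui asked for the delete in the first place.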
> From knielson at adaptivecomputing.com Sat Nov 5 20:08:27 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Sat, 05 Nov 2011 20:08:27 -0600 (MDT) Subject: [torqueusers] TORQUE 2.5.9 released In-Reply-To: Message-ID: Hi all, TORQUE 2.5.9 is now available for download. TORQUE 2.5.9 contains several bug fixes. Following is a list of the notable ones. Added function DIS_tcp_close which frees buffer memory used for sending and receiving tcp data. This reduces the running memory size of TORQUE. Fix for a server seg-fault when using record_job_info. Fix for afteranyarray and afterokarray where dependent jobs would not run after the dependent array requirements were satisfied. Fix to delete .AR array files from the $TORQUE_HOME/server_priv/arrays directory. Fix to recover previous state of job arrays between restarts of pbs_server. Fix to prevent the server from hanging when moving jobs from one server to another server. Fix to stop a segfault if using munge and the munge daemon was not running. Security fix to munge authorization to prevent users from gaining access to TORQUE when munge was not running. Fix to allow pam_pbssimpleauth to work properly. A new torque.cfg option named TRQ_IFNAME was added. This option allows the administrator to select the outbound tcp interface by interface name for qsub commands. To see a complete list of changes please see the CHANGELOG. TORQUE 2.5.9 can be downloaded from http://www.clusterresources.com/downloads/torque/torque-2.5.9.tar.gz Thanks to everyone who contributed to this release. 
Regards Ken Nielson Adaptive Computing From dmneal at wand.net.nz Sun Nov 6 17:59:50 2011 From: dmneal at wand.net.nz (Donald Neal) Date: Mon, 07 Nov 2011 13:59:50 +1300 Subject: [torqueusers] Torque 2.5.9 and not finding pbs_iff Message-ID: <4EB72D86.2090409@wand.net.nz> I've built Torque 2.5.9 using ./configure --libdir=/usr/local/lib --localstatedir=/var --enable-dependency-tracking --with-gnu-ld --with-default-server=symphony.waikato.ac.nz --enable-blcr --with-blcr=/opt/blcr-0.8.4 --enable-cpuset --disable-gui --disable-rpp --enable-server-xml --with-servchkptdir=/scratch-head/checkpoint On replacing the existing version 2.5.7 server install with this and restarting pbs_server, any attempt to communicate with the server process using, say, qsub or qstat, gives an error message like this one: $ qstat -t pbs_iff command not found. qstat: cannot connect to server symphony (errno=15008) pbs_iff command not found, unable to authenticate Now, "which pbs_iff" dutifully reports that the file to be used is /usr/bin/pbs_iff, and yes, privileges on that file are 4755. More interesting, if the user's PATH is set so that /usr/bin is the first element, the problem goes away. The problem comes back if another element is placed in the PATH before /usr/bin, apparently regardless of what that element is. This suggests I've missed something interesting about the way setuid operates in Debian Linux stable. Is there a known change between Torque 2.5.7 and 2.5.9 which could do something relevant to this? - Donald Neal -- Donald Neal |"So, um, nothing rhymes with |infrastructure." 
- Davis McAlary High Performance Computing | The University of Waikato | From arnaubria at pic.es Mon Nov 7 07:43:01 2011 From: arnaubria at pic.es (Arnau Bria) Date: Mon, 7 Nov 2011 15:43:01 +0100 Subject: [torqueusers] Exit_status=271 In-Reply-To: <4EB40623.7020706@byu.edu> References: <6d9ab215-19cf-4f5b-98dc-89725f0b1445@mail> <4EB40623.7020706@byu.edu> Message-ID: <20111107154301.1278ba40@amarrosa.pic.es> Too much time running... Cheers, Arnau From knielson at adaptivecomputing.com Mon Nov 7 09:12:39 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 07 Nov 2011 09:12:39 -0700 (MST) Subject: [torqueusers] Torque 2.5.9 and not finding pbs_iff In-Reply-To: <4EB72D86.2090409@wand.net.nz> Message-ID: <0eeb4419-a919-471c-b454-c61010e3263b@mail> Donald, How did you install TORQUE? Did you use RPMs? Ken ----- Original Message ----- > From: "Donald Neal" > To: torqueusers at supercluster.org > Sent: Sunday, November 6, 2011 5:59:50 PM > Subject: [torqueusers] Torque 2.5.9 and not finding pbs_iff > > I've built Torque 2.5.9 using > > ./configure --libdir=/usr/local/lib --localstatedir=/var > --enable-dependency-tracking --with-gnu-ld > --with-default-server=symphony.waikato.ac.nz --enable-blcr > --with-blcr=/opt/blcr-0.8.4 --enable-cpuset --disable-gui > --disable-rpp > --enable-server-xml --with-servchkptdir=/scratch-head/checkpoint > > On replacing the existing version 2.5.7 server install with this and > restarting pbs_server, any attempt to communicate with the server > process using, say, qsub or qstat, gives an error message like this > one: > > $ qstat -t > pbs_iff command not found. > qstat: cannot connect to server symphony (errno=15008) pbs_iff > command > not found, unable to authenticate > > Now, "which pbs_iff" dutifully reports that the file to be used is > /usr/bin/pbs_iff, and yes, privileges on that file are 4755. More > interesting, if the user's PATH is set so that /usr/bin is the first > element, the problem goes away. 
The problem comes back if another > element is placed in the PATH before /usr/bin, apparently regardless > of > what that element is. > > This suggests I've missed something interesting about the way setuid > operates in Debian Linux stable. Is there a known change between > Torque > 2.5.7 and 2.5.9 which could do something relevant to this? > > - Donald Neal > > -- > Donald Neal |"So, um, nothing rhymes with > |infrastructure." - Davis McAlary > High Performance Computing | > The University of Waikato | > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From dmneal at wand.net.nz Mon Nov 7 14:48:00 2011 From: dmneal at wand.net.nz (Donald Neal) Date: Tue, 08 Nov 2011 10:48:00 +1300 Subject: [torqueusers] Torque 2.5.9 and not finding pbs_iff Message-ID: <4EB85210.7070302@wand.net.nz> I've built Torque 2.5.9 using ./configure --libdir=/usr/local/lib --localstatedir=/var --enable-dependency-tracking --with-gnu-ld --with-default-server=symphony.waikato.ac.nz --enable-blcr --with-blcr=/opt/blcr-0.8.4 --enable-cpuset --disable-gui --disable-rpp --enable-server-xml --with-servchkptdir=/scratch-head/checkpoint This is building from source on Debian stable ("squeeze") using the default compiler, gcc 4.4.5. - Donald Neal -- Donald Neal |"So, um, nothing rhymes with |infrastructure." 
- Davis McAlary High Performance Computing | The University of Waikato | From knielson at adaptivecomputing.com Mon Nov 7 15:04:17 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Mon, 07 Nov 2011 15:04:17 -0700 (MST) Subject: [torqueusers] Torque 2.5.9 and not finding pbs_iff In-Reply-To: <4EB85210.7070302@wand.net.nz> Message-ID: ----- Original Message ----- > From: "Donald Neal" > To: torqueusers at supercluster.org > Sent: Monday, November 7, 2011 2:48:00 PM > Subject: Re: [torqueusers] Torque 2.5.9 and not finding pbs_iff > > I've built Torque 2.5.9 using > > ./configure --libdir=/usr/local/lib --localstatedir=/var > --enable-dependency-tracking --with-gnu-ld > --with-default-server=symphony.waikato.ac.nz --enable-blcr > --with-blcr=/opt/blcr-0.8.4 --enable-cpuset --disable-gui > --disable-rpp > --enable-server-xml --with-servchkptdir=/scratch-head/checkpoint > > This is building from source on Debian stable ("squeeze") using ther > default compiler, gcc 4.4.5. > > - Donald Neal > > -- > Donald Neal |"So, um, nothing rhymes with > |infrastructure." - Davis McAlary > High Performance Computing | > The University of Waikato | > David, I have double checked the tar ball for 2.5.9 and the pbs_iff code is there. I deleted all of the files from /usr/local/sbin where pbs_iff resides and did a make install. pbs_iff was installed. Did you use make packages? Ken From dmneal at wand.net.nz Mon Nov 7 15:59:15 2011 From: dmneal at wand.net.nz (Donald Neal) Date: Tue, 08 Nov 2011 11:59:15 +1300 Subject: [torqueusers] Torque 2.5.9 and not finding pbs_iff Message-ID: <4EB862C3.1080101@wand.net.nz> Ken, Yes, pbs_iff is definitely present. Please note that I originally wrote: > Now, "which pbs_iff" dutifully reports that the file to be used is /usr/bin/pbs_iff, and yes, privileges on that file are 4755. More interesting, if the user's PATH is set so that /usr/bin is the first element, the problem goes away. 
The problem comes back if another element is placed in the PATH before /usr/bin, apparently regardless of what that element is. > > This suggests I've missed something interesting about the way setuid operates in Debian Linux stable. Is there a known change between Torque 2.5.7 and 2.5.9 which could do something relevant to this? - Donald Neal -- Donald Neal |"So, um, nothing rhymes with |infrastructure." - Davis McAlary High Performance Computing | The University of Waikato | From zhaohscas at yahoo.com.cn Tue Nov 8 08:05:56 2011 From: zhaohscas at yahoo.com.cn (Hongsheng Zhao) Date: Tue, 08 Nov 2011 23:05:56 +0800 Subject: [torqueusers] The issues on using -o option. Message-ID: <4EB94554.1050208@yahoo.com.cn> Hi all, From the following webpage, http://www.clusterresources.com/torquedocs/commands/qsub.shtml I can find the following notes: ------------- -o path Defines the path to be used for the standard output stream of the batch job. The path argument is of the form: [hostname:]path_name ------------- Accordingly, in my pbs script, I use the following line: ---------- #PBS -o out ---------- Then, in the result folder of my job, I obtained a file named out with the following file permissions: zhaohongsheng at node32:~/Castep_test> ls -la out -rw------- 1 zhaohongsheng users 0 2011-12-13 11:32 out As you can see, the file permission is 600 for this file. Why does this happen? I think it should be 644 for convenience. Any hints on this issue? Thanks in advance. Regards -- Hongsheng Zhao School of Physics and Electrical Information Science, Ningxia University, Yinchuan 750021, China From sheen at usc.edu Tue Nov 8 08:22:32 2011 From: sheen at usc.edu (David Sheen) Date: Tue, 8 Nov 2011 10:22:32 -0500 Subject: [torqueusers] The issues on using -o option. In-Reply-To: <4EB94554.1050208@yahoo.com.cn> References: <4EB94554.1050208@yahoo.com.cn> Message-ID: Is this an issue with your umask settings rather than with torque? 
On Tue, Nov 8, 2011 at 10:05 AM, Hongsheng Zhao wrote: > Hi all, > > From the following webpage, > > http://www.clusterresources.com/torquedocs/commands/qsub.shtml > > I can find the following notes: > > ------------- > -o path > Defines the path to be used for the standard output stream of the > batch job. The path argument is of the form: > > [hostname:]path_name > ------------- > > Accordingly, in my pbs script, I use the following line: > > ---------- > #PBS -o out > ---------- > > Then, in the result folder of my job, I obtained a file named out with > the following file permissions: > > zhaohongsheng at node32:~/Castep_test> ls -la out > -rw------- 1 zhaohongsheng users 0 2011-12-13 11:32 out > > As you can see, the file permission is 600 for this file. Why does this > happen? I think it should be 644 for convenience. Any hints on this > issue? Thanks in advance. > > Regards > -- > Hongsheng Zhao > School of Physics and Electrical Information Science, > Ningxia University, Yinchuan 750021, China > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111108/add70aa4/attachment.html From akohlmey at cmm.chem.upenn.edu Tue Nov 8 09:04:55 2011 From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer) Date: Tue, 8 Nov 2011 11:04:55 -0500 Subject: [torqueusers] The issues on using -o option. In-Reply-To: <4EB94554.1050208@yahoo.com.cn> References: <4EB94554.1050208@yahoo.com.cn> Message-ID: On Tue, Nov 8, 2011 at 10:05 AM, Hongsheng Zhao wrote: > Hi all, > > ?From the following webpage, > > http://www.clusterresources.com/torquedocs/commands/qsub.shtml > > I can find the following notes: > > ------------- > -o path > ? ? Defines the path to be used for the standard output stream of the > batch job. 
The path argument is of the form: > > [hostname:]path_name > ------------- > > Accordingly, in my pbs script, I use the following line: > > ---------- > #PBS -o out > ---------- > > Then, in the result folder of my job, I obtained a file named out with > the following file permissions: > > zhaohongsheng at node32:~/Castep_test> ls -la out > -rw------- 1 zhaohongsheng users 0 2011-12-13 11:32 out > > As you can see, the file permission is 600 for this file. Why does this > happen? I think it should be 644 for convenience. Any hints on this > issue? Thanks in advance. Not everybody likes god and the world to read their outputs, so this is a reasonable default. But since you set the name of your output, nothing keeps you from using "chmod 0644 out" in your submit script to change the permissions to what you like them to be. axel. > > Regards > -- > Hongsheng Zhao > School of Physics and Electrical Information Science, > Ningxia University, Yinchuan 750021, China > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Dr. Axel Kohlmeyer akohlmey at gmail.com http://sites.google.com/site/akohlmey/ Institute for Computational Molecular Science Temple University, Philadelphia PA, USA. From knielson at adaptivecomputing.com Tue Nov 8 09:27:41 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Tue, 08 Nov 2011 09:27:41 -0700 (MST) Subject: [torqueusers] The issues on using -o option. In-Reply-To: <4EB94554.1050208@yahoo.com.cn> Message-ID: <18d4d467-cc48-4e08-8964-42d37015fab1@mail> ----- Original Message ----- > From: "Hongsheng Zhao" > To: torqueusers at supercluster.org > Sent: Tuesday, November 8, 2011 8:05:56 AM > Subject: [torqueusers] The issues on using -o option. 
> > Hi all, > > From the following webpage, > > http://www.clusterresources.com/torquedocs/commands/qsub.shtml > > I can find the following notes: > > ------------- > -o path > Defines the path to be used for the standard output stream of > the > batch job. The path argument is of the form: > > [hostname:]path_name > ------------- > > Accordingly, in my pbs script, I use the following line: > > ---------- > #PBS -o out > ---------- > > Then, in the result folder of my job, I obtained a file named out > with > the following file permissions: > > zhaohongsheng at node32:~/Castep_test> ls -la out > -rw------- 1 zhaohongsheng users 0 2011-12-13 11:32 out > > As you can see, the file permission is 600 for this file. Why does > this > happen? I think it should be 644 for convenience. Any hints on this > issue? Thanks in advance. > > Regards > -- > Hongsheng Zhao > School of Physics and Electrical Information Science, > Ningxia University, Yinchuan 750021, China > _______________________________________________ Honsheng, You are right. That is the permission being set. However, the file is owned by the user who was running the job. It is appropriate that only that user have access to the data. Regards Ken From dbeer at adaptivecomputing.com Tue Nov 8 09:28:06 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Tue, 08 Nov 2011 09:28:06 -0700 (MST) Subject: [torqueusers] The issues on using -o option. In-Reply-To: Message-ID: ----- Original Message ----- > On Tue, Nov 8, 2011 at 10:05 AM, Hongsheng Zhao > wrote: > > Hi all, > > > > ?From the following webpage, > > > > http://www.clusterresources.com/torquedocs/commands/qsub.shtml > > > > I can find the following notes: > > > > ------------- > > -o path > > ? ? Defines the path to be used for the standard output stream of > > ? ? the > > batch job. The path argument is of the form: > > > > ? ? ? ? 
[hostname:]path_name > > ------------- > > > > Accordingly, in my pbs script, I use the following line: > > > > ---------- > > ?#PBS -o out > > ---------- > > > > Then, in the result folder of my job, I obtained a file named out > > with > > the following file permissions: > > > > zhaohongsheng at node32:~/Castep_test> ls -la out > > -rw------- 1 zhaohongsheng users 0 2011-12-13 11:32 out > > > > As you can see, the file permission is 600 for this file. ?Why does > > this > > happen? ?I think it should be 644 for convenience. ?Any hints on > > this > > issue? ?Thanks in advance. > > not everybody likes god and the world to read their outputs, > so this is a reasonable default. > > but since you set the name of your output, nothing keeps > you from using "chmod 0644 out" in you submit script to > change the permissions to what you like them to be. > > axel. > > > > > > Regards > > -- > > Hongsheng Zhao > > School of Physics and Electrical Information Science, > > Ningxia University, Yinchuan 750021, China > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > -- > Dr. Axel Kohlmeyer? ? akohlmey at gmail.com > http://sites.google.com/site/akohlmey/ > > Institute for Computational Molecular Science > Temple University, Philadelphia PA, USA. 
> _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > You can set a different default permissions using $job_output_file_umask, which is documented on this page http://www.adaptivecomputing.com/resources/docs/torque/a.cmomconfig.php Cheers, -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From zhaohscas at yahoo.com.cn Tue Nov 8 20:49:55 2011 From: zhaohscas at yahoo.com.cn (Hongsheng Zhao) Date: Wed, 09 Nov 2011 11:49:55 +0800 Subject: [torqueusers] The issues on using -o option. In-Reply-To: References: Message-ID: <4EB9F863.6050804@yahoo.com.cn> On 11/09/2011 12:28 AM, David Beer wrote: > > > ----- Original Message ----- >> On Tue, Nov 8, 2011 at 10:05 AM, Hongsheng Zhao >> wrote: >>> Hi all, >>> >>> From the following webpage, >>> >>> http://www.clusterresources.com/torquedocs/commands/qsub.shtml >>> >>> I can find the following notes: >>> >>> ------------- >>> -o path >>> Defines the path to be used for the standard output stream of >>> the >>> batch job. The path argument is of the form: >>> >>> [hostname:]path_name >>> ------------- >>> >>> Accordingly, in my pbs script, I use the following line: >>> >>> ---------- >>> #PBS -o out >>> ---------- >>> >>> Then, in the result folder of my job, I obtained a file named out >>> with >>> the following file permissions: >>> >>> zhaohongsheng at node32:~/Castep_test> ls -la out >>> -rw------- 1 zhaohongsheng users 0 2011-12-13 11:32 out >>> >>> As you can see, the file permission is 600 for this file. Why does >>> this >>> happen? I think it should be 644 for convenience. Any hints on >>> this >>> issue? Thanks in advance. >> >> not everybody likes god and the world to read their outputs, >> so this is a reasonable default. 
>> >> but since you set the name of your output, nothing keeps >> you from using "chmod 0644 out" in you submit script to >> change the permissions to what you like them to be. >> >> axel. >> >> >>> >>> Regards >>> -- >>> Hongsheng Zhao >>> School of Physics and Electrical Information Science, >>> Ningxia University, Yinchuan 750021, China >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >> >> >> >> -- >> Dr. Axel Kohlmeyer akohlmey at gmail.com >> http://sites.google.com/site/akohlmey/ >> >> Institute for Computational Molecular Science >> Temple University, Philadelphia PA, USA. >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > You can set a different default permissions using $job_output_file_umask, which is documented on this page http://www.adaptivecomputing.com/resources/docs/torque/a.cmomconfig.php Thanks a lot, I've read the following notes for the $job_output_file_umask parameter on the webpage you mentioned above: ---------- $job_output_file_umask Format: Description: uses the specified umask when creating job output and error files. Values can be specified in base 8, 10, or 16; leading 0 implies octal and leading 0x or 0X hexadecimal. A value of "userdefault" will use the user's default umask. This parameter is in version 2.3.0 and later. Example: $job_output_file_umask 027 ----------- But I'm new to torque, and I can't figure out how I should use this parameter in my case. 
I've tried using the following line in my pbs script: $job_output_file_umask 027 But I then found the following content in the resulting out file: ------------ zhaohongsheng at node32:~/Castep_test> cat out /opt/gridview/pbs/dispatcher/mom_priv/jobs/467.node32.nxu.edu.cn.SC: line 14: 027: command not found ------------- Could you please give me some hints? Regards -- Hongsheng Zhao School of Physics and Electrical Information Science, Ningxia University, Yinchuan 750021, China From gabe at msi.umn.edu Tue Nov 8 21:57:59 2011 From: gabe at msi.umn.edu (Gabe Turner) Date: Tue, 8 Nov 2011 22:57:59 -0600 Subject: [torqueusers] The issues on using -o option. In-Reply-To: <4EB9F863.6050804@yahoo.com.cn> References: <4EB9F863.6050804@yahoo.com.cn> Message-ID: <20111109045758.GK29886@blackice.msi.umn.edu> On Wed, Nov 09, 2011 at 11:49:55AM +0800, Hongsheng Zhao wrote: > > You can set a different default permissions using $job_output_file_umask, which is documented on this page http://www.adaptivecomputing.com/resources/docs/torque/a.cmomconfig.php > > Thanks a lot, I've read the following notes for the above > $job_output_file_umask parameter from the webpage you mentioned above: > > ---------- > $job_output_file_umask > Format: > Description: uses the specified umask when creating job output and > error files. Values can be specified in base 8, 10, or 16; leading 0 > implies octal and leading 0x or 0X hexadecimal. A value of "userdefault" > will use the user's default umask. This parameter is in version 2.3.0 > and later. > Example: $job_output_file_umask 027 > ----------- > > But, anyway, I'm a newbie of torque. I cann't figure out how should I > use the this command for my case. > > I've tried use the following line in my pbs script: > > $job_output_file_umask 027 > You need to put it in each mom's config. It's usually at /var/spool/torque/mom_priv/config on each compute node. 
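On which value to pick: a umask lists the permission bits to clear, so the documentation's example value 027 would give 640 rather than the 644 asked for in this thread. The sketch below works that out. It assumes the output file is created with a requested mode of 0666 (the usual convention for regular files, not checked against the pbs_mom source), and mode_for_umask is a made-up helper name.

```shell
#!/bin/sh
# Sketch: which file mode results from a given umask?
# ASSUMPTION: the file is created with a requested mode of 0666, the usual
# default for regular files; this is not verified against pbs_mom's source.
# mode_for_umask is a hypothetical helper, not a Torque command.
mode_for_umask() {
    # Clear from 0666 every bit that is set in the umask, print in octal.
    printf '%03o\n' $(( 0666 & ~$1 ))
}

mode_for_umask 077   # 600, the mode seen on "out" in this thread
mode_for_umask 027   # 640, the documentation's example value
mode_for_umask 022   # 644, the mode asked for

# The matching line for mom_priv/config on each compute node would then be:
#   $job_output_file_umask 022
```

The error message quoted earlier in this thread shows the job script under /opt/gridview/pbs/dispatcher/mom_priv/jobs, so on that cluster the config file is probably /opt/gridview/pbs/dispatcher/mom_priv/config. pbs_mom will most likely need a restart before it picks up the change.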
HTH, -- Gabe Turner gabe at msi.umn.edu HPC Systems Administrator, University of Minnesota Supercomputing Institute http://www.msi.umn.edu From zhaohscas at yahoo.com.cn Wed Nov 9 00:05:12 2011 From: zhaohscas at yahoo.com.cn (Hongsheng Zhao) Date: Wed, 09 Nov 2011 15:05:12 +0800 Subject: [torqueusers] The issues on using -o option. In-Reply-To: <20111109045758.GK29886@blackice.msi.umn.edu> References: <4EB9F863.6050804@yahoo.com.cn> <20111109045758.GK29886@blackice.msi.umn.edu> Message-ID: <4EBA2628.5090109@yahoo.com.cn> On 11/09/2011 12:57 PM, Gabe Turner wrote: > You need to put it in each mom's config. It's usually at > /var/spool/torque/mom_priv/config on each compute node. I use suse, see the following for detail: ----- zhaohongsheng at node32:~/Castep_test> cat /etc/issue Welcome to SUSE Linux Enterprise Server 10 SP2 (x86_64) - Kernel \r (\l). -------- There aren't the /var/spool/torque and /var/spool/torque/mom_priv/ directories at all in my case. Any hints? Regards -- Hongsheng Zhao School of Physics and Electrical Information Science, Ningxia University, Yinchuan 750021, China From gabe at msi.umn.edu Wed Nov 9 08:24:31 2011 From: gabe at msi.umn.edu (Gabe Turner) Date: Wed, 9 Nov 2011 09:24:31 -0600 Subject: [torqueusers] The issues on using -o option. In-Reply-To: <4EBA2628.5090109@yahoo.com.cn> References: <4EB9F863.6050804@yahoo.com.cn> <20111109045758.GK29886@blackice.msi.umn.edu> <4EBA2628.5090109@yahoo.com.cn> Message-ID: <20111109152431.GA18380@blackice.msi.umn.edu> On Wed, Nov 09, 2011 at 03:05:12PM +0800, Hongsheng Zhao wrote: > On 11/09/2011 12:57 PM, Gabe Turner wrote: > > You need to put it in each mom's config. It's usually at > > /var/spool/torque/mom_priv/config on each compute node. > > I use suse, see the following for detail: > > ----- > zhaohongsheng at node32:~/Castep_test> cat /etc/issue > > Welcome to SUSE Linux Enterprise Server 10 SP2 (x86_64) - Kernel \r (\l). 
> -------- > > There aren't the /var/spool/torque and /var/spool/torque/mom_priv/ > directories at all in my case. Any hints? > The spool directory could be in another location. If there is a torque-mom rpm, see the output of 'rpm -ql torque-mom' and look for the mom_priv directory. It's usually somewhere in /var/spool, in my experience, unless the configured paths are extremely atypical. -- Gabe Turner gabe at msi.umn.edu HPC Systems Administrator, University of Minnesota Supercomputing Institute http://www.msi.umn.edu From knielson at adaptivecomputing.com Wed Nov 9 17:08:32 2011 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Wed, 09 Nov 2011 17:08:32 -0700 (MST) Subject: [torqueusers] Fwd: [torquedev] TORQUE dinner at SuperComputing In-Reply-To: Message-ID: <470f02dc-f4ad-468a-81ab-6fc139548c7b@mail> ----- Forwarded Message ----- From: "Ken Nielson" To: "Torque Developers mailing list" Sent: Monday, November 7, 2011 10:33:03 AM Subject: [torquedev] TORQUE dinner at SuperComputing Hi all, I am hoping to continue a tradition of meeting with TORQUE developers and users for dinner while at SuperComputing this year. Is there anyone interested? Ken _______________________________________________ torquedev mailing list torquedev at supercluster.org http://www.supercluster.org/mailman/listinfo/torquedev
From miguel.gila at cscs.ch Thu Nov 10 08:37:48 2011 From: miguel.gila at cscs.ch (Gila Arrondo Miguel Angel) Date: Thu, 10 Nov 2011 15:37:48 +0000 Subject: [torqueusers] Random SCP errors when transfering to/from CREAM sandbox Message-ID: <74E21B4E-4520-4BD9-A4BB-5D7CF8DC0BE4@cscs.ch> Hi all, We are seeing a lot of "pbs_mom: scp" transfer errors in our /var/log/messages, but the files mentioned in these errors are there and are accessible. This is an example of the error: wn113: Nov 9 14:54:57 wn113 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB out_cre02_208919067_StandardOutput cms079 at cream02.lcg.cscs.ch:/cream_localsandbox/data/cms/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_eaguiloc_CN_555092_CN_Ernest_Aguilo_Chivite_cms_Role_NULL_Capability_NULL_cms079/20/CREAM208919067/StandardOutput' failed with status=1, giving up after 4 attempts These kinds of errors happen every day without any specific correlation to cron jobs or any other cfengine tasks done on a regular schedule. Here is a summary of them across all the WNs in our cluster. Total: 12 in cream02 on Nov 10 Total: 2 in cream01 on Nov 10 Total: 72 in cream02 on Nov 09 Total: 74 in cream01 on Nov 09 Total: 52 in cream02 on Nov 08 Total: 2 in cream01 on Nov 08 Total: 212 in cream02 on Nov 07 Total: 36 in cream01 on Nov 07 Total: 1240 in cream02 on Nov 06 Total: 465 in cream01 on Nov 06 At the moment there are two CREAM-CEs (the endpoint host of these scp transfers), one is a VM (cream01) and the other is a physical machine (cream02), each with its own local cream_sandbox directory (endpoint location of the scp transfers) and enough computing power to handle ssh connections and all the rest of the CREAM services. 
Initially we had the cream_sandbox shared through a Lustre filesystem, but since it was unreliable and became very slow at times (jobs ran there), we decided to move it to the local disk. These issues did not happen before: since the sandbox was shared, we used regular $usecp. We are aware that you can tune this with the directive $rcpcmd in the config file of pbs_mom, but since we are not sure what the error may be, we don't know what to change in the settings. The value of MaxStartups in /etc/ssh/sshd_config is set to 20000: > MaxStartups 20000 We've checked /var/log/secure for scp errors, but everything seems to be ok there. Any idea what could be wrong? Thanks in advance, Miguel -- Miguel Gila CSCS Swiss National Supercomputing Centre HPC Solutions Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22 From kenneth at sdsc.edu Thu Nov 10 16:55:44 2011 From: kenneth at sdsc.edu (Kenneth Yoshimoto) Date: Thu, 10 Nov 2011 15:55:44 -0800 (PST) Subject: [torqueusers] prologue.parallel? Message-ID: I'm running Torque 3.0.2. I'm trying to use a prologue.parallel, but it doesn't seem to ever get invoked. Tried looking in Torque src, but I don't see where it gets started. Where is the code that starts it? 
Thanks,
Kenneth

From r.oostenveld at donders.ru.nl Fri Nov 11 01:41:43 2011
From: r.oostenveld at donders.ru.nl (Robert Oostenveld)
Date: Fri, 11 Nov 2011 09:41:43 +0100
Subject: [torqueusers] Random SCP errors when transfering to/from CREAM sandbox
In-Reply-To: <74E21B4E-4520-4BD9-A4BB-5D7CF8DC0BE4@cscs.ch>
References: <74E21B4E-4520-4BD9-A4BB-5D7CF8DC0BE4@cscs.ch>
Message-ID: <8643FD43-67BA-4EE0-AA89-08975325E6AD@donders.ru.nl>

Dear Miguel,

On 10 Nov 2011, at 16:37, Gila Arrondo Miguel Angel wrote:
> We are seeing a lot of "pbs_mom: scp" transfer errors in our /var/log/messages, but the files mentioned in these errors are there and are accessible.
>
> This is an example of error:
> ...

I don't know whether it is related, but we have also had a problem with scp, which I tracked down to users having an incorrect (i.e. out-of-date) .ssh/known_hosts file in their NFS-shared home directory. We have many non-torque-cluster Linux computers from which jobs can be submitted, and sometimes these are updated or reinstalled, which invalidates the ssh host key that was previously associated with their IP address. Users who had a correct known_hosts, or who had specified StrictHostKeyChecking=no in their .ssh/config file, did not have any problems, but users with an outdated known_hosts did (and then only when submitting from one of the nodes whose host key had changed).

The consequence was that on the torque execute hosts scp would look in the user's known_hosts and, depending on where the job was submitted from, would find a correct (for some users) or an incorrect (for other users) host key for the submit client.

A possible solution would have been to ensure that every user's known_hosts was correct or that all users had StrictHostKeyChecking=no. But our specific solution was to specify

$usecp *:/home /home

in /var/spool/torque/mom_priv/config, as scp was not needed anyway because of our shared NFS home directory.
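
[Editorial sketch] The $usecp mapping described above could be added to a mom config with a small shell helper like the one below. The helper name add_usecp and the scratch file path are hypothetical and only for illustration; the directive text and the config path are the ones quoted in the message. It only appends the line if it is not already present.

```shell
#!/bin/sh
# Sketch: add a $usecp mapping to a pbs_mom config file if it is missing.
# add_usecp is a hypothetical helper, not part of TORQUE.
add_usecp() {
    config=$1    # e.g. /var/spool/torque/mom_priv/config
    pattern=$2   # host:path pattern, e.g. '*:/home'
    target=$3    # local path on the execution host, e.g. /home
    line="\$usecp $pattern $target"
    # -x matches the whole line, -F treats '$' and '*' literally
    if ! grep -qxF -- "$line" "$config" 2>/dev/null; then
        printf '%s\n' "$line" >> "$config"
    fi
}

# Example against a scratch file rather than the live config:
#   add_usecp /tmp/mom_config.test '*:/home' /home
```

After editing the real config, pbs_mom has to re-read it (for instance by restarting the daemon) before the mapping takes effect.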
best regards,
Robert

-----------------------------------------------------------
Robert Oostenveld, PhD
Senior Researcher & MEG Physicist
Donders Institute for Brain, Cognition and Behaviour
Centre for Cognitive Neuroimaging
Radboud University Nijmegen
tel.: +31 (0)24 3619695
e-mail: r.oostenveld at donders.ru.nl
web: http://www.ru.nl/neuroimaging
skype: r.oostenveld
-----------------------------------------------------------

From knielson at adaptivecomputing.com Fri Nov 11 06:35:32 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 11 Nov 2011 06:35:32 -0700 (MST)
Subject: [torqueusers] TORQUE dinner at SuperComputing
In-Reply-To:
Message-ID:

----- Original Message -----
> From: "Ken Nielson"
> To: "Torque Developers mailing list"
> Sent: Monday, November 7, 2011 10:33:03 AM
> Subject: TORQUE dinner at SuperComputing
>
> Hi all,
>
> I am hoping to continue a tradition of meeting with TORQUE developers
> and users for dinner while at SuperComputing this year. Is there
> anyone interested?
>
> Ken

It looks like Tuesday is the best night, so we will plan on dinner Tuesday night. Is 7:00 too late? Any suggestions of where to eat?

Ken

From JMRUSHTON at qinetiq.com Fri Nov 11 07:24:36 2011
From: JMRUSHTON at qinetiq.com (Rushton Martin)
Date: Fri, 11 Nov 2011 14:24:36 -0000
Subject: [torqueusers] UC prologue.parallel?
In-Reply-To:
References:
Message-ID: <20111111142402.786B383A802B@mail.adaptivecomputing.com>

I had some fun with this a while back. prologue.parallel runs on a sister node, not on the Mother Superior. Its stdout and stderr are NOT connected to the job's stdout and stderr, so any output is lost. You need to route the output to a file and then look at the file to see what is happening. Remember also that prologue.parallel runs as root; you need prologue.user.parallel if you want to run as the user. Watch the permissions also: 500 for the former and 555 for the latter. If any root-run pro/epi has other:w, it will not run.
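
[Editorial sketch] The advice above (send the output to a file, keep the script at mode 500) might look roughly like this. The log path is an assumption, and the script location mentioned afterwards assumes the same mom_priv directory as the config file named elsewhere in this thread. TORQUE passes the job id, user and group as the first three arguments to prologue scripts.

```shell
#!/bin/sh
# Sketch of a prologue.parallel body that logs to a file, because its
# stdout/stderr are not connected to the job's output.
# LOGFILE is an assumed path; choose one that exists on every sister node.
LOGFILE=${LOGFILE:-/tmp/prologue.parallel.log}

log_prologue() {
    # $1 = job id, $2 = job user, $3 = job group
    {
        echo "$(date '+%Y-%m-%d %H:%M:%S') prologue.parallel on $(hostname)"
        echo "  job=$1 user=$2 group=$3"
    } >> "$LOGFILE" 2>&1
}

# Only log when called the way pbs_mom calls it (with arguments).
if [ "$#" -ge 3 ]; then
    log_prologue "$@"
fi
```

Installed as, say, /var/spool/torque/mom_priv/prologue.parallel, it would be owned by root and set to mode 500 (chmod 500), with no write permission for others, as described above.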
Martin Rushton
HPC System Manager, Weapons Technologies
Tel: 01959 514777, Mobile: 07939 219057
email: jmrushton at QinetiQ.com
www.QinetiQ.com
QinetiQ - Delivering customer-focused solutions

From leggett at mcs.anl.gov Fri Nov 11 08:35:45 2011
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Fri, 11 Nov 2011 09:35:45 -0600
Subject: [torqueusers] Running either GPGPU or GL GPU jobs on nodes
Message-ID:

We have NV GPUs and we have some users who want to run GPGPU jobs (like CUDA) and we have other users who want to run GL GPU jobs. GL jobs require the machine to have X started (runlevel 5) and GPGPU jobs can't run when X is running.
Does anyone have a good method of letting users specify which type of GPU job they need to run and changing the runlevel appropriately?

From akohlmey at cmm.chem.upenn.edu Fri Nov 11 09:18:35 2011
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Fri, 11 Nov 2011 11:18:35 -0500
Subject: [torqueusers] Running either GPGPU or GL GPU jobs on nodes
In-Reply-To:
References:
Message-ID:

On Fri, Nov 11, 2011 at 10:35 AM, Ti Leggett wrote:
> We have NV GPUs and we have some users who want to run GPGPU jobs (like CUDA) and we have other users who want to run GL GPU jobs. GL jobs require the machine to have X started (runlevel 5) and GPGPU jobs can't run when X is running. Does anyone have a good method of letting users specify which type of GPU job they need to run and changing the runlevel appropriately?

with nvidia hardware GPGPU jobs _can_ run when X
is running. i am doing that on my desktop all the time.
you may need to tweak the timeout that is set to
keep GPGPU applications from hogging the GPU
when X is running, if your GPGPU users write kernels
that run excessively long. in most cases, that is
just bad program design.

axel.

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
From gus at ldeo.columbia.edu Fri Nov 11 09:46:56 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Fri, 11 Nov 2011 11:46:56 -0500
Subject: [torqueusers] Running either GPGPU or GL GPU jobs on nodes
In-Reply-To:
References:
Message-ID:

On Nov 11, 2011, at 11:18 AM, Axel Kohlmeyer wrote:

> On Fri, Nov 11, 2011 at 10:35 AM, Ti Leggett wrote:
>> We have NV GPUs and we have some users who want to run GPGPU jobs (like CUDA) and we have other users who want to run GL GPU jobs. GL jobs require the machine to have X started (runlevel 5) and GPGPU jobs can't run when X is running. Does anyone have a good method of letting users specify which type of GPU job they need to run and changing the runlevel appropriately?
>
> with nvidia hardware GPGPU jobs _can_ run when X
> is running. i am doing that on my desktop all the time.
> you may need to tweak the timeout that is set to
> keep GPGPU applications from hogging the GPU
> when X is running, if your GPGPU users write kernels
> that run excessively long. in most cases, that is
> just bad program design.
>
> axel.
>

Hi Ti

I guess you don't want to let users change the machine runlevel. However, I presume you could check whether the job requires X and change the runlevel in a prologue script, then return to runlevel 3 in an epilogue script at the end of the job.

I suppose you could identify the GL_GPU jobs by associating them with a node property, e.g. one named GL_GPU, added to the appropriate nodes in the server_priv/nodes file. The user would then request nodes with the 'GL_GPU' property in her/his Torque qsub script or on the command line, and your prologue script could deal with that by changing the runlevel to 5.

Just a suggestion.
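
[Editorial sketch] The prologue idea above could be sketched as follows. It assumes that the property is called GL_GPU, as suggested, and that the prologue receives the job's requested resource list as its fifth argument; the function name wants_gl_gpu is hypothetical. It is untested on a real cluster and leaves out details such as waiting for X to finish starting.

```shell
#!/bin/sh
# Sketch of a TORQUE prologue that switches to runlevel 5 for GL jobs.
# Assumption: "$5" holds the requested resource list, e.g.
#   nodes=1:ppn=4:GL_GPU,walltime=01:00:00

# Return 0 if GL_GPU appears as a whole token in the resource list.
wants_gl_gpu() {
    printf '%s\n' "$1" | tr ':,+=' '\n\n\n\n' | grep -qx 'GL_GPU'
}

if [ "$#" -ge 5 ] && wants_gl_gpu "$5"; then
    # The prologue runs as root on the node, so it may change the runlevel.
    telinit 5
fi
```

A matching epilogue would run the same check and call telinit 3 to return the node to its normal state.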
Gus Correa

From leggett at mcs.anl.gov Fri Nov 11 10:07:55 2011
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Fri, 11 Nov 2011 11:07:55 -0600
Subject: [torqueusers] Running either GPGPU or GL GPU jobs on nodes
In-Reply-To:
References:
Message-ID:

Can you run CUDA gdb while X is running. I have a user trying to do this and this is the error they're getting:

"error: All CUDA devices are used for X11 and cannot be used while debugging."

From akohlmey at cmm.chem.upenn.edu Fri Nov 11 10:41:56 2011
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Fri, 11 Nov 2011 12:41:56 -0500
Subject: [torqueusers] Running either GPGPU or GL GPU jobs on nodes
In-Reply-To:
References:
Message-ID:

On Fri, Nov 11, 2011 at 12:07 PM, Ti Leggett wrote:
> Can you run CUDA gdb while X is running. I have a user trying to do this and this is the error they're getting:
>
> "error: All CUDA devices are used for X11 and cannot be used while debugging."

no. i never even tried using cuda-gdb, but from what i know
about how it is supposed to work, this is likely a case where
you have to have a long-lived kernel and then it will collide
with using X at the same time. the error message confirms that.

question is, does a single person needing to do some debugging
require a queue reconfiguration? unless this happens on a regular
basis. i would just set up a reservation for this user and then
turn off X on that/those node(s) for the time being.

axel.

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From jwbacon at tds.net Fri Nov 11 11:39:51 2011
From: jwbacon at tds.net (Jason Bacon)
Date: Fri, 11 Nov 2011 12:39:51 -0600
Subject: [torqueusers] Interactive jobs not working
Message-ID: <4EBD6BF7.60600@tds.net>

Hi all,

I'm having some trouble with interactive jobs on both torque 2.5.8 and 3.0.2.
If I submit a script such as this one:

==========
#!/bin/sh

#PBS -N hostname
#PBS -I

hostname
==========

using "qsub hostname.pbs", I get the following email:

    Message 1:
    From adm at hpc1.cs.uwm.edu Fri Nov 11 12:17:27 2011
    Date: Fri, 11 Nov 2011 12:17:26 -0600 (CST)
    From: adm at hpc1.cs.uwm.edu
    To: bacon at hpc1.cs.uwm.edu
    Subject: PBS JOB 47.hpc1.cs.uwm.edu

    PBS Job Id: 47.hpc1.cs.uwm.edu
    Job Name: hostname
    Exec host: hpc1-1/0
    Aborted by PBS Server
    Job cannot be executed
    See Administrator for help

and the following appears in the mom log on the compute node:

11/11/2011 12:19:20;0001; pbs_mom;Job;47.hpc1.cs.uwm.edu;phase 2 of job launch successfully completed
11/11/2011 12:19:20;0001; pbs_mom;Job;TMomFinalizeJob3;Job 47.hpc1.cs.uwm.edu read start return code=-1 session=0
11/11/2011 12:19:20;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, before files staged, no retry (see syslog for more information)

Syslog:

Nov 11 12:19:20 hpc1-1 pbs_mom: LOG_ERROR::Connection refused (61) in TMomFinalizeChild, cannot open interactive qsub socket to host hpc1.cs.uwm.edu:28169 - 'cannot connect to port 777 in client_to_svr - connection refused' - check routing tables/multi-homed host issues

The full log for this job (loglevel=3) is included below.

If I use -I on the command line instead of in the script, it drops me into an interactive shell on the scheduled node instead of running the script:

==============
FreeBSD hpc1 bacon ~ 454: qsub -I hostname.pbs
qsub: waiting for job 50.hpc1.cs.uwm.edu to start
qsub: job 50.hpc1.cs.uwm.edu ready

12:25PM up 49 days, 1:50, 0 users, load averages: 0.00, 0.00, 0.00
FreeBSD hpc1-1 bacon ~ 401:
==============

Batch, array, and MPI jobs all work fine.

The only other issue I've seen is that the output does not appear in /var/spool/torque/spool during job execution. Output files are created when the job finishes, so it must be storing it somewhere else temporarily.
Not sure if this is a program, configuration, or documentation issue, or whether it has any relation to the interactive jobs issue. I can work around this one by configuring with --disable-spool. In that case, partial output shows up in the user's home directory as expected.

I've been Googling for a while on this, and found several others with similar issues, but no solution that works for me. I'm testing this on a small cluster with no multi-homed hosts and no firewall.

Any hints about what the root cause might be would be appreciated.

Thanks,

Jason

Full mom log:

11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0100; pbs_mom;Req;;Type QueueJob request received from PBS_Server at hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008; pbs_mom;Job;process_request;request type QueueJob from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0100; pbs_mom;Req;;Type JobScript request received from PBS_Server at hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008; pbs_mom;Job;process_request;request type JobScript from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0100; pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008; pbs_mom;Job;process_request;request type ReadyToCommit from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0100; pbs_mom;Req;;Type Commit request received from PBS_Server at hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008; pbs_mom;Job;process_request;request type Commit from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0001; pbs_mom;Job;job_nodes;job: 47.hpc1.cs.uwm.edu numnodes=1 numvnod=1
11/11/2011 12:19:20;0001; pbs_mom;Job;47.hpc1.cs.uwm.edu;phase 2 of job launch successfully completed
11/11/2011 12:19:20;0001; pbs_mom;Job;TMomFinalizeJob3;Job 47.hpc1.cs.uwm.edu read start return code=-1 session=0
11/11/2011 12:19:20;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, before files staged, no retry (see syslog for more information)
11/11/2011 12:19:20;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 47.hpc1.cs.uwm.edu
11/11/2011 12:19:20;0080; pbs_mom;Svr;scan_for_exiting;searching for exiting jobs
11/11/2011 12:19:20;0008; pbs_mom;Job;kill_job;scan_for_exiting: sending signal 9, "KILL" to job 47.hpc1.cs.uwm.edu, reason: local task termination detected
11/11/2011 12:19:20;0080; pbs_mom;Job;47.hpc1.cs.uwm.edu;sending preobit jobstat
11/11/2011 12:19:20;0008; pbs_mom;Job;scan_for_terminated;pid 53916 not tracked, exitcode=254
11/11/2011 12:19:20;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
11/11/2011 12:19:20;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
11/11/2011 12:19:20;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
11/11/2011 12:19:20;0080; pbs_mom;Job;47.hpc1.cs.uwm.edu;performing job clean-up in preobit_reply()
11/11/2011 12:19:20;0080; pbs_mom;Job;47.hpc1.cs.uwm.edu;epilog subtask created with pid 53917 - substate set to JOB_SUBSTATE_OBIT - registered post_epilogue
11/11/2011 12:19:20;0100; pbs_mom;Req;;Type StatusJob request received from PBS_Server at hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008; pbs_mom;Job;process_request;request type StatusJob from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0008; pbs_mom;Job;47.hpc1.cs.uwm.edu;checking job post-processing routine
11/11/2011 12:19:20;0080; pbs_mom;Req;post_epilogue;preparing obit message for job 47.hpc1.cs.uwm.edu
11/11/2011 12:19:20;0080; pbs_mom;Job;47.hpc1.cs.uwm.edu;obit sent to server
11/11/2011 12:19:20;0100; pbs_mom;Req;;Type DeleteJob request received from PBS_Server at hpc1.cs.uwm.edu, sock=11
11/11/2011 12:19:20;0008; pbs_mom;Job;process_request;request type DeleteJob from host hpc1.cs.uwm.edu allowed
11/11/2011 12:19:20;0008; pbs_mom;Job;47.hpc1.cs.uwm.edu;deleting job
11/11/2011 12:19:20;0080; pbs_mom;Job;47.hpc1.cs.uwm.edu;deleting job 47.hpc1.cs.uwm.edu in state EXITED
11/11/2011 12:19:20;0080; pbs_mom;Job;47.hpc1.cs.uwm.edu;removed job script
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:20;0002; pbs_mom;n/a;rm_request;setting alarm in rm_request
11/11/2011 12:19:57;0002; pbs_mom;n/a;setmax;setmax: dev /dev/ttyu0 access 1316795691 replaces max 0
11/11/2011 12:19:57;0002; pbs_mom;n/a;mom_server_update_stat;status update successfully sent to hpc1

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason W. Bacon
jwbacon at tds.net
http://personalpages.tds.net/~jwbacon
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From leggett at mcs.anl.gov Fri Nov 11 14:48:00 2011
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Fri, 11 Nov 2011 15:48:00 -0600
Subject: [torqueusers] Running either GPGPU or GL GPU jobs on nodes
In-Reply-To:
References:
Message-ID:

I like the simplicity of that idea :) Thanks!

On Nov 11, 2011, at 11:41 AM, Axel Kohlmeyer wrote:

> question is, does a single person needing to do some debugging
> require a queue reconfiguration? unless this happens on a regular
> basis. i would just set up a reservation for this user and then
> turn off X on that/those node(s) for the time being.
>
> axel.

From kenneth at sdsc.edu Fri Nov 11 16:55:27 2011
From: kenneth at sdsc.edu (Kenneth Yoshimoto)
Date: Fri, 11 Nov 2011 15:55:27 -0800 (PST)
Subject: [torqueusers] UC prologue.parallel?
In-Reply-To: <20111111142402.786B383A802B@mail.adaptivecomputing.com>
References: <20111111142402.786B383A802B@mail.adaptivecomputing.com>
Message-ID:

Rushton,

Thanks for the hints. I will try 500 perms and see what happens.

Kenneth

From zhaohscas at yahoo.com.cn Sun Nov 13 01:21:02 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Sun, 13 Nov 2011 16:21:02 +0800
Subject: [torqueusers] The strange issue when submit job with pbs: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. and so on.
Message-ID: <4EBF7DEE.5040301@yahoo.com.cn>

Hi all,

When I submit a job to my HPC cluster with a PBS script, I get the
following errors and the job can't start at all:

---------------
zhaohongsheng@node32:~/work/Dr.Zhao/castep_test_myhome> cat stderr
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
castepexe_mpi.exe: Rank 0:0: MPI_Init: ibv_create_cq() failed
castepexe_mpi.exe: Rank 0:0: MPI_Init: Can't initialize RDMA device
castepexe_mpi.exe: Rank 0:0: MPI_Init: MPI BUG: Cannot initialize RDMA protocol
MPI Application rank 0 exited before MPI_Init() with status 1
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00002B52C15F6C00 Unknown Unknown Unknown
libmpi.so.1 00002B52C0F5FC8B Unknown Unknown Unknown
libmpi.so.1 00002B52C0F3B57A Unknown Unknown Unknown
libmpi.so.1 00002B52C0FAF03C Unknown Unknown Unknown
libmpi.so.1 00002B52C0FBB3AB Unknown Unknown Unknown
castepexe_mpi.exe 00000000004D6539 Unknown Unknown Unknown
castepexe_mpi.exe 0000000000416404 Unknown Unknown Unknown
castepexe_mpi.exe 0000000000412DDC Unknown Unknown Unknown
libc.so.6 00002B52C171F184 Unknown Unknown Unknown
castepexe_mpi.exe 0000000000412D0A Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 00002B197B97CC00 Unknown Unknown Unknown
ld-linux-x86-64.s 00002B197B06AAF7 Unknown Unknown Unknown
ld-linux-x86-64.s 00002B197B05D2E1 Unknown Unknown Unknown
ld-linux-x86-64.s 00002B197B05F757 Unknown Unknown Unknown
ld-linux-x86-64.s 00002B197B067FCD Unknown Unknown
Unknown ld-linux-x86-64.s 00002B197B0641F6 Unknown Unknown Unknown ld-linux-x86-64.s 00002B197B067ACB Unknown Unknown Unknown libdl.so.2 00002B197BDD71FA Unknown Unknown Unknown ld-linux-x86-64.s 00002B197B0641F6 Unknown Unknown Unknown libdl.so.2 00002B197BDD758D Unknown Unknown Unknown libdl.so.2 00002B197BDD7171 Unknown Unknown Unknown libmpi.so.1 00002B197B36028A Unknown Unknown Unknown libmpi.so.1 00002B197B2C198E Unknown Unknown Unknown libmpi.so.1 00002B197B33503C Unknown Unknown Unknown libmpi.so.1 00002B197B3413AB Unknown Unknown Unknown castepexe_mpi.exe 00000000004D6539 Unknown Unknown Unknown castepexe_mpi.exe 0000000000416404 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412DDC Unknown Unknown Unknown libc.so.6 00002B197BAA5184 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412D0A Unknown Unknown Unknown forrtl: error (78): process killed (SIGTERM) Image PC Routine Line Source libpthread.so.0 00002B8D54E6DC00 Unknown Unknown Unknown libpthread.so.0 00002B8D54E6C900 Unknown Unknown Unknown libibverbs.so 00002B8D57B8C3A4 Unknown Unknown Unknown libibverbs.so 00002B8D57B8D06A Unknown Unknown Unknown libmpi.so.1 00002B8D547F1FA7 Unknown Unknown Unknown libmpi.so.1 00002B8D547EBE36 Unknown Unknown Unknown libmpi.so.1 00002B8D547D8F0C Unknown Unknown Unknown libmpi.so.1 00002B8D547B3E1D Unknown Unknown Unknown libmpi.so.1 00002B8D547B2999 Unknown Unknown Unknown libmpi.so.1 00002B8D5482603C Unknown Unknown Unknown libmpi.so.1 00002B8D548323AB Unknown Unknown Unknown castepexe_mpi.exe 00000000004D6539 Unknown Unknown Unknown castepexe_mpi.exe 0000000000416404 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412DDC Unknown Unknown Unknown libc.so.6 00002B8D54F96184 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412D0A Unknown Unknown Unknown forrtl: error (78): process killed (SIGTERM) Image PC Routine Line Source libpthread.so.0 00002B700FF46C00 Unknown Unknown Unknown ld-linux-x86-64.s 00002B700F634AF7 Unknown Unknown Unknown 
ld-linux-x86-64.s 00002B700F6272E1 Unknown Unknown Unknown ld-linux-x86-64.s 00002B700F627827 Unknown Unknown Unknown ld-linux-x86-64.s 00002B700F629987 Unknown Unknown Unknown ld-linux-x86-64.s 00002B700F631FCD Unknown Unknown Unknown ld-linux-x86-64.s 00002B700F62E1F6 Unknown Unknown Unknown ld-linux-x86-64.s 00002B700F631ACB Unknown Unknown Unknown libdl.so.2 00002B70103A11FA Unknown Unknown Unknown ld-linux-x86-64.s 00002B700F62E1F6 Unknown Unknown Unknown libdl.so.2 00002B70103A158D Unknown Unknown Unknown libdl.so.2 00002B70103A1171 Unknown Unknown Unknown libmpi.so.1 00002B700F8A28EC Unknown Unknown Unknown libmpi.so.1 00002B700F8A2A27 Unknown Unknown Unknown libmpi.so.1 00002B700F8E34BF Unknown Unknown Unknown libmpi.so.1 00002B700F8C2C4B Unknown Unknown Unknown libmpi.so.1 00002B700F8B230C Unknown Unknown Unknown libmpi.so.1 00002B700F88CE1D Unknown Unknown Unknown libmpi.so.1 00002B700F88B999 Unknown Unknown Unknown libmpi.so.1 00002B700F8FF03C Unknown Unknown Unknown libmpi.so.1 00002B700F90B3AB Unknown Unknown Unknown castepexe_mpi.exe 00000000004D6539 Unknown Unknown Unknown castepexe_mpi.exe 0000000000416404 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412DDC Unknown Unknown Unknown libc.so.6 00002B701006F184 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412D0A Unknown Unknown Unknown forrtl: error (78): process killed (SIGTERM) Image PC Routine Line Source libpthread.so.0 00002B92E5C7BC00 Unknown Unknown Unknown ld-linux-x86-64.s 00002B92E5369AF7 Unknown Unknown Unknown ld-linux-x86-64.s 00002B92E535C2E1 Unknown Unknown Unknown ld-linux-x86-64.s 00002B92E535C827 Unknown Unknown Unknown ld-linux-x86-64.s 00002B92E535E987 Unknown Unknown Unknown ld-linux-x86-64.s 00002B92E5366FCD Unknown Unknown Unknown ld-linux-x86-64.s 00002B92E53631F6 Unknown Unknown Unknown ld-linux-x86-64.s 00002B92E5366ACB Unknown Unknown Unknown libdl.so.2 00002B92E60D61FA Unknown Unknown Unknown ld-linux-x86-64.s 00002B92E53631F6 Unknown Unknown Unknown libdl.so.2 
00002B92E60D658D Unknown Unknown Unknown libdl.so.2 00002B92E60D6171 Unknown Unknown Unknown libmpi.so.1 00002B92E55D78EC Unknown Unknown Unknown libmpi.so.1 00002B92E55D7A27 Unknown Unknown Unknown libmpi.so.1 00002B92E56184BF Unknown Unknown Unknown libmpi.so.1 00002B92E55F7C4B Unknown Unknown Unknown libmpi.so.1 00002B92E55E730C Unknown Unknown Unknown libmpi.so.1 00002B92E55C1E1D Unknown Unknown Unknown libmpi.so.1 00002B92E55C0999 Unknown Unknown Unknown libmpi.so.1 00002B92E563403C Unknown Unknown Unknown libmpi.so.1 00002B92E56403AB Unknown Unknown Unknown castepexe_mpi.exe 00000000004D6539 Unknown Unknown Unknown castepexe_mpi.exe 0000000000416404 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412DDC Unknown Unknown Unknown libc.so.6 00002B92E5DA4184 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412D0A Unknown Unknown Unknown forrtl: error (78): process killed (SIGTERM) Image PC Routine Line Source libpthread.so.0 00002AABE6311C00 Unknown Unknown Unknown libpthread.so.0 00002AABE6310900 Unknown Unknown Unknown libibverbs.so 00002AABE90303A4 Unknown Unknown Unknown libibverbs.so 00002AABE903106A Unknown Unknown Unknown libmpi.so.1 00002AABE5C95FA7 Unknown Unknown Unknown libmpi.so.1 00002AABE5C8FE36 Unknown Unknown Unknown libmpi.so.1 00002AABE5C7CF0C Unknown Unknown Unknown libmpi.so.1 00002AABE5C57E1D Unknown Unknown Unknown libmpi.so.1 00002AABE5C56999 Unknown Unknown Unknown libmpi.so.1 00002AABE5CCA03C Unknown Unknown Unknown libmpi.so.1 00002AABE5CD63AB Unknown Unknown Unknown castepexe_mpi.exe 00000000004D6539 Unknown Unknown Unknown castepexe_mpi.exe 0000000000416404 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412DDC Unknown Unknown Unknown libc.so.6 00002AABE643A184 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412D0A Unknown Unknown Unknown forrtl: error (78): process killed (SIGTERM) Image PC Routine Line Source libpthread.so.0 00002B0D68F83C00 Unknown Unknown Unknown ld-linux-x86-64.s 00002B0D68671AF7 Unknown Unknown 
Unknown ld-linux-x86-64.s 00002B0D686642E1 Unknown Unknown Unknown ld-linux-x86-64.s 00002B0D68666757 Unknown Unknown Unknown ld-linux-x86-64.s 00002B0D6866EFCD Unknown Unknown Unknown ld-linux-x86-64.s 00002B0D6866B1F6 Unknown Unknown Unknown ld-linux-x86-64.s 00002B0D6866EACB Unknown Unknown Unknown libdl.so.2 00002B0D693DE1FA Unknown Unknown Unknown ld-linux-x86-64.s 00002B0D6866B1F6 Unknown Unknown Unknown libdl.so.2 00002B0D693DE58D Unknown Unknown Unknown libdl.so.2 00002B0D693DE171 Unknown Unknown Unknown libmpi.so.1 00002B0D6896728A Unknown Unknown Unknown libmpi.so.1 00002B0D688C898E Unknown Unknown Unknown libmpi.so.1 00002B0D6893C03C Unknown Unknown Unknown libmpi.so.1 00002B0D689483AB Unknown Unknown Unknown castepexe_mpi.exe 00000000004D6539 Unknown Unknown Unknown castepexe_mpi.exe 0000000000416404 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412DDC Unknown Unknown Unknown libc.so.6 00002B0D690AC184 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412D0A Unknown Unknown Unknown forrtl: error (78): process killed (SIGTERM) Image PC Routine Line Source libpthread.so.0 00002B83A77EFC00 Unknown Unknown Unknown ld-linux-x86-64.s 00002B83A6ED34E0 Unknown Unknown Unknown ld-linux-x86-64.s 00002B83A6ED39B3 Unknown Unknown Unknown ld-linux-x86-64.s 00002B83A6ED5028 Unknown Unknown Unknown ld-linux-x86-64.s 00002B83A6EDB2A5 Unknown Unknown Unknown ld-linux-x86-64.s 00002B83A6ED71F6 Unknown Unknown Unknown ld-linux-x86-64.s 00002B83A6EDAACB Unknown Unknown Unknown libdl.so.2 00002B83A7C4A1FA Unknown Unknown Unknown ld-linux-x86-64.s 00002B83A6ED71F6 Unknown Unknown Unknown libdl.so.2 00002B83A7C4A58D Unknown Unknown Unknown libdl.so.2 00002B83A7C4A171 Unknown Unknown Unknown libmpi.so.1 00002B83A714B8EC Unknown Unknown Unknown libmpi.so.1 00002B83A714BA27 Unknown Unknown Unknown libmpi.so.1 00002B83A718C4BF Unknown Unknown Unknown libmpi.so.1 00002B83A716BC4B Unknown Unknown Unknown libmpi.so.1 00002B83A715B30C Unknown Unknown Unknown libmpi.so.1 
00002B83A7135E1D Unknown Unknown Unknown libmpi.so.1 00002B83A7134999 Unknown Unknown Unknown libmpi.so.1 00002B83A71A803C Unknown Unknown Unknown libmpi.so.1 00002B83A71B43AB Unknown Unknown Unknown castepexe_mpi.exe 00000000004D6539 Unknown Unknown Unknown castepexe_mpi.exe 0000000000416404 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412DDC Unknown Unknown Unknown libc.so.6 00002B83A7918184 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412D0A Unknown Unknown Unknown forrtl: error (78): process killed (SIGTERM) Image PC Routine Line Source libpthread.so.0 00002AC2DC492C00 Unknown Unknown Unknown libpthread.so.0 00002AC2DC491900 Unknown Unknown Unknown libibverbs.so 00002AC2DF1B13A4 Unknown Unknown Unknown libibverbs.so 00002AC2DF1B206A Unknown Unknown Unknown libmpi.so.1 00002AC2DBE16FA7 Unknown Unknown Unknown libmpi.so.1 00002AC2DBE10E36 Unknown Unknown Unknown libmpi.so.1 00002AC2DBDFDF0C Unknown Unknown Unknown libmpi.so.1 00002AC2DBDD8E1D Unknown Unknown Unknown libmpi.so.1 00002AC2DBDD7999 Unknown Unknown Unknown libmpi.so.1 00002AC2DBE4B03C Unknown Unknown Unknown libmpi.so.1 00002AC2DBE573AB Unknown Unknown Unknown castepexe_mpi.exe 00000000004D6539 Unknown Unknown Unknown castepexe_mpi.exe 0000000000416404 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412DDC Unknown Unknown Unknown libc.so.6 00002AC2DC5BB184 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412D0A Unknown Unknown Unknown forrtl: error (78): process killed (SIGTERM) Image PC Routine Line Source plugin002_004_002 00002B195294BA80 Unknown Unknown Unknown libmpi.so.1 00002B1951975108 Unknown Unknown Unknown libmpi.so.1 00002B19518D698E Unknown Unknown Unknown libmpi.so.1 00002B195194A03C Unknown Unknown Unknown libmpi.so.1 00002B19519563AB Unknown Unknown Unknown castepexe_mpi.exe 00000000004D6539 Unknown Unknown Unknown castepexe_mpi.exe 0000000000416404 Unknown Unknown Unknown castepexe_mpi.exe 0000000000412DDC Unknown Unknown Unknown libc.so.6 00002B19520BA184 Unknown 
Unknown Unknown
castepexe_mpi.exe 0000000000412D0A Unknown Unknown Unknown
mpirun: Broken pipe
-----------------

Could you please give me some hints? Thanks in advance.

Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From stijn.deweirdt at ugent.be Sun Nov 13 02:25:24 2011
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Sun, 13 Nov 2011 10:25:24 +0100
Subject: [torqueusers] The strange issue when submit job with pbs: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. and so on.
In-Reply-To: <4EBF7DEE.5040301@yahoo.com.cn>
References: <4EBF7DEE.5040301@yahoo.com.cn>
Message-ID: <4EBF8D04.8060805@ugent.be>

This is unrelated to Torque, but here is the relevant FAQ entry for Open MPI:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

stijn

On 11/13/2011 09:21 AM, Hongsheng Zhao wrote:
> Hi all,
>
> When I submit a job to my HPC cluster with a PBS script, I get the
> following errors and the job can't start at all:
>
> ---------------
> zhaohongsheng@node32:~/work/Dr.Zhao/castep_test_myhome> cat stderr
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>     This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>     This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>     This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>     This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>     This will severely limit memory registrations.
>
> Regards

From ljjlp03 at gmail.com Sun Nov 13 08:48:29 2011
From: ljjlp03 at gmail.com (liu junjun)
Date: Sun, 13 Nov 2011 23:48:29 +0800
Subject: [torqueusers] installation/configuration problem with multi-homed system. --- unauthorized host/request
Message-ID:

Hi everyone,

I am trying to install torque-3.0.2 on a multi-homed system (two NICs)
but am running into an authorization problem. Please read my description
of the problem below. Any help is highly appreciated!

---- System information ----
OS: Ubuntu 10.10
eth0: external_host_name
eth1: internal_host_name
hostname: internal_hostname
--------------------------------------------

---- Basic Torque information ----
Torque version: 3.0.2
content of /var/spool/torque/server_name: internal_host_name
content of /var/spool/torque/torque.cfg: SERVERHOST internal_host_name

server and nodes can ping each other with internal_host_name
----------------------------------------

---- the problem -------------
1. My first try at the installation:
Following the installation document at
http://www.adaptivecomputing.com/resources/docs/torque/1.1installation.php,
I had a problem with the "torque.setup" script. It gave me "unauthorized
request". I noticed that the problem may be related to my two NICs.
I then double-checked the server_name file and also added "SERVERHOST
internal_host_name" to torque.cfg. Unfortunately, the problem still remains.

2. My second try at the installation:
I removed the first installation, disabled eth0 (which is associated
with external_host_name), and recompiled torque with exactly the same
steps as in my first try. Everything seemed fine. I could create a batch
queue and submit jobs, which ran and terminated normally. However, once
I enable eth0 (external_host_name), every qmgr command returns
"unauthorized request". I noticed that the server recognizes me as
user@external_host_name, whereas the pbs server is set as
internal_host_name, which is also the hostname. I guess this causes the
"unauthorized" issue, so I disabled eth0 again to regain the authority
to run qmgr and made the following settings:
====
qmgr -c 's s acl_hosts += external_host_name'
qmgr -c 's s managers += root@external_host_name'
qmgr -c 's s operators += root@external_host_name'
qmgr -c 's s submit_hosts += external_host_name'
====

After the above commands, I have operational access to the pbs_server
even when eth0 is enabled. However, all submitted jobs still remain in
the Q state. The following are parts of the 'qstat -f' output and of the
log files on the server:
==== part of 'qstat -f' command =====
Job Id: 51.internal_host_name
    Job_Name = STDIN
    Job_Owner = user@exteral_host_name
    job_state = Q
    queue = batch
    server = internal_host_name
    Checkpoint = u
    ctime = Sun Nov 13 19:25:12 2011
    Error_Path = internal_host_name:/home/liu/STDIN.e51
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Sun Nov 13 19:25:12 2011
    Output_Path = internal_host_name:/home/liu/STDIN.o51
===============================

==== part of pbs_server log ======
11/13/2011 19:25:05;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 3.0.2, loglevel = 0
11/13/2011 19:25:12;0100;PBS_Server;Job;51.interal_host_name;enqueuing into batch, state 1 hop 1
11/13/2011 19:25:12;0008;PBS_Server;Job;51.interal_host_name;Job Queued at request of user@external_host_name, owner = user@external_host_name, job name = STDIN, queue = batch
11/13/2011 19:25:12;0040;PBS_Server;Svr;cddlogin;Scheduler was sent the command new
11/13/2011 19:25:12;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 7 (Premature end of message), type=Connect
11/13/2011 19:25:12;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
11/13/2011 19:25:12;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
=========================

==== part of pbs_sched log ======
11/13/2011 19:25:12;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::badconn, external_host_name on port 762 unauthorized host
==========================

As you can see from the above information, although external_host_name
is set as a submit_host, all jobs still remain in the 'Q' state because
the job owner is user@external_host_name! My question is, either:
1. How can I make the server accept jobs from users@external_host_name?
or
2. How can I make the server recognize every submitted job as belonging
to user@internal_host_name?

Thanks in advance!

Junjun

From jwbacon at tds.net Mon Nov 14 07:26:40 2011
From: jwbacon at tds.net (Jason Bacon)
Date: Mon, 14 Nov 2011 08:26:40 -0600
Subject: [torqueusers] installation/configuration problem with multi-homed system. --- unauthorized host/request
In-Reply-To:
References:
Message-ID: <4EC12520.6000809@tds.net>

I had a similar issue and got around it by simply setting up /etc/hosts
on each node properly.

On the multihomed head node, the hostname is bound to the external IP in
/etc/hosts. On the compute nodes, the hostname of the head node is bound
to its internal address. Also be sure that name resolution on the
compute nodes is configured to check files before DNS.

No special configuration was required within torque.

Regards,

-J

On 11/13/11 09:48, liu junjun wrote:
> Hi everyone,
>
> I am trying to install torque-3.0.2 on a multi-homed system (two NICs)
> but am running into an authorization problem. Please read my description
> of the problem below. Any help is highly appreciated!
> > ---- System information ---- > OS: Ubuntu 10.10 > eth0: external_host_name > eth1: internal_host_name > hostname: internal_hostname > -------------------------------------------- > > ---- Basic Torque information ---- > Torque version: 3.0.2 > content of /var/spool/torque/server_name: internal_host_name > content of /var/spool/torque/torque.cfg: SERVERHOST internal_host_name > > server and nodes can ping each other with internal_host_name > ---------------------------------------- > > > ---- the problem ------------- > 1. My first try on the installation: > By following the installation document at > http://www.adaptivecomputing.com/resources/docs/torque/1.1installation.php, > I have problem with "torque.setup" script. It gave me "unauthorized > request". I noticed that the problem may related to my two NIC cards. > Then I double checked the server_name file and also added "SERVERHOST > interal_host_name" to torque.cfg. Unfortunately, problem sitll remains. > > 2. My 2nd try on the installation: > I removed the first installation, and disabled eth0 which is > associated with external_host_name, and recompiled torque again with > the exactly same steps as that in my first try on the installation. > Everything seems fine. I can create a batch queue and can submit jobs > which run and terminate normally. However, once I enable eth0 > (external_host_name), every qmgr command returns "unauthorized > request". I noticed that the server recognizes me as > user at external_host_name, whereas the pbs server is set as > internal_host_name which is also the hostname. 
I guess this causes the > "unauthorized" issue, so I made the following settings, by disabling > eth0 to get the authority on the operation: > ==== > qmgr -c 's s acl_hosts += external_host_name' > qmgr -c 's s managers += root at external_host_name' > qmgr -c 's s operators += root at external_host_name' > qmgr -c 's s submit_hosts += external_host_name' > ==== > > After the above commands, I gain the operational access to the > pbs_server even when eth0 is enabled. However, all the submitted jobs > are still remain in the Q state. The followings are part of the 'qstat > -f' command and log files on the server: > ==== part of 'qstat -f' command ===== > Job Id: 51.internal_host_name > Job_Name = STDIN > Job_Owner = user at exteral_host_name > job_state = Q > queue = batch > server = internal_host_name > Checkpoint = u > ctime = Sun Nov 13 19:25:12 2011 > Error_Path = internal_host_name:/home/liu/STDIN.e51 > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = a > mtime = Sun Nov 13 19:25:12 2011 > Output_Path = internal_host_name:/home/liu/STDIN.o51 > =============================== > > ==== part of pbs_server log ====== > 11/13/2011 19:25:05;0002;PBS_Server;Svr;PBS_Server;Torque Server > Version = 3.0.2, loglevel = 0 > 11/13/2011 19:25:12;0100;PBS_Server;Job;51.interal_host_name;enqueuing > into batch, state 1 hop 1 > 11/13/2011 19:25:12;0008;PBS_Server;Job;51.interal_host_name;Job > Queued at request of user at external_host_name, owner = > user at external_host_name, job name = STDIN, queue = batch > 11/13/2011 19:25:12;0040;PBS_Server;Svr;cddlogin;Scheduler was sent > the command new > 11/13/2011 19:25:12;0080;PBS_Server;Req;dis_request_read;req header > bad, dis error 7 (Premature end of message), type=Connect > 11/13/2011 19:25:12;0080;PBS_Server;Req;req_reject;Reject reply > code=15058(Bad DIS based Request Protocol MSG=cannot decode message), > aux=0, type=Connect, from @ > 11/13/2011 19:25:12;0002;PBS_Server;Req;dis_reply_write;DIS reply > 
failure, -1 > ========================= > > ==== part of pbs_sche log ====== > 11/13/2011 19:25:12;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::badconn, > external_host_name on port 762 unauthorized host > ========================== > > As you can see from the above information, although exteral_host_name > is set as a submit_host, all jobs are still remain in 'Q' state > because the job owner is user at external_host_name! My question is : > either 1. how to make the server to accept jobs from > users at external_host_name? > or 2. how to make the server to recognize every submitted jobs as > belonging to user at internal_host_name? > > Thanks in advance! > > Junjun > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Jason W. Bacon jwbacon at tds.net http://personalpages.tds.net/~jwbacon ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From lloyd_brown at byu.edu Mon Nov 14 07:41:23 2011 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Mon, 14 Nov 2011 07:41:23 -0700 Subject: [torqueusers] The strange issue when submit job with pbs: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. and so on. In-Reply-To: <4EBF8D04.8060805@ugent.be> References: <4EBF7DEE.5040301@yahoo.com.cn> <4EBF8D04.8060805@ugent.be> Message-ID: <4EC12893.8050208@byu.edu> I'm not sure I'd go so far as to say it's definitely unrelated to Torque. We ran into a similar problem, and in the end it was a limitation imposed by the shell/environment when the pbs_mom started. Since jobs were child processes of the pbs_mom, they inherited the memory limits. 
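One way to see whether this inheritance is what bites on a given node is to compare the locked-memory limit of the running pbs_mom with the limit of an ordinary shell. The following is only a sketch, assuming a Linux node with /proc/<pid>/limits and the procps pgrep tool; the helper name locked_mem_limit is made up for illustration and is not part of Torque.

```shell
#!/bin/sh
# Sketch only: report the soft "Max locked memory" limit of a process.
# Assumes Linux /proc/<pid>/limits; locked_mem_limit is a hypothetical helper.
locked_mem_limit() {
    # Line layout: "Max locked memory  <soft>  <hard>  bytes"
    awk '/^Max locked memory/ { print $4 }' "/proc/$1/limits"
}

# Limit of the current shell, for comparison (bytes, or "unlimited").
echo "this shell: $(locked_mem_limit $$)"

# Limit of the oldest process named exactly pbs_mom, if one runs on this host.
# Reading another user's /proc entry may need root on some kernels.
mom_pid=$(pgrep -o -x pbs_mom || true)
if [ -n "$mom_pid" ]; then
    echo "pbs_mom (pid $mom_pid): $(locked_mem_limit "$mom_pid")"
else
    echo "no pbs_mom process found on this host"
fi
```

Calling the same helper with $$ from inside a job script should show the value the job actually inherited; if that is a small number of bytes (such as the 32768 in the libibverbs warning) while a login shell reports "unlimited", the limit is most likely coming from the environment pbs_mom was started in.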
Since we're using the RPM packages, and the contrib'd init script, the best solution was to put the following lines in /etc/sysconfig/pbs_mom; if you're using a different environment, etc., you may have to adjust the method you use to do something similar:

> #installation-specific limits for pbs_mom
> ulimit -l unlimited
> ulimit -n 32768

There are a lot of opinions about what the right limits to put here are, as well.

Lloyd Brown

On 11/13/11 2:25 AM, Stijn De Weirdt wrote:
> this is unrelated to torque, but here's the relevant faq entry for openmpi
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> stijn
>
> On 11/13/2011 09:21 AM, Hongsheng Zhao wrote:
>> Hi all,
>>
>> When I submit a job to my hpc with pbs script, I meet the following err
>> and the job cann't start at all:
>>
>> ---------------
>> zhaohongsheng at node32:~/work/Dr.Zhao/castep_test_myhome> cat stderr
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>        This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>        This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>        This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>        This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>        This will severely limit memory registrations.
>> castepexe_mpi.exe: Rank 0:0: MPI_Init: ibv_create_cq() failed
>> castepexe_mpi.exe: Rank 0:0: MPI_Init: Can't initialize RDMA device
>> castepexe_mpi.exe: Rank 0:0: MPI_Init: MPI BUG: Cannot initialize RDMA
>> protocol
>> MPI Application rank 0 exited before MPI_Init() with status 1
>> forrtl: error (78): process killed (SIGTERM)
>> Image              PC                Routine            Line
>> Source
>> libpthread.so.0    00002B52C15F6C00  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B52C0F5FC8B  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B52C0F3B57A  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B52C0FAF03C  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B52C0FBB3AB  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  00000000004D6539  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  0000000000416404  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  0000000000412DDC  Unknown               Unknown  Unknown
>> libc.so.6          00002B52C171F184  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  0000000000412D0A  Unknown               Unknown  Unknown
>> mpirun: Broken pipe
>> -----------------
>>
>> Could you please give me some hints? Thanks in advance.
>>
>> Regards

From stijn.deweirdt at ugent.be Mon Nov 14 08:07:36 2011
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Mon, 14 Nov 2011 16:07:36 +0100
Subject: [torqueusers] The strange issue when submit job with pbs: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. and so on.
In-Reply-To: <4EC12893.8050208@byu.edu>
References: <4EBF7DEE.5040301@yahoo.com.cn> <4EBF8D04.8060805@ugent.be> <4EC12893.8050208@byu.edu>
Message-ID: <4EC12EB8.8030502@ugent.be>

hi lloyd,

we have the -n limit set like this as well, but if an application doesn't use torque to start the mpi processes on the other nodes (eg some mpi build without torque support and relying on eg ssh), how is this limit then set?

stijn

From lloyd_brown at byu.edu Mon Nov 14 09:57:51 2011
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Mon, 14 Nov 2011 09:57:51 -0700
Subject: [torqueusers] The strange issue when submit job with pbs: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. and so on.
In-Reply-To: <4EC12EB8.8030502@ugent.be>
References: <4EBF7DEE.5040301@yahoo.com.cn> <4EBF8D04.8060805@ugent.be> <4EC12893.8050208@byu.edu> <4EC12EB8.8030502@ugent.be>
Message-ID: <4EC1488F.3070705@byu.edu>

Well, first of all, if you are using OpenMPI with Torque, then you really should get it recompiled with the TM API, so that the remote pbs_mom's can be the parent process of the corresponding user processes. To check if your version of OpenMPI was compiled with it, do something like "ompi_info | grep tm", and see what the output shows. Here's mine (ignore the ptmalloc line):

> $ ompi_info | grep tm
> MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4.2)
> MCA ras: tm (MCA v2.0, API v2.0, Component v1.4.2)
> MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.2)

Even without that, the first process in the list will STILL be launched as a child of pbs_mom; the TM API enables the *other* processes to be children of their local pbs_mom as well.

Having said all that, if you have something that launches remote processes some other way, like some other MPI version, or some commercial package (usually uses ssh or rsh), then I'm not sure. Maybe the SSH daemon config or startup script?
Maybe somewhere in /etc/security/limits.conf?

Lloyd

From ljjlp03 at gmail.com Mon Nov 14 19:34:42 2011
From: ljjlp03 at gmail.com (liu junjun)
Date: Tue, 15 Nov 2011 10:34:42 +0800
Subject: [torqueusers] installation/configuration problem with multi-homed system. --- unauthorized host/request
In-Reply-To: <4EC12520.6000809@tds.net>
References: <4EC12520.6000809@tds.net>
Message-ID:

Hi Jason,

Thank you very much! It works!

Best,
Junjun

From zhaohscas at yahoo.com.cn Tue Nov 15 02:51:07 2011
From: zhaohscas at yahoo.com.cn (=?utf-8?B?57qi55SfIOi1tQ==?=)
Date: Tue, 15 Nov 2011 17:51:07 +0800 (CST)
Subject: [torqueusers] The strange issue when submit job with pbs: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. and so on.
Message-ID: <1321350667.31348.YahooMailClassic@web15606.mail.cnb.yahoo.com>

In my case, the pbs_mom file is at the following location:

/etc/init.d/pbs_mom

And I found the following snippet in it:

----------
# pbs_mom       This script will start and stop the PBS Mom
#
# chkconfig: 345 95 5
# description: TORQUE/PBS is a versatile batch system for SMPs and clusters
#
ulimit -n 32768
--------------

Then, according to your suggestion, I changed the above code into the following form:

--------
# pbs_mom       This script will start and stop the PBS Mom
#
# chkconfig: 345 95 5
# description: TORQUE/PBS is a versatile batch system for SMPs and clusters
#
# The following line is added by me
ulimit -l unlimited
# the following line is the original one in this script:
ulimit -n 32768
-----------

Then I issued the following command to apply the above changes:

zhaohongsheng at node32:~/work/Dr.Zhao/castep_test_myhome> sudo /etc/init.d/pbs_mom restart
Shutting down dispatcher Mom: action OK
Starting dispatcher Mom: action OK

But after this, the issue remains the same.

Regards.

--- On Mon, 11/14/11, Lloyd Brown wrote:

From: Lloyd Brown
Subject: Re: [torqueusers] The strange issue when submit job with pbs: libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. and so on.
To: torqueusers at supercluster.org
Date: Monday, November 14, 2011, 10:41 PM

I'm not sure I'd go so far as to say it's definitely unrelated to Torque. We ran into a similar problem, and in the end it was a limitation imposed by the shell/environment when the pbs_mom started. Since jobs were child processes of the pbs_mom, they inherited the memory limits.
Since we're using the RPM packages, and the contrib'd init script, the best solution was to put the following lines in /etc/sysconfig/pbs_mom; if you're using a different environment, etc., you may have to adjust the method you use to do something similar:

> #installation-specific limits for pbs_mom
> ulimit -l unlimited
> ulimit -n 32768

There are a lot of opinions about what the right limits to put here are, as well.

Lloyd Brown

On 11/13/11 2:25 AM, Stijn De Weirdt wrote:
> this is unrelated to torque, but here's the relevant faq entry for openmpi
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> stijn
>
> On 11/13/2011 09:21 AM, Hongsheng Zhao wrote:
>> Hi all,
>>
>> When I submit a job to my hpc with pbs script, I meet the following err
>> and the job cann't start at all:
>>
>> ---------------
>> zhaohongsheng at node32:~/work/Dr.Zhao/castep_test_myhome> cat stderr
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>        This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>        This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>        This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>        This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>        This will severely limit memory registrations.
>> castepexe_mpi.exe: Rank 0:0: MPI_Init: ibv_create_cq() failed
>> castepexe_mpi.exe: Rank 0:0: MPI_Init: Can't initialize RDMA device
>> castepexe_mpi.exe: Rank 0:0: MPI_Init: MPI BUG: Cannot initialize RDMA
>> protocol
>> MPI Application rank 0 exited before MPI_Init() with status 1
>> forrtl: error (78): process killed (SIGTERM)
>> Image              PC                Routine            Line
>> Source
>> libpthread.so.0    00002B52C15F6C00  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B52C0F5FC8B  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B52C0F3B57A  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B52C0FAF03C  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B52C0FBB3AB  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  00000000004D6539  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  0000000000416404  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  0000000000412DDC  Unknown               Unknown  Unknown
>> libc.so.6          00002B52C171F184  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  0000000000412D0A  Unknown               Unknown  Unknown
>> forrtl: error (78): process killed (SIGTERM)
>> Image              PC                Routine            Line
>> Source
>> libpthread.so.0    00002B197B97CC00  Unknown               Unknown  Unknown
>> ld-linux-x86-64.s  00002B197B06AAF7  Unknown               Unknown  Unknown
>> ld-linux-x86-64.s  00002B197B05D2E1  Unknown               Unknown  Unknown
>> ld-linux-x86-64.s  00002B197B05F757  Unknown               Unknown  Unknown
>> ld-linux-x86-64.s  00002B197B067FCD  Unknown               Unknown  Unknown
>> ld-linux-x86-64.s  00002B197B0641F6  Unknown               Unknown  Unknown
>> ld-linux-x86-64.s  00002B197B067ACB  Unknown               Unknown  Unknown
>> libdl.so.2         00002B197BDD71FA  Unknown               Unknown  Unknown
>> ld-linux-x86-64.s  00002B197B0641F6  Unknown               Unknown  Unknown
>> libdl.so.2         00002B197BDD758D  Unknown               Unknown  Unknown
>> libdl.so.2         00002B197BDD7171  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B197B36028A  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B197B2C198E  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B197B33503C  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B197B3413AB  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  00000000004D6539  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  0000000000416404  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  0000000000412DDC  Unknown               Unknown  Unknown
>> libc.so.6          00002B197BAA5184  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  0000000000412D0A  Unknown               Unknown  Unknown
>> forrtl: error (78): process killed (SIGTERM)
>> Image              PC                Routine            Line
>> Source
>> libpthread.so.0    00002B8D54E6DC00  Unknown               Unknown  Unknown
>> libpthread.so.0    00002B8D54E6C900  Unknown               Unknown  Unknown
>> libibverbs.so      00002B8D57B8C3A4  Unknown               Unknown  Unknown
>> libibverbs.so      00002B8D57B8D06A  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B8D547F1FA7  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B8D547EBE36  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B8D547D8F0C  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B8D547B3E1D  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B8D547B2999  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B8D5482603C  Unknown               Unknown  Unknown
>> libmpi.so.1        00002B8D548323AB  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  00000000004D6539  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  0000000000416404  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  0000000000412DDC  Unknown               Unknown  Unknown
>> libc.so.6          00002B8D54F96184  Unknown               Unknown  Unknown
>> castepexe_mpi.exe  0000000000412D0A  Unknown               Unknown  Unknown
>> forrtl: error (78): process killed (SIGTERM)
>> Image              PC                Routine            Line
>> Source
>> libpthread.so.0    00002B700FF46C00  Unknown               Unknown  Unknown
>> ld-linux-x86-64.s  00002B700F634AF7
Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B700F6272E1? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B700F627827? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B700F629987? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B700F631FCD? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B700F62E1F6? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B700F631ACB? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libdl.so.2? ? ? ???00002B70103A11FA? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B700F62E1F6? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libdl.so.2? ? ? ???00002B70103A158D? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libdl.so.2? ? ? ???00002B70103A1171? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B700F8A28EC? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B700F8A2A27? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B700F8E34BF? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B700F8C2C4B? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B700F8B230C? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B700F88CE1D? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B700F88B999? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B700F8FF03C? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B700F90B3AB? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 00000000004D6539? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000416404? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412DDC? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libc.so.6? ? ? ? ? 00002B701006F184? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412D0A? Unknown? ? ? ? ? ? ???Unknown? Unknown >> forrtl: error (78): process killed (SIGTERM) >> Image? ? ? ? 
? ? ? PC? ? ? ? ? ? ? ? Routine? ? ? ? ? ? Line >> Source >> libpthread.so.0? ? 00002B92E5C7BC00? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B92E5369AF7? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B92E535C2E1? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B92E535C827? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B92E535E987? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B92E5366FCD? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B92E53631F6? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B92E5366ACB? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libdl.so.2? ? ? ???00002B92E60D61FA? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B92E53631F6? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libdl.so.2? ? ? ???00002B92E60D658D? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libdl.so.2? ? ? ???00002B92E60D6171? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B92E55D78EC? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B92E55D7A27? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B92E56184BF? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B92E55F7C4B? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B92E55E730C? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B92E55C1E1D? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B92E55C0999? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B92E563403C? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B92E56403AB? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 00000000004D6539? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000416404? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412DDC? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libc.so.6? ? ? ? ? 00002B92E5DA4184? 
Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412D0A? Unknown? ? ? ? ? ? ???Unknown? Unknown >> forrtl: error (78): process killed (SIGTERM) >> Image? ? ? ? ? ? ? PC? ? ? ? ? ? ? ? Routine? ? ? ? ? ? Line >> Source >> libpthread.so.0? ? 00002AABE6311C00? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libpthread.so.0? ? 00002AABE6310900? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libibverbs.so? ? ? 00002AABE90303A4? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libibverbs.so? ? ? 00002AABE903106A? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AABE5C95FA7? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AABE5C8FE36? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AABE5C7CF0C? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AABE5C57E1D? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AABE5C56999? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AABE5CCA03C? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AABE5CD63AB? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 00000000004D6539? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000416404? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412DDC? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libc.so.6? ? ? ? ? 00002AABE643A184? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412D0A? Unknown? ? ? ? ? ? ???Unknown? Unknown >> forrtl: error (78): process killed (SIGTERM) >> Image? ? ? ? ? ? ? PC? ? ? ? ? ? ? ? Routine? ? ? ? ? ? Line >> Source >> libpthread.so.0? ? 00002B0D68F83C00? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B0D68671AF7? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B0D686642E1? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B0D68666757? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B0D6866EFCD? Unknown? ? ? ? ? 
? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B0D6866B1F6? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B0D6866EACB? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libdl.so.2? ? ? ???00002B0D693DE1FA? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B0D6866B1F6? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libdl.so.2? ? ? ???00002B0D693DE58D? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libdl.so.2? ? ? ???00002B0D693DE171? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B0D6896728A? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B0D688C898E? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B0D6893C03C? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B0D689483AB? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 00000000004D6539? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000416404? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412DDC? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libc.so.6? ? ? ? ? 00002B0D690AC184? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412D0A? Unknown? ? ? ? ? ? ???Unknown? Unknown >> forrtl: error (78): process killed (SIGTERM) >> Image? ? ? ? ? ? ? PC? ? ? ? ? ? ? ? Routine? ? ? ? ? ? Line >> Source >> libpthread.so.0? ? 00002B83A77EFC00? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B83A6ED34E0? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B83A6ED39B3? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B83A6ED5028? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B83A6EDB2A5? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B83A6ED71F6? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 00002B83A6EDAACB? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libdl.so.2? ? ? ???00002B83A7C4A1FA? Unknown? ? ? ? ? ? ???Unknown? Unknown >> ld-linux-x86-64.s? 
00002B83A6ED71F6? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libdl.so.2? ? ? ???00002B83A7C4A58D? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libdl.so.2? ? ? ???00002B83A7C4A171? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B83A714B8EC? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B83A714BA27? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B83A718C4BF? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B83A716BC4B? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B83A715B30C? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B83A7135E1D? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B83A7134999? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B83A71A803C? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B83A71B43AB? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 00000000004D6539? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000416404? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412DDC? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libc.so.6? ? ? ? ? 00002B83A7918184? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412D0A? Unknown? ? ? ? ? ? ???Unknown? Unknown >> forrtl: error (78): process killed (SIGTERM) >> Image? ? ? ? ? ? ? PC? ? ? ? ? ? ? ? Routine? ? ? ? ? ? Line >> Source >> libpthread.so.0? ? 00002AC2DC492C00? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libpthread.so.0? ? 00002AC2DC491900? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libibverbs.so? ? ? 00002AC2DF1B13A4? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libibverbs.so? ? ? 00002AC2DF1B206A? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AC2DBE16FA7? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AC2DBE10E36? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AC2DBDFDF0C? Unknown? ? ? ? ? ? ???Unknown? 
Unknown >> libmpi.so.1? ? ? ? 00002AC2DBDD8E1D? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AC2DBDD7999? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AC2DBE4B03C? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002AC2DBE573AB? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 00000000004D6539? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000416404? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412DDC? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libc.so.6? ? ? ? ? 00002AC2DC5BB184? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412D0A? Unknown? ? ? ? ? ? ???Unknown? Unknown >> forrtl: error (78): process killed (SIGTERM) >> Image? ? ? ? ? ? ? PC? ? ? ? ? ? ? ? Routine? ? ? ? ? ? Line >> Source >> plugin002_004_002? 00002B195294BA80? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B1951975108? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B19518D698E? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B195194A03C? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libmpi.so.1? ? ? ? 00002B19519563AB? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 00000000004D6539? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000416404? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412DDC? Unknown? ? ? ? ? ? ???Unknown? Unknown >> libc.so.6? ? ? ? ? 00002B19520BA184? Unknown? ? ? ? ? ? ???Unknown? Unknown >> castepexe_mpi.exe? 0000000000412D0A? Unknown? ? ? ? ? ? ???Unknown? Unknown >> mpirun: Broken pipe >> ----------------- >> >> Could you please give me some hints?? Thanks in advance. 
>>
>> Regards
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From knielson at adaptivecomputing.com Tue Nov 15 08:29:02 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Tue, 15 Nov 2011 08:29:02 -0700 (MST)
Subject: [torqueusers] [torquedev] TORQUE dinner at SuperComputing
In-Reply-To: <4EC079AB.4090405@unimelb.edu.au>
Message-ID:

>
> Hi all,
>
> On 08/11/11 04:33, Ken Nielson wrote:
>
> > I am hoping to continue a tradition of meeting with
> > TORQUE developers and users for dinner while at
> > SuperComputing this year. Is there anyone interested?
>

Hi all,

I got in yesterday afternoon and had to go directly to the show floor. I got to meet some long-time users for the first time as well as meet up again with others. It was great. This is why SuperComputing is such a great event.

I still do not have a place to go eat. Stay tuned. Please post your suggestions. We want something that is within walking distance of the convention center.

We are still on for 7:00. Come to our booth for information if you can't get to your e-mail.

Regards

Ken

From stevenx.a.duchene at intel.com Tue Nov 15 17:39:20 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 15 Nov 2011 16:39:20 -0800
Subject: [torqueusers] [torquedev] TORQUE dinner at SuperComputing
In-Reply-To:
References: <4EC079AB.4090405@unimelb.edu.au>
Message-ID:

Is there going to be any Torque stuff going on tomorrow (Wednesday)?
--
Steven DuChene

From stevenx.a.duchene at intel.com Tue Nov 15 17:45:38 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 15 Nov 2011 16:45:38 -0800
Subject: [torqueusers] No current maintained web monitoring interfaces, job_monarch and pbswebmon broken
In-Reply-To:
References: <4EC079AB.4090405@unimelb.edu.au>
Message-ID:

I spent all day trying to get job_monarch working and only met with dismal failure. The jobmond.py daemon runs but does not seem to communicate with ganglia or torque. After much debugging and looking at the underlying python code I finally gave up.

The various example python scripts packaged with the pbs_python module source seem to work just fine, so I know the pbs_python bits work OK.
I then downloaded pbswebmon, only to discover that it has never been updated for the 4.X versions of the pbs_python package and gives an error when it tries to query jobs.

It seems that neither one of these web-based job control monitoring packages is maintained any longer. Are there any current web-based open source torque monitoring packages available?
--
Steven DuChene

From knielson at adaptivecomputing.com Tue Nov 15 18:00:01 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Tue, 15 Nov 2011 18:00:01 -0700 (MST)
Subject: [torqueusers] SuperComputing TORQUE Dinner
In-Reply-To: <0f6bdec2-3d0b-4680-a637-bf2f1e158d5b@mail>
Message-ID:

Hi all,

We finally have a place for dinner tonight at 7:00. The restaurant is called "The Market's Bar & Grill". The address on the card says upstairs at Pike & Pike. It is at the end of Pike Street, down by Pike's market.

We are meeting in the lobby of the Sheraton at 7:00 to walk down. To get better directions to the restaurant you can go to their web site at pikeplacebarandgrill.com.

We have a reservation under my name (Ken Nielson). You can call me at 801-310-2813.

See you tonight

Ken

From stevenx.a.duchene at intel.com Tue Nov 15 18:06:15 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 15 Nov 2011 17:06:15 -0800
Subject: [torqueusers] SuperComputing TORQUE Dinner
In-Reply-To:
References: <0f6bdec2-3d0b-4680-a637-bf2f1e158d5b@mail>
Message-ID:

The correct web site is www.pikeplacemarket.org
The one given in the announcement, pikeplacebarandgrill.com, does not work.

From stevenx.a.duchene at intel.com Tue Nov 15 18:15:48 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 15 Nov 2011 17:15:48 -0800
Subject: [torqueusers] SuperComputing TORQUE Dinner
In-Reply-To:
References: <0f6bdec2-3d0b-4680-a637-bf2f1e158d5b@mail>
Message-ID:

Actually the URL is this:
http://www.pikeplacemarket.org/explore_the_market/market_map#343
but I also found a restaurant review site that has reviews of this restaurant, and it sounds like most people give it pretty awful reviews:
http://www.yelp.com/biz/pike-place-bar-and-grill-seattle

From stevenx.a.duchene at intel.com Tue Nov 15 18:30:15 2011
From: stevenx.a.duchene at intel.com (DuChene, StevenX A)
Date: Tue, 15 Nov 2011 17:30:15 -0800
Subject: [torqueusers] SuperComputing TORQUE Dinner
In-Reply-To:
References: <0f6bdec2-3d0b-4680-a637-bf2f1e158d5b@mail>
Message-ID:

BTW, in looking around I found the following place that seems to have GREAT reviews on yelp.

Metropolitan Grill
820 2nd Ave
Seattle, WA 98104
(206) 624-3287
www.themetropolitangrill.com

It seems to be no further from the Sheraton hotel in downtown Seattle than the other place.
--
Steven DuChene

From zhaohscas at yahoo.com.cn Tue Nov 15 21:07:12 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Wed, 16 Nov 2011 12:07:12 +0800
Subject: [torqueusers] Want to let pbs script use different cores from different nodes.
Message-ID: <4EC336F0.6070603@yahoo.com.cn>

Hi all,

Suppose I have two nodes, one with 8 cores and the other with 16 cores. I want to write a PBS script that can use all 24 cores from these two nodes at the same time. What is the corresponding command to use in the PBS script? Any hints will be highly appreciated.

Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From basv at sara.nl Tue Nov 15 22:41:58 2011
From: basv at sara.nl (Bas van der Vlies)
Date: Wed, 16 Nov 2011 06:41:58 +0100
Subject: [torqueusers] No current maintained web monitoring interfaces, job_monarch and pbswebmon broken
In-Reply-To:
References: <4EC079AB.4090405@unimelb.edu.au>
Message-ID:

Steven,

We still use it at our site and it's maintained. Just go to the jobmonarch website and report your problem.

The site is:
 * https://subtrac.sara.nl/oss/jobmonarch

Example:
 * http://ganglia.sara.nl/?c=GINA%20Cluster&m=load_one&r=hour&s=descending&hc=4&mc=2

--
Bas van der Vlies

On Nov 15, 2011, at 16:46, "DuChene, StevenX A" wrote:

> I spent all day trying to get job_monarch working and only met with dismal failure.
The jobmond.py daemon runs but does not seem to communicate with ganglia or torque. After much debugging and looking at the underlying python code I finally gave up.
>
> The various example python scripts packaged with the pbs_python module source seem to work just fine, so I know the pbs_python bits work OK.
>
> I then downloaded pbswebmon, only to discover that it has never been updated for the 4.X versions of the pbs_python package and gives an error when it tries to query jobs.
>
> It seems that neither one of these web-based job control monitoring packages is maintained any longer. Are there any current web-based open source torque monitoring packages available?
> --
> Steven DuChene

From gabe at msi.umn.edu Tue Nov 15 22:53:58 2011
From: gabe at msi.umn.edu (Gabe Turner)
Date: Tue, 15 Nov 2011 23:53:58 -0600
Subject: [torqueusers] Want to let pbs script use different cores from different nodes.
In-Reply-To: <4EC336F0.6070603@yahoo.com.cn>
References: <4EC336F0.6070603@yahoo.com.cn>
Message-ID: <20111116055358.GB9014@blackice.msi.umn.edu>

On Wed, Nov 16, 2011 at 12:07:12PM +0800, Hongsheng Zhao wrote:
> Hi all,
>
> Suppose I have two nodes, one with 8 cores and the other with 16
> cores. I want to write a PBS script that can use all 24 cores from
> these two nodes at the same time. What is the corresponding command
> to use in the PBS script? Any hints will be highly appreciated.
>

The following should get you 8 cores on one node, and 24 on another node:

#PBS -l nodes=1:ppn=8+1:ppn=16

HTH,
Gabe
--
Gabe Turner gabe at msi.umn.edu
HPC Systems Administrator,
University of Minnesota
Supercomputing Institute http://www.msi.umn.edu

From r.oostenveld at donders.ru.nl Wed Nov 16 01:52:20 2011
From: r.oostenveld at donders.ru.nl (Robert Oostenveld)
Date: Wed, 16 Nov 2011 09:52:20 +0100
Subject: [torqueusers] qrerun fails due to Unauthorized Request
Message-ID: <90DC71AA-3F6F-452E-B53D-6BE89457AEE4@donders.ru.nl>

Dear torque users,

I am trying to use qrerun in a shell script to deal with the (potential) limit in available MATLAB licenses. Let me briefly outline the idea before explaining the problem.

I have a shell script that starts MATLAB with the "-r " option for a MATLAB script. In case there is no license available, MATLAB returns immediately with a descriptive error about the license failure. I would like to catch that error and, if it happens, issue "qalter -h u JOBID" and "qrerun JOBID" to reschedule the job for execution at a later time.

Note that I am aware of the ability to configure floating resources in moab, but I am using maui. Furthermore, the floating resources for the MATLAB license don't optimally represent the license requirements for scheduling multiple jobs by the same user on a multicore machine. Hence I prefer to use qrerun instead of making the license a managed resource.

The problem I run into can be summarized in the following snippet from the command line.
I schedule a simple job that subsequently starts running on one of the execution hosts:

roboos at mentat001> echo sleep 1000 | qsub
45254.dccn-l014.dccn.nl

Then I try to use qrerun, first as regular user, then as super user (which I normally would not do, of course):

roboos at mentat001> qrerun 45254
qrerun: Unauthorized Request 45254.dccn-l014.dccn.nl

roboos at mentat001> sudo qrerun 45254
qrerun: Unauthorized Request MSG=operation not permitted 45254.dccn-l014.dccn.nl

So as root/administrative user I am also not allowed to do it from the client machine.

I am able to log in directly on the torque server, where as regular user I am also not allowed to qrerun. It is not a general failure of qrerun, since the root user on the torque server is allowed to use it:

roboos at mentat001> ssh torque
roboos at torque> qrerun 45254
qrerun: Unauthorized Request 45254.dccn-l014.dccn.nl

roboos at torque> sudo qrerun 45254

after which the job is correctly requeued and starts over again.

To provide some info from the log files: as regular user I get the following message in /var/spool/torque/server_logs

11/16/2011 09:36:55;0080;PBS_Server;Req;req_reject;Reject reply code=15018(Request invalid for state of job), aux=0, type=RerunJob, from roboos at mentat001.dccn.nl

and as root on the torque server I get

11/16/2011 09:38:12;0080;PBS_Server;Req;req_reject;Reject reply code=15018(Request invalid for state of job), aux=0, type=RerunJob, from root at dccn-l014.dccn.nl

The log message is basically the same. In the log on the execution host I cannot find anything that pertains to the failed qrerun request.

Does anyone have an idea what might be preventing the regular user from restarting the job? I tried the same thing on a different torque cluster (not managed by me) that I have access to, and there it failed as well.
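
For reference, the wrapper idea outlined earlier in this message can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names run_or_requeue and license_failed, the LICENSE_PATTERN text and the return code 75 are hypothetical, and the exact wording MATLAB prints on a license failure may differ between versions. PBS_JOBID is the job identifier that torque sets in the job's environment. The sketch shows the intended control flow only; it does not address the "Unauthorized Request" problem itself.

```shell
#!/bin/sh
# Sketch of a job wrapper: run a command, and if its output looks like a
# license failure, put a user hold on the job and ask the server to rerun it.

# Assumed text that signals a license failure; adjust to what MATLAB prints.
LICENSE_PATTERN="${LICENSE_PATTERN:-License checkout failed}"

# license_failed LOGFILE: succeeds if the log contains the failure pattern.
license_failed() {
    grep -q "$LICENSE_PATTERN" "$1"
}

# run_or_requeue LOGFILE COMMAND [ARGS...]
# Runs COMMAND with stdout and stderr captured in LOGFILE.
run_or_requeue() {
    log=$1
    shift
    "$@" > "$log" 2>&1
    status=$?
    if license_failed "$log"; then
        # The same two commands as in the description above.
        qalter -h u "$PBS_JOBID" && qrerun "$PBS_JOBID"
        return 75   # arbitrary code meaning "requeue requested"
    fi
    return "$status"
}
```

Inside the job script this would be called as, for example, run_or_requeue matlab.log matlab -r myscript, where myscript and matlab.log are placeholders.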

best regards,
Robert

-----------------------------------------------------------
Robert Oostenveld, PhD
Senior Researcher & MEG Physicist
Donders Institute for Brain, Cognition and Behaviour
Centre for Cognitive Neuroimaging
Radboud University Nijmegen
tel.: +31 (0)24 3619695
e-mail: r.oostenveld at donders.ru.nl
web: http://www.ru.nl/neuroimaging
skype: r.oostenveld
-----------------------------------------------------------

From zhaohscas at yahoo.com.cn Wed Nov 16 05:16:48 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Wed, 16 Nov 2011 20:16:48 +0800
Subject: [torqueusers] Want to let pbs script use different cores from different nodes.
In-Reply-To: <20111116055358.GB9014@blackice.msi.umn.edu>
References: <4EC336F0.6070603@yahoo.com.cn> <20111116055358.GB9014@blackice.msi.umn.edu>
Message-ID: <4EC3A9B0.2010202@yahoo.com.cn>

On 11/16/2011 01:53 PM, Gabe Turner wrote:
> The following should get you 8 cores on one node, and 24 on another node:

Do you mean 8 cores on one node and 16 on another node with the following line?

>
> #PBS -l nodes=1:ppn=8+1:ppn=16

Thanks a lot, I'll try it.

Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China

From miguel.gila at cscs.ch Wed Nov 16 09:24:56 2011
From: miguel.gila at cscs.ch (Gila Arrondo Miguel Angel)
Date: Wed, 16 Nov 2011 16:24:56 +0000
Subject: [torqueusers] Random SCP errors when transfering to/from CREAM sandbox
In-Reply-To: <8643FD43-67BA-4EE0-AA89-08975325E6AD@donders.ru.nl>
References: <74E21B4E-4520-4BD9-A4BB-5D7CF8DC0BE4@cscs.ch> <8643FD43-67BA-4EE0-AA89-08975325E6AD@donders.ru.nl>
Message-ID:

Dear Robert,

Many thanks for your answer. We've made sure that the keys are okay, and we also disabled host key checking to test it.
We've also tuned some TCP values in the hope that the transfers would not fail:

net.core.rmem_max = 87380000
net.core.wmem_max = 65536000
net.ipv4.tcp_rmem = 8192 873800 87380000
net.ipv4.tcp_wmem = 4096 655360 65536000

But none of this has worked. The WNs are connecting to the CREAMs via a vnic over infiniband. We know it's not the best scenario to debug network issues... but we have sustained gridftp connections along with a mix of other protocols and we have seen no problems so far. This must be something related to torque/ssh... Any other ideas??

Cheers,
Miguel

On Nov 11, 2011, at 9:41 AM, Robert Oostenveld wrote:

> Dear Miguel,
>
> On 10 Nov 2011, at 16:37, Gila Arrondo Miguel Angel wrote:
>> We are seeing a lot of "pbs_mom: scp" transfer errors in our /var/log/messages, but the files mentioned in these errors are there and are accessible.
>>
>> This is an example of error:
>> ...
>
> I don't know whether it may be related, but we have also had a problem with scp which I tracked down to the users having an incorrect (i.e. not up-to-date) .ssh/known_hosts file in their NFS shared home directory.
>
> We have many non-torque-cluster linux computers from which jobs can be submitted, and sometimes these are updated/reinstalled which invalidates the ssh host key that was previously assigned to their IP address. The users that had a correct known_hosts or that had specified StrictHostKeyChecking=no in their .ssh/config file did not have any problems, but the users that had an outdated known_hosts did encounter problems (but then only when submitting from one of the nodes where the host key had changed).
>
> The consequence was that on the torque execute hosts scp would look into the user's known_hosts, and depending on where the job was submitted from would find a correct (for some users) or an incorrect (for other users) host key for the submit client.
> > A possible solution would have been to ensure that all users' known_hosts was correct or that all users had StrictHostKeyChecking=no. But our specific solution was to specify > $usecp *:/home /home > in the /var/spool/torque/mom_priv/config, as scp was not needed anyway because of our shared NFS home directory. > > best regards, > Robert > > > ----------------------------------------------------------- > Robert Oostenveld, PhD > Senior Researcher & MEG Physicist > Donders Institute for Brain, Cognition and Behaviour > Centre for Cognitive Neuroimaging > Radboud University Nijmegen > tel.: +31 (0)24 3619695 > e-mail: r.oostenveld at donders.ru.nl > web: http://www.ru.nl/neuroimaging > skype: r.oostenveld > ----------------------------------------------------------- > > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- Miguel Gila CSCS Swiss National Supercomputing Centre HPC Solutions Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22 -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3239 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20111116/965fa968/attachment-0001.bin From samuel at unimelb.edu.au Wed Nov 16 19:29:44 2011 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Thu, 17 Nov 2011 13:29:44 +1100 Subject: [torqueusers] Random SCP errors when transfering to/from CREAM sandbox In-Reply-To: References: <74E21B4E-4520-4BD9-A4BB-5D7CF8DC0BE4@cscs.ch> <8643FD43-67BA-4EE0-AA89-08975325E6AD@donders.ru.nl> Message-ID: <4EC47198.1040709@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote: > Many thanks for your answer. 
We've made sure that the > keys are okay, as well as disabling hoskeychecking to > test it. Can you try and scp as that user to see whether it complains about anything else ? It may be that it is prompting the user to accept a host key if they don't already have it. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS =VPqK -----END PGP SIGNATURE----- From miguel.gila at cscs.ch Thu Nov 17 00:55:50 2011 From: miguel.gila at cscs.ch (Gila Arrondo Miguel Angel) Date: Thu, 17 Nov 2011 07:55:50 +0000 Subject: [torqueusers] Random SCP errors when transfering to/from CREAM sandbox In-Reply-To: <4EC47198.1040709@unimelb.edu.au> References: <74E21B4E-4520-4BD9-A4BB-5D7CF8DC0BE4@cscs.ch> <8643FD43-67BA-4EE0-AA89-08975325E6AD@donders.ru.nl> <4EC47198.1040709@unimelb.edu.au> Message-ID: <36DEB2B3-4C2B-4B95-8CE6-DFB1363A71EE@cscs.ch> Hi Chris, I've done that in many WNs and with different users, so I don't think that is be the issue. I've also checked for scheduled tasks that interact with the ssh keys, but the errors happen at random times, not when the scheduled tasks run... :-S I'm running out of options here. Cheers, Miguel On Nov 17, 2011, at 3:29 AM, Christopher Samuel wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote: > >> Many thanks for your answer. We've made sure that the >> keys are okay, as well as disabling hoskeychecking to >> test it. > > Can you try and scp as that user to see whether it > complains about anything else ? > > It may be that it is prompting the user to accept a > host key if they don't already have it. 
> > cheers, > Chris > - -- > Christopher Samuel - Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.unimelb.edu.au/ > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS > =VPqK > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- Miguel Gila CSCS Swiss National Supercomputing Centre HPC Solutions Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22 -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3239 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/214ea9d6/attachment.bin From RB.Ezhilalan at hse.ie Thu Nov 17 03:14:32 2011 From: RB.Ezhilalan at hse.ie (RB. Ezhilalan (Principal Physicist, CUH)) Date: Thu, 17 Nov 2011 10:14:32 -0000 Subject: [torqueusers] Parallel processing for MC code Message-ID: Hi All, I've been trying to set up Torque queuing system on two SUSE10.1 linux PCs (PIII!). Installed the linux on both PCs, exported home directory containing BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH password less communication. All seems to be working fine. Downloaded latest version of Torque (number not handy) installed PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2. PBS 'nodes' file was created as per guidelines, PBS_SERVER and QUEUE attributes were set as default. Pbsnodes -a command displays- two nodes (PC1 & PC2 and they are free. 
I am not sure whether this confirms PBS/Torque is set up correctly.

I was able to run an executable BEAMnrc user code in batch mode, i.e. using the 'exb' command, which is aliased to 'qsub' and sources a built-in job script file, with option p=1 (single job).

To split the job into two, so that it runs in parallel on the two PCs, option p=2 should be issued. However, what I noticed was that the job ran twice on the first PC (PC1) but not on both.

I can't figure out what went wrong. I suspect the PBS setup could have some issues. Maybe I can try running the job specifically on PC2; if so, what command do I need to give?

I would be grateful for any advice!

Kind Regards,
Ezhilalan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/06e4a798/attachment.html

From jwbacon at tds.net Thu Nov 17 09:18:18 2011
From: jwbacon at tds.net (Jason Bacon)
Date: Thu, 17 Nov 2011 10:18:18 -0600
Subject: [torqueusers] Parallel processing for MC code
In-Reply-To: References:
Message-ID: <4EC533CA.2000902@tds.net>

How many cores does PC1 have? Note that Torque schedules cores, not computers, unless you specifically tell it to with resource requirements.

Regards,
-J

On 11/17/11 04:14, RB. Ezhilalan (Principal Physicist, CUH) wrote:
>
> Hi All,
>
> I've been trying to set up Torque queuing system on two SUSE10.1 linux
> PCs (PIII!).
>
> Installed the linux on both PCs, exported home directory containing
> BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH
> password less communication. All seems to be working fine.
>
> Downloaded latest version of Torque (number not handy) installed
> PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2.
>
> PBS 'nodes' file was created as per guidelines, PBS_SERVER and QUEUE
> attributes were set as default.
>
> Pbsnodes -a command displays- two nodes (PC1 & PC2 and they are free.
> I am not sure whether this confirms PBS/Torque set up correctly.
>
> I was able to run an executable BEAMnrc user code in batch mode i.e
> using 'exb' command aliased to 'qsub' and sources a built in job
> script file with option p=1 (single job).
>
> To split the jobs in to two, so that it runs in parallel on the two
> PCs, option p=2 should be issued. However, what I noticed was, the job
> ran twice on the first PC (PC1) but not on both.
>
> I can't figure out what went wrong, I suspect PBS setup could have
> some issues, May be I can try running the job specifically on PC2 if
> so what command I need to give?
>
> I would be grateful for any advice!
>
> Kind Regards,
>
> Ezhilalan
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason W. Bacon
jwbacon at tds.net
http://personalpages.tds.net/~jwbacon
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From steve.traylen at cern.ch Thu Nov 17 10:19:14 2011
From: steve.traylen at cern.ch (Steve Traylen)
Date: Thu, 17 Nov 2011 18:19:14 +0100
Subject: [torqueusers] File staging syntax
In-Reply-To: <09eea517-55d2-41ed-a7a3-6f73587943db@mail>
References: <1e5b02c0-6a48-4bde-9247-964ca25edfea@zimbra.scai.fraunhofer.de> <09eea517-55d2-41ed-a7a3-6f73587943db@mail>
Message-ID:

On Thu, Sep 29, 2011 at 4:59 PM, Ken Nielson wrote:
> André,
>
> I have not yet had time to reproduce this. I did look through the change log and there are two suspects. One is in 2.5.6, a fix for Bugzilla 115 and the other is in 2.5.8, a fix for Bugzilla 133.
>
> That is as far as I am right now. I will try to get to this as soon as I can.

Hi Ken,

Did you manage to track this down? It's currently making upgrading a pain.

Steve.
-- Steve Traylen From stevenx.a.duchene at intel.com Thu Nov 17 12:38:07 2011 From: stevenx.a.duchene at intel.com (DuChene, StevenX A) Date: Thu, 17 Nov 2011 11:38:07 -0800 Subject: [torqueusers] No current maintained web monitoring interfaces, job_monarch and pbswebmon broken In-Reply-To: References: <4EC079AB.4090405@unimelb.edu.au> Message-ID: As far as I can tell there have not been an updated release of job_monarch since 2008. I bet there have been no updates to take into account the XML changes in newer versions of Ganglia. In fact this exact question has been discussed on the Jobmonarch-users mailing list but there has been no responses from anyone on the development team. Also I do not see anyone from the development team participating in the support discussion lists on sourceforge which are linked to from the jab_monarch page you provided below. In fact one person posted the following message to the Jobmonarch-users mailing list on 2011-04-05: "I take that back. I too have tried to get it to work and have been unsuccessful. I wanted to get my hands on the 'unstable' version (Job Monarch version 1.0-rc1-SVN) running on sara, but nobody there seems to respond. Even when I spoke with them at SuperComputing." As a result I do not consider this an active or maintained opensource project. Active opensource projects do regular updates & code releases and they actually respond to users' questions. -- Steven DuChene -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Bas van der Vlies Sent: Tuesday, November 15, 2011 9:42 PM To: Torque Users Mailing List Subject: Re: [torqueusers] No current maintained web monitoring interfaces, job_monarch and pbswebmon broken Steven, We still use it at our site and it's maintained. Just go the jobmon_arch website and report your problem. 
The site is:
* https://subtrac.sara.nl/oss/jobmonarch

Example:
* http://ganglia.sara.nl/?c=GINA%20Cluster&m=load_one&r=hour&s=descending&hc=4&mc=2

--
Bas van der Vlies

On Nov 15, 2011, at 16:46, "DuChene, StevenX A" wrote:

> I spent all day trying to get job_monarch working and only met with dismal failure. The jobmond.py daemon runs but does not seem to communicate with ganglia or torque. After much debugging and looking at the underlying python code I finally gave up.
>
> The various example python scripts packaged with the pbs_python module source seem to work just fine so I know the pbs_python bits work OK.
>
> I then downloaded pbswebmon only to discover it has never been updated for the 4.X versions of the pbs_python package and gives an error when it tries to query jobs.
>
> It seems that neither one of these web based job control monitoring packages is maintained any longer. Are there any current web based open source torque monitoring packages available?
> --
> Steven DuChene
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From lance at quantumbioinc.com Thu Nov 17 15:47:31 2011
From: lance at quantumbioinc.com (Lance Westerhoff)
Date: Thu, 17 Nov 2011 17:47:31 -0500
Subject: [torqueusers] procs= not working as documented
Message-ID: <5BE7BF30-0EAF-4DF0-B4E2-DECA32FEB99C@quantumbioinc.com>

Hello All-

It appears that when running with the following specs, the procs= option does not actually work as expected.

==========================================

#PBS -S /bin/bash
#PBS -l procs=60
#PBS -l pmem=700mb
#PBS -l walltime=744:00:00
#PBS -j oe
#PBS -q batch

torque version: tried 3.0.2. in v2.5.4, I think the procs option worked as documented
maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU)

==========================================

If there are fewer than 60 processors available in the cluster (in this case there were 53 available) the job will go in and take whatever is left instead of waiting for all 60 processors to free up.

Any thoughts as to why this might be happening? Sometimes it doesn't really matter and 53 would be almost as good as 60, however if only 2 processors are available and the user asks for 60, I would hate for him to go in.

Thank you for your time!

-Lance

From brockp at umich.edu Thu Nov 17 18:29:17 2011
From: brockp at umich.edu (Brock Palen)
Date: Thu, 17 Nov 2011 17:29:17 -0800
Subject: [torqueusers] procs= not working as documented
In-Reply-To: <5BE7BF30-0EAF-4DF0-B4E2-DECA32FEB99C@quantumbioinc.com>
Message-ID: <20111118012930.C635E83A8026@mail.adaptivecomputing.com>

Does maui only see one cpu or does mpiexec only see one cpu?

Brock Palen
(734)936-1985
brockp at umich.edu
- Sent from my Palm Pre, please excuse typos

On Nov 17, 2011 3:19 PM, Lance Westerhoff <lance at quantumbioinc.com> wrote:

Hello All-

It appears that when running with the following specs, the procs= option does not actually work as expected.

==========================================

#PBS -S /bin/bash
#PBS -l procs=60
#PBS -l pmem=700mb
#PBS -l walltime=744:00:00
#PBS -j oe
#PBS -q batch

torque version: tried 3.0.2. in v2.5.4, I think the procs option worked as documented
maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU)

==========================================

If there are fewer than 60 processors available in the cluster (in this case there were 53 available) the job will go in and take whatever is left instead of waiting for all 60 processors to free up.
Any thoughts as to why this might be happening? Sometimes it doesn't really matter and 53 would be almost as good as 60, however if only 2 processors are available and the user asks for 60, I would hate for him to go in.

Thank you for your time!

-Lance
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/e829cfd3/attachment.html

From RB.Ezhilalan at hse.ie Fri Nov 18 02:48:10 2011
From: RB.Ezhilalan at hse.ie (RB. Ezhilalan (Principal Physicist, CUH))
Date: Fri, 18 Nov 2011 09:48:10 -0000
Subject: [torqueusers] Parallel processing for MC code
In-Reply-To: References:
Message-ID:

Hi Jason,

PC1 (linux-01) is a single core PC like PC2, I defined the server_priv/nodes file as;

Linux-01
Linux-02

As you have mentioned may be resource requirement needs to be properly set up. Do you have any suggestions?

Many thanks,
Ezhilalan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111118/92cb300a/attachment-0001.html

From jwbacon at tds.net Fri Nov 18 06:57:23 2011
From: jwbacon at tds.net (Jason Bacon)
Date: Fri, 18 Nov 2011 07:57:23 -0600
Subject: [torqueusers] Parallel processing for MC code
In-Reply-To: References:
Message-ID: <4EC66443.3080608@tds.net>

I was only wondering if you had "np=2" in the Linux-01 entry, or if Torque was configured to autodetect the number of cores and there were two. That would have explained the scheduling behavior.

Regards,
-J

On 11/18/11 03:48, RB.
Ezhilalan (Principal Physicist, CUH) wrote: > > Hi Jason, > > PC1 (linux-01) is a single core PC like PC2, I defined the > server_priv/nodes file as; > > Linux-01 > > Linux-02 > > As you have mentioned may be resource requirement needs to be properly > set up. Do you have any suggestions? > > Many thanks, > > Ezhilalan > > -----Original Message----- > From: torqueusers-bounces at supercluster.org > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of > torqueusers-request at supercluster.org > Sent: 17 November 2011 17:20 > To: torqueusers at supercluster.org > Subject: torqueusers Digest, Vol 88, Issue 14 > > Send torqueusers mailing list submissions to > > torqueusers at supercluster.org > > To subscribe or unsubscribe via the World Wide Web, visit > > http://www.supercluster.org/mailman/listinfo/torqueusers > > or, via email, send a message with subject or body 'help' to > > torqueusers-request at supercluster.org > > You can reach the person managing the list at > > torqueusers-owner at supercluster.org > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of torqueusers digest..." > > Today's Topics: > > 1. Re: Random SCP errors when transfering to/from CREAM sandbox > > (Christopher Samuel) > > 2. Re: Random SCP errors when transfering to/from CREAM sandbox > > (Gila Arrondo Miguel Angel) > > 3. Parallel processing for MC code > > (RB. Ezhilalan (Principal Physicist, CUH)) > > 4. Re: Parallel processing for MC code (Jason Bacon) > > 5. 
Re: File staging syntax (Steve Traylen) > > ---------------------------------------------------------------------- > > Message: 1 > > Date: Thu, 17 Nov 2011 13:29:44 +1100 > > From: Christopher Samuel > > Subject: Re: [torqueusers] Random SCP errors when transfering to/from > > CREAM sandbox > > To: torqueusers at supercluster.org > > Message-ID: <4EC47198.1040709 at unimelb.edu.au> > > Content-Type: text/plain; charset=ISO-8859-1 > > -----BEGIN PGP SIGNED MESSAGE----- > > Hash: SHA1 > > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote: > > > Many thanks for your answer. We've made sure that the > > > keys are okay, as well as disabling hoskeychecking to > > > test it. > > Can you try and scp as that user to see whether it > > complains about anything else ? > > It may be that it is prompting the user to accept a > > host key if they don't already have it. > > cheers, > > Chris > > - -- > > Christopher Samuel - Senior Systems Administrator > > VLSCI - Victorian Life Sciences Computation Initiative > > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > > http://www.vlsci.unimelb.edu.au/ > > -----BEGIN PGP SIGNATURE----- > > Version: GnuPG v1.4.11 (GNU/Linux) > > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW > > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS > > =VPqK > > -----END PGP SIGNATURE----- > > ------------------------------ > > Message: 2 > > Date: Thu, 17 Nov 2011 07:55:50 +0000 > > From: "Gila Arrondo Miguel Angel" > > Subject: Re: [torqueusers] Random SCP errors when transfering to/from > > CREAM sandbox > > To: Torque Users Mailing List > > Message-ID: <36DEB2B3-4C2B-4B95-8CE6-DFB1363A71EE at cscs.ch> > > Content-Type: text/plain; charset="us-ascii" > > Hi Chris, > > I've done that in many WNs and with different users, so I don't think > that is be the issue. 
I've also checked for scheduled tasks that > interact with the ssh keys, but the errors happen at random times, not > when the scheduled tasks run... :-S > > I'm running out of options here. > > Cheers, > > Miguel > > On Nov 17, 2011, at 3:29 AM, Christopher Samuel wrote: > > > -----BEGIN PGP SIGNED MESSAGE----- > > > Hash: SHA1 > > > > > > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote: > > > > > >> Many thanks for your answer. We've made sure that the > > >> keys are okay, as well as disabling hoskeychecking to > > >> test it. > > > > > > Can you try and scp as that user to see whether it > > > complains about anything else ? > > > > > > It may be that it is prompting the user to accept a > > > host key if they don't already have it. > > > > > > cheers, > > > Chris > > > - -- > > > Christopher Samuel - Senior Systems Administrator > > > VLSCI - Victorian Life Sciences Computation Initiative > > > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > > > http://www.vlsci.unimelb.edu.au/ > > > > > > -----BEGIN PGP SIGNATURE----- > > > Version: GnuPG v1.4.11 (GNU/Linux) > > > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > > > > > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW > > > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS > > > =VPqK > > > -----END PGP SIGNATURE----- > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- > > Miguel Gila > > CSCS Swiss National Supercomputing Centre > > HPC Solutions > > Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland > > miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22 > > -------------- next part -------------- > > A non-text attachment was scrubbed... 
> > Name: smime.p7s > > Type: application/pkcs7-signature > > Size: 3239 bytes > > Desc: not available > > Url : > http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/214ea9d6/attachment-0001.bin > > ------------------------------ > > Message: 3 > > Date: Thu, 17 Nov 2011 10:14:32 -0000 > > From: "RB. Ezhilalan (Principal Physicist, CUH)" > > Subject: [torqueusers] Parallel processing for MC code > > To: torqueusers at supercluster.org > > Message-ID: > > > > Content-Type: text/plain; charset="us-ascii" > > Hi All, > > I've been trying to set up Torque queuing system on two SUSE10.1 linux > > PCs (PIII!). > > Installed the linux on both PCs, exported home directory containing > > BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH password > > less communication. All seems to be working fine. > > Downloaded latest version of Torque (number not handy) installed > > PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2. > > PBS 'nodes' file was created as per guidelines, PBS_SERVER and QUEUE > > attributes were set as default. > > Pbsnodes -a command displays- two nodes (PC1 & PC2 and they are free. I > > am not sure whether this confirms PBS/Torque set up correctly. > > I was able to run an executable BEAMnrc user code in batch mode i.e > > using 'exb' command aliased to 'qsub' and sources a built in job script > > file with option p=1 (single job). > > To split the jobs in to two, so that it runs in parallel on the two PCs, > > option p=2 should be issued. However, what I noticed was, the job ran > > twice on the first PC (PC1) but not on both. > > I can't figure out what went wrong, I suspect PBS setup could have some > > issues, May be I can try running the job specifically on PC2 if so what > > command I need to give? > > I would be grateful for any advice! > > Kind Regards, > > Ezhilalan > > -------------- next part -------------- > > An HTML attachment was scrubbed... 
> > URL: > http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/06e4a798/attachment-0001.html > > > ------------------------------ > > Message: 4 > > Date: Thu, 17 Nov 2011 10:18:18 -0600 > > From: Jason Bacon > > Subject: Re: [torqueusers] Parallel processing for MC code > > To: Torque Users Mailing List > > Message-ID: <4EC533CA.2000902 at tds.net> > > Content-Type: text/plain; charset=windows-1252; format=flowed > > How many cores does PC1 have? Note that Torque schedules cores, not > > computers, unless you specifically tell it to with resource requirements. > > Regards, > > -J > > On 11/17/11 04:14, RB. Ezhilalan (Principal Physicist, CUH) wrote: > > > > > > Hi All, > > > > > > I've been trying to set up Torque queuing system on two SUSE10.1 linux > > > PCs (PIII!). > > > > > > Installed the linux on both PCs, exported home directory containing > > > BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH > > > password less communication. All seems to be working fine. > > > > > > Downloaded latest version of Torque (number not handy) installed > > > PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2. > > > > > > PBS 'nodes' file was created as per guidelines, PBS_SERVER and QUEUE > > > attributes were set as default. > > > > > > Pbsnodes -a command displays- two nodes (PC1 & PC2 and they are free. > > > I am not sure whether this confirms PBS/Torque set up correctly. > > > > > > I was able to run an executable BEAMnrc user code in batch mode i.e > > > using 'exb' command aliased to 'qsub' and sources a built in job > > > script file with option p=1 (single job). > > > > > > To split the jobs in to two, so that it runs in parallel on the two > > > PCs, option p=2 should be issued. However, what I noticed was, the job > > > ran twice on the first PC (PC1) but not on both.
> > > > > > I can't figure out what went wrong, I suspect PBS setup could have > > > some issues, May be I can try running the job specifically on PC2 if > > > so what command I need to give? > > > > > > I would be grateful for any advice! > > > > > > Kind Regards, > > > > > > Ezhilalan > > > > > > > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > Jason W. Bacon > > jwbacon at tds.net > > http://personalpages.tds.net/~jwbacon > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > ------------------------------ > > Message: 5 > > Date: Thu, 17 Nov 2011 18:19:14 +0100 > > From: Steve Traylen > > Subject: Re: [torqueusers] File staging syntax > > To: Torque Users Mailing List > > Message-ID: > > > > Keywords: CERN SpamKiller Note: -50 > > Content-Type: text/plain; charset="ISO-8859-1" > > On Thu, Sep 29, 2011 at 4:59 PM, Ken Nielson > > wrote: > > > André, > > > > > > I have not yet had time to reproduce this. I did look through the > change log and there are two suspects. One is in 2.5.6, a fix for > Bugzilla 115 and the other is in 2.5.8, a fix for Bugzilla 133. > > > > > > That is as far as I am right now. I will try to get to this as soon > as I can. > > Hi Ken, > > Did you manage to track this down. It's currently making upgrading a > pain. > > Steve.
> > -- > > Steve Traylen > > ------------------------------ > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > End of torqueusers Digest, Vol 88, Issue 14 > > ******************************************* > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Jason W. Bacon jwbacon at tds.net http://personalpages.tds.net/~jwbacon ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From lance at quantumbioinc.com Fri Nov 18 07:33:12 2011 From: lance at quantumbioinc.com (Lance Westerhoff) Date: Fri, 18 Nov 2011 09:33:12 -0500 Subject: [torqueusers] procs= not working as documented In-Reply-To: References: Message-ID: The request that is placed is for procs=60. Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway). I'm actually beginning to think the problem may be related to maui. Perhaps I'll post this same question to the maui list and see what comes back. This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system). 
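Until the scheduler side is sorted out, a job can at least refuse to run short-handed. A minimal sketch in Python, assuming Torque writes one line per allocated processor slot to the file named by $PBS_NODEFILE; the function names and the sample host name are made up for illustration and are not part of Torque or Maui:

```python
def count_slots(nodefile_text):
    """Count allocated slots, assuming one non-empty line per processor slot."""
    return sum(1 for line in nodefile_text.splitlines() if line.strip())


def has_full_allocation(requested, nodefile_text):
    """Return True only when at least `requested` slots were handed to the job."""
    return count_slots(nodefile_text) >= requested


# Illustration with the numbers from this thread: 60 requested, 53 granted.
# "compute-0-1" is a placeholder host name, not taken from a real nodefile.
sample_nodefile = "compute-0-1\n" * 53
if not has_full_allocation(60, sample_nodefile):
    print("short allocation: %d of 60 slots" % count_slots(sample_nodefile))
```

In a real job the text would be read from the file that $PBS_NODEFILE points to, and the job would exit (or requeue itself) instead of printing.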
-Lance > > Message: 3 > Date: Thu, 17 Nov 2011 17:29:17 -0800 > From: "Brock Palen" > Subject: Re: [torqueusers] procs= not working as documented > To: "Torque Users Mailing List" > Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> > Content-Type: text/plain; charset="utf-8" > > Does maui only see one cpu or does mpiexec only see one cpu? > > > > Brock Palen > (734)936-1985 > brockp at umich.edu > - Sent from my Palm Pre, please excuse typos > On Nov 17, 2011 3:19 PM, Lance Westerhoff <lance at quantumbioinc.com> wrote: > > > > Hello All- > > > > It appears that when running with the following specs, the procs= option does not actually work as expected. > > > > ========================================== > > > > #PBS -S /bin/bash > > #PBS -l procs=60 > > #PBS -l pmem=700mb > > #PBS -l walltime=744:00:00 > > #PBS -j oe > > #PBS -q batch > > > > torque version: tried 3.0.2. in v2.5.4, I think the procs option worked as documented > > maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU) > > > > ========================================== > > > > If there are fewer than 60 processors available in the cluster (in this case there were 53 available) the job will go in and take whatever is left instead of waiting for all 60 processors to free up. Any thoughts as to why this might be happening? Sometimes it doesn't really matter and 53 would be almost as good as 60, however if only 2 processors are available and the user asks for 60, I would hate for him to go in. > > > > Thank you for your time!
> > > > -Lance > > > > From scrusan at ur.rochester.edu Fri Nov 18 07:47:24 2011 From: scrusan at ur.rochester.edu (Steve Crusan) Date: Fri, 18 Nov 2011 09:47:24 -0500 Subject: [torqueusers] procs= not working as documented In-Reply-To: References: Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: > The request that is placed is for procs=60. Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway). Hi Lance, Can you post the output of checkjob of an incorrectly running job. Let's take a look at what Maui thinks the job is asking for. Might as well add your maui.cfg file also. I've found in the past that procs= is troublesome... > > I'm actually beginning to think the problem may be related to maui. Perhaps I'll post this same question to the maui list and see what comes back. > > This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system). Agreed. HPC cluster job management is normally be set it and forget it. Anything else other than maintenance/break fixes/new features would be ridiculously time consuming. 
> > -Lance > > >> >> Message: 3 >> Date: Thu, 17 Nov 2011 17:29:17 -0800 >> From: "Brock Palen" >> Subject: Re: [torqueusers] procs= not working as documented >> To: "Torque Users Mailing List" >> Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> >> Content-Type: text/plain; charset="utf-8" >> >> Does maui only see one cpu or does mpiexec only see one cpu? >> >> >> >> Brock Palen >> (734)936-1985 >> brockp at umich.edu >> - Sent from my Palm Pre, please excuse typos >> On Nov 17, 2011 3:19 PM, Lance Westerhoff <lance at quantumbioinc.com> wrote: >> >> >> >> Hello All- >> >> >> >> It appears that when running with the following specs, the procs= option does not actually work as expected. >> >> >> >> ========================================== >> >> >> >> #PBS -S /bin/bash >> >> #PBS -l procs=60 >> >> #PBS -l pmem=700mb >> >> #PBS -l walltime=744:00:00 >> >> #PBS -j oe >> >> #PBS -q batch >> >> >> >> torque version: tried 3.0.2. in v2.5.4, I think the procs option worked as documented >> >> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU) >> >> >> >> ========================================== >> >> >> >> If there are fewer then 60 processors available in the cluster (in this case there were 53 available) the job will go in an take whatever is left instead of waiting for all 60 processors to free up. Any thoughts as to why this might be happening? Sometimes it doesn't really matter and 53 would be almost as good as 60, however if only 2 processors are available and the user asks for 60, I would hate for him to go in. >> >> >> >> Thank you for your time! 
>> >> >> >> -Lance >> >> >> >> > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= =AGW7 -----END PGP SIGNATURE----- From lance at quantumbioinc.com Fri Nov 18 09:12:06 2011 From: lance at quantumbioinc.com (Lance Westerhoff) Date: Fri, 18 Nov 2011 11:12:06 -0500 Subject: [torqueusers] procs= not working as documented In-Reply-To: References: Message-ID: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE@quantumbioinc.com> Hi Steve- Here you go. Here is the top few lines of the job script. I have then provided the output you requested long with the maui.cfg. If you need anything further, certainly please let me know. Thanks for your help! 
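A note on reading the report below: the job script asks for procs=100, whereas the entries under Allocated Nodes add up to 26 task slots, which agrees with the TaskCount: 26 line. A minimal Python sketch of that tally, assuming the bracketed entries have the form [host:tasks] as they do in this report; the function name is made up for illustration:

```python
import re


def allocated_tasks(checkjob_text):
    """Sum the per-node task counts from '[host:count]' entries in checkjob output."""
    pairs = re.findall(r"\[([^\[\]:]+):(\d+)\]", checkjob_text)
    return sum(int(count) for _host, count in pairs)


# The Allocated Nodes entries copied from the checkjob report in this message.
report = (
    "[compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3]"
    "[compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2]"
    "[compute-0-13:2][compute-0-14:2]"
)
requested = 100  # from "#PBS -l procs=100"
granted = allocated_tasks(report)
print("requested %d, granted %d" % (requested, granted))  # requested 100, granted 26
```

Fields such as [NONE] or Req[0] carry no host:count pair, so the pattern skips them.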
===============

+ head job.pbs

#!/bin/bash
#PBS -S /bin/bash
#PBS -l procs=100
#PBS -l pmem=700mb
#PBS -l walltime=744:00:00
#PBS -j oe
#PBS -q batch

Report run on Fri Nov 18 10:49:38 EST 2011

+ pbsnodes --version
version: 3.0.2

+ diagnose --version
maui client version 3.2.6p21

+ checkjob 371010

checking job 371010

State: Running
Creds: user:josh group:games class:batch qos:DEFAULT
WallTime: 00:02:35 of 31:00:00:00
SubmitTime: Fri Nov 18 10:46:33
  (Time Queued Total: 00:00:01 Eligible: 00:00:01)

StartTime: Fri Nov 18 10:46:34
Total Tasks: 1

Req[0] TaskCount: 26 Partition: DEFAULT
Network: [NONE] Memory >= 700M Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Dedicated Resources Per Task: PROCS: 1 MEM: 700M
NodeCount: 10
Allocated Nodes:
[compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3]
[compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2]
[compute-0-13:2][compute-0-14:2]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE

Reservation '371010' (-00:02:09 -> 30:23:57:51 Duration: 31:00:00:00)
PE: 26.00 StartPriority: 4716

+ cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]"

SERVERHOST gondor
ADMIN1 maui root
ADMIN3 ALL
RMCFG[base] TYPE=PBS
AMCFG[bank] TYPE=NONE
RMPOLLINTERVAL 00:01:00
SERVERPORT 42559
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
QUEUETIMEWEIGHT 1
FSPOLICY DEDICATEDPS
FSDEPTH 7
FSINTERVAL 86400
FSDECAY 0.50
FSWEIGHT 200
FSUSERWEIGHT 1
FSGROUPWEIGHT 1000
FSQOSWEIGHT 1000
FSACCOUNTWEIGHT 1
FSCLASSWEIGHT 1000
USERWEIGHT 4
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY MINRESOURCE
RESERVATIONDEPTH 8
MAXJOBPERUSERPOLICY OFF
MAXJOBPERUSERCOUNT 8
MAXPROCPERUSERPOLICY OFF
MAXPROCPERUSERCOUNT 256
MAXPROCSECONDPERUSERPOLICY OFF
MAXPROCSECONDPERUSERCOUNT 36864000
MAXJOBQUEUEDPERUSERPOLICY OFF
MAXJOBQUEUEDPERUSERCOUNT 2
JOBNODEMATCHPOLICY EXACTNODE
NODEACCESSPOLICY SHARED
JOBMAXOVERRUN 99:00:00:00
DEFERCOUNT 8192
DEFERTIME 0
CLASSCFG[developer] FSTARGET=40.00+
CLASSCFG[lowprio] PRIORITY=-1000
SRCFG[developer] CLASSLIST=developer
SRCFG[developer] ACCESS=dedicated
SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri
SRCFG[developer] STARTTIME=08:00:00
SRCFG[developer] ENDTIME=18:00:00
SRCFG[developer] TIMELIMIT=2:00:00
SRCFG[developer] RESOURCES=PROCS(8)
USERCFG[DEFAULT] FSTARGET=100.0

===============

-Lance

On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > > On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: > >> The request that is placed is for procs=60. Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway). > > Hi Lance, > > Can you post the output of checkjob of an incorrectly running job. Let's take a look at what Maui thinks the job is asking for. > > Might as well add your maui.cfg file also. > > I've found in the past that procs= is troublesome... > >> >> I'm actually beginning to think the problem may be related to maui. Perhaps I'll post this same question to the maui list and see what comes back. >> >> This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system). > > Agreed. HPC cluster job management is normally be set it and forget it. Anything else other than maintenance/break fixes/new features would be ridiculously time consuming.
> >> >> -Lance >> >> >>> >>> Message: 3 >>> Date: Thu, 17 Nov 2011 17:29:17 -0800 >>> From: "Brock Palen" >>> Subject: Re: [torqueusers] procs= not working as documented >>> To: "Torque Users Mailing List" >>> Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> >>> Content-Type: text/plain; charset="utf-8" >>> >>> Does maui only see one cpu or does mpiexec only see one cpu? >>> >>> >>> >>> Brock Palen >>> (734)936-1985 >>> brockp at umich.edu >>> - Sent from my Palm Pre, please excuse typos >>> On Nov 17, 2011 3:19 PM, Lance Westerhoff <lance at quantumbioinc.com> wrote: >>> >>> >>> >>> Hello All- >>> >>> >>> >>> It appears that when running with the following specs, the procs= option does not actually work as expected. >>> >>> >>> >>> ========================================== >>> >>> >>> >>> #PBS -S /bin/bash >>> >>> #PBS -l procs=60 >>> >>> #PBS -l pmem=700mb >>> >>> #PBS -l walltime=744:00:00 >>> >>> #PBS -j oe >>> >>> #PBS -q batch >>> >>> >>> >>> torque version: tried 3.0.2. in v2.5.4, I think the procs option worked as documented >>> >>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU) >>> >>> >>> >>> ========================================== >>> >>> >>> >>> If there are fewer then 60 processors available in the cluster (in this case there were 53 available) the job will go in an take whatever is left instead of waiting for all 60 processors to free up. Any thoughts as to why this might be happening? Sometimes it doesn't really matter and 53 would be almost as good as 60, however if only 2 processors are available and the user asks for 60, I would hate for him to go in. >>> >>> >>> >>> Thank you for your time! 
>>> >>> >>> >>> -Lance >>> >>> >>> >>> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > ---------------------- > Steve Crusan > System Administrator > Center for Research Computing > University of Rochester > https://www.crc.rochester.edu/ > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG/MacGPG2 v2.0.17 (Darwin) > Comment: GPGTools - http://gpgtools.org > > iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ > bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 > cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ > tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 > JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv > Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= > =AGW7 > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From RB.Ezhilalan at hse.ie Fri Nov 18 09:44:30 2011 From: RB.Ezhilalan at hse.ie (RB. Ezhilalan (Principal Physicist, CUH)) Date: Fri, 18 Nov 2011 16:44:30 -0000 Subject: [torqueusers] torqueusers Digest, Vol 88, Issue 16 In-Reply-To: References: Message-ID: Jason, I had linux-01 np=1, linux-02 np=1 in the nodes file, despite this, the job ran on one core (linux-01) only. Then I removed the 'np' option from the list under the notion, the system will 'autodetect' the cores. Ezhilalan Ezhilalan Ramalingam M.Sc.,DABR., Principal Physicist (Radiotherapy), Medical Physics Department, Cork University Hospital, Wilton, Cork Ireland Tel. 
00353 21 4922533 Fax.00353 21 4921300 Email: rb.ezhilalan at hse.ie -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of torqueusers-request at supercluster.org Sent: 18 November 2011 16:12 To: torqueusers at supercluster.org Subject: torqueusers Digest, Vol 88, Issue 16 Send torqueusers mailing list submissions to torqueusers at supercluster.org To subscribe or unsubscribe via the World Wide Web, visit http://www.supercluster.org/mailman/listinfo/torqueusers or, via email, send a message with subject or body 'help' to torqueusers-request at supercluster.org You can reach the person managing the list at torqueusers-owner at supercluster.org When replying, please edit your Subject line so it is more specific than "Re: Contents of torqueusers digest..." Today's Topics: 1. Re: Parallel processing for MC code (Jason Bacon) 2. Re: procs= not working as documented (Lance Westerhoff) 3. Re: procs= not working as documented (Steve Crusan) 4. Re: procs= not working as documented (Lance Westerhoff) ---------------------------------------------------------------------- Message: 1 Date: Fri, 18 Nov 2011 07:57:23 -0600 From: Jason Bacon Subject: Re: [torqueusers] Parallel processing for MC code To: Torque Users Mailing List Message-ID: <4EC66443.3080608 at tds.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed I was only wondering if you had "np=2" in the Linux-01 entry, or if Torque was configured to autodetect the number of cores and there were two. That would have explained the scheduling behavior. Regards, -J On 11/18/11 03:48, RB. Ezhilalan (Principal Physicist, CUH) wrote: > > Hi Jason, > > PC1 (linux-01) is a single core PC like PC2, I defined the > server_priv/nodes file as; > > Linux-01 > > Linux-02 > > As you have mentioned may be resource requirement needs to be properly > set up. Do you have any suggestions? 
> > Many thanks, > > Ezhilalan > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Jason W. Bacon jwbacon at tds.net http://personalpages.tds.net/~jwbacon ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ------------------------------ Message: 2 Date: Fri, 18 Nov 2011 09:33:12 -0500 From: Lance Westerhoff Subject: Re: [torqueusers] procs= not working as documented To: torqueusers at supercluster.org Message-ID: Content-Type: text/plain; charset=us-ascii The request that is placed is for procs=60.
Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway). I'm actually beginning to think the problem may be related to maui. Perhaps I'll post this same question to the maui list and see what comes back. This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system). -Lance > > Message: 3 > Date: Thu, 17 Nov 2011 17:29:17 -0800 > From: "Brock Palen" > Subject: Re: [torqueusers] procs= not working as documented > To: "Torque Users Mailing List" > Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> > Content-Type: text/plain; charset="utf-8" > > Does maui only see one cpu or does mpiexec only see one cpu? > > > > Brock Palen > (734)936-1985 > brockp at umich.edu > - Sent from my Palm Pre, please excuse typos > On Nov 17, 2011 3:19 PM, Lance Westerhoff <lance at quantumbioinc.com> wrote: > > > > Hello All- > > > > It appears that when running with the following specs, the procs= option does not actually work as expected. > > > > ========================================== > > > > #PBS -S /bin/bash > > #PBS -l procs=60 > > #PBS -l pmem=700mb > > #PBS -l walltime=744:00:00 > > #PBS -j oe > > #PBS -q batch > > > > torque version: tried 3.0.2. 
in v2.5.4, I think the procs option worked as documented > > maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU) > > > > ========================================== > > > > If there are fewer then 60 processors available in the cluster (in this case there were 53 available) the job will go in an take whatever is left instead of waiting for all 60 processors to free up. Any thoughts as to why this might be happening? Sometimes it doesn't really matter and 53 would be almost as good as 60, however if only 2 processors are available and the user asks for 60, I would hate for him to go in. > > > > Thank you for your time! > > > > -Lance > > > > ------------------------------ Message: 3 Date: Fri, 18 Nov 2011 09:47:24 -0500 From: Steve Crusan Subject: Re: [torqueusers] procs= not working as documented To: Torque Users Mailing List Message-ID: Content-Type: text/plain; charset=us-ascii -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: > The request that is placed is for procs=60. Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway). Hi Lance, Can you post the output of checkjob of an incorrectly running job. Let's take a look at what Maui thinks the job is asking for. Might as well add your maui.cfg file also. I've found in the past that procs= is troublesome... > > I'm actually beginning to think the problem may be related to maui. 
Perhaps I'll post this same question to the maui list and see what comes back. > > This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system). Agreed. HPC cluster job management is normally be set it and forget it. Anything else other than maintenance/break fixes/new features would be ridiculously time consuming. > > -Lance > > >> >> Message: 3 >> Date: Thu, 17 Nov 2011 17:29:17 -0800 >> From: "Brock Palen" >> Subject: Re: [torqueusers] procs= not working as documented >> To: "Torque Users Mailing List" >> Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> >> Content-Type: text/plain; charset="utf-8" >> >> Does maui only see one cpu or does mpiexec only see one cpu? >> >> >> >> Brock Palen >> (734)936-1985 >> brockp at umich.edu >> - Sent from my Palm Pre, please excuse typos >> On Nov 17, 2011 3:19 PM, Lance Westerhoff <lance at quantumbioinc.com> wrote: >> >> >> >> Hello All- >> >> >> >> It appears that when running with the following specs, the procs= option does not actually work as expected. >> >> >> >> ========================================== >> >> >> >> #PBS -S /bin/bash >> >> #PBS -l procs=60 >> >> #PBS -l pmem=700mb >> >> #PBS -l walltime=744:00:00 >> >> #PBS -j oe >> >> #PBS -q batch >> >> >> >> torque version: tried 3.0.2. in v2.5.4, I think the procs option worked as documented >> >> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU) >> >> >> >> ========================================== >> >> >> >> If there are fewer then 60 processors available in the cluster (in this case there were 53 available) the job will go in an take whatever is left instead of waiting for all 60 processors to free up. Any thoughts as to why this might be happening? 
Sometimes it doesn't really matter and 53 would be almost as good as 60, however if only 2 processors are available and the user asks for 60, I would hate for him to go in. >> >> >> >> Thank you for your time! >> >> >> >> -Lance >> >> >> >> > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= =AGW7 -----END PGP SIGNATURE----- ------------------------------ Message: 4 Date: Fri, 18 Nov 2011 11:12:06 -0500 From: Lance Westerhoff Subject: Re: [torqueusers] procs= not working as documented To: Torque Users Mailing List Message-ID: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE at quantumbioinc.com> Content-Type: text/plain; charset=us-ascii Hi Steve- Here you go. Here is the top few lines of the job script. I have then provided the output you requested long with the maui.cfg. If you need anything further, certainly please let me know. Thanks for your help! 
=============== + head job.pbs #!/bin/bash #PBS -S /bin/bash #PBS -l procs=100 #PBS -l pmem=700mb #PBS -l walltime=744:00:00 #PBS -j oe #PBS -q batch Report run on Fri Nov 18 10:49:38 EST 2011 + pbsnodes --version version: 3.0.2 + diagnose --version maui client version 3.2.6p21 + checkjob 371010 checking job 371010 State: Running Creds: user:josh group:games class:batch qos:DEFAULT WallTime: 00:02:35 of 31:00:00:00 SubmitTime: Fri Nov 18 10:46:33 (Time Queued Total: 00:00:01 Eligible: 00:00:01) StartTime: Fri Nov 18 10:46:34 Total Tasks: 1 Req[0] TaskCount: 26 Partition: DEFAULT Network: [NONE] Memory >= 700M Disk >= 0 Swap >= 0 Opsys: [NONE] Arch: [NONE] Features: [NONE] Dedicated Resources Per Task: PROCS: 1 MEM: 700M NodeCount: 10 Allocated Nodes: [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3] [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2] [compute-0-13:2][compute-0-14:2] IWD: [NONE] Executable: [NONE] Bypass: 0 StartCount: 1 PartitionMask: [ALL] Flags: RESTARTABLE Reservation '371010' (-00:02:09 -> 30:23:57:51 Duration: 31:00:00:00) PE: 26.00 StartPriority: 4716 + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]" SERVERHOST gondor ADMIN1 maui root ADMIN3 ALL RMCFG[base] TYPE=PBS AMCFG[bank] TYPE=NONE RMPOLLINTERVAL 00:01:00 SERVERPORT 42559 SERVERMODE NORMAL LOGFILE maui.log LOGFILEMAXSIZE 10000000 LOGLEVEL 3 QUEUETIMEWEIGHT 1 FSPOLICY DEDICATEDPS FSDEPTH 7 FSINTERVAL 86400 FSDECAY 0.50 FSWEIGHT 200 FSUSERWEIGHT 1 FSGROUPWEIGHT 1000 FSQOSWEIGHT 1000 FSACCOUNTWEIGHT 1 FSCLASSWEIGHT 1000 USERWEIGHT 4 BACKFILLPOLICY FIRSTFIT RESERVATIONPOLICY CURRENTHIGHEST NODEALLOCATIONPOLICY MINRESOURCE RESERVATIONDEPTH 8 MAXJOBPERUSERPOLICY OFF MAXJOBPERUSERCOUNT 8 MAXPROCPERUSERPOLICY OFF MAXPROCPERUSERCOUNT 256 MAXPROCSECONDPERUSERPOLICY OFF MAXPROCSECONDPERUSERCOUNT 36864000 MAXJOBQUEUEDPERUSERPOLICY OFF MAXJOBQUEUEDPERUSERCOUNT 2 JOBNODEMATCHPOLICY EXACTNODE NODEACCESSPOLICY SHARED JOBMAXOVERRUN 99:00:00:00 DEFERCOUNT 8192 DEFERTIME 0 
CLASSCFG[developer] FSTARGET=40.00+ CLASSCFG[lowprio] PRIORITY=-1000 SRCFG[developer] CLASSLIST=developer SRCFG[developer] ACCESS=dedicated SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri SRCFG[developer] STARTTIME=08:00:00 SRCFG[developer] ENDTIME=18:00:00 SRCFG[developer] TIMELIMIT=2:00:00 SRCFG[developer] RESOURCES=PROCS(8) USERCFG[DEFAULT] FSTARGET=100.0 =============== -Lance On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > > On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: > >> The request that is placed is for procs=60. Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway). > > Hi Lance, > > Can you post the output of checkjob of an incorrectly running job. Let's take a look at what Maui thinks the job is asking for. > > Might as well add your maui.cfg file also. > > I've found in the past that procs= is troublesome... > >> >> I'm actually beginning to think the problem may be related to maui. Perhaps I'll post this same question to the maui list and see what comes back. >> >> This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system). > > Agreed. HPC cluster job management is normally be set it and forget it. Anything else other than maintenance/break fixes/new features would be ridiculously time consuming. 
>>> >>> >>> >>> -Lance >>> >>> >>> >>> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > ---------------------- > Steve Crusan > System Administrator > Center for Research Computing > University of Rochester > https://www.crc.rochester.edu/ > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG/MacGPG2 v2.0.17 (Darwin) > Comment: GPGTools - http://gpgtools.org > > iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ > bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 > cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ > tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 > JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv > Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= > =AGW7 > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers ------------------------------ _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers End of torqueusers Digest, Vol 88, Issue 16 ******************************************* From dbeer at adaptivecomputing.com Fri Nov 18 09:59:36 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Fri, 18 Nov 2011 09:59:36 -0700 (MST) Subject: [torqueusers] qrerun fails due to Unauthorized Request In-Reply-To: <90DC71AA-3F6F-452E-B53D-6BE89457AEE4@donders.ru.nl> Message-ID: Are the super user and or your user at that box managers on pbs_server? You would need manager privileges to qrerun a job. David ----- Original Message ----- > Dear torque users, > > I am trying to use qrerun in a shell script to deal with the > (potential) limit in available MATLAB licenses. 
Let me shortly > outline the idea before explaining the problem. > > I have a shell script that starts MATLAB with the "-r " > option for a MATLAB script. In case there is no license available, > MATLAB returns immediately with a descriptive error about the > license failure. I would like to catch that error and if it happens, > issue "qalter -h u JOBID" and "qrerun JOBID" to reschedule the job > for execution at a later time. Note that I am aware of the ability > to configure floating resources in moab, but I am using maui. > Furthermore, the floating resources for the Matlab license don't > optimally represent the license requirements for scheduling multiple > jobs by the same user on a multicore machine. Hence I prefer to use > qrerun instead of making the license a managed resource. > > The problem I run into can be summarized in the following snippet > from the command line. I schedule a simple job that subsequenty > starts running on one of the execution hosts: > > roboos at mentat001> echo sleep 1000 | qsub > 45254.dccn-l014.dccn.nl > > Then I try to use qrerun, first as regular user then as super user > (which I normally would not do of course): > > roboos at mentat001> qrerun 45254 > qrerun: Unauthorized Request 45254.dccn-l014.dccn.nl > roboos at mentat001> sudo qrerun 45254 > qrerun: Unauthorized Request MSG=operation not permitted > 45254.dccn-l014.dccn.nl > > So as root/administrative user I am also not allowed to do it from > the client machine. I am able to log in directly on the torque > server, where as regular user I am also not allowed to qrerun. It is > not a general failure of qrerun, since the the root user on the > torque server is allowed to use it: > > roboos at mentat001> ssh torque > roboos at torque> qrerun 45254 > qrerun: Unauthorized Request 45254.dccn-l014.dccn.nl > roboos at torque> sudo qrerun 45254 > > after which the job is correctly requeued and starts over again. 
> > To provide some info from the log files: as regular user I get the > following message in /var/spool/torque/server_logs > > 11/16/2011 09:36:55;0080;PBS_Server;Req;req_reject;Reject reply > code=15018(Request invalid for state of job), aux=0, type=RerunJob, > from roboos at mentat001.dccn.nl > > and as root on the torque server I get > > 11/16/2011 09:38:12;0080;PBS_Server;Req;req_reject;Reject reply > code=15018(Request invalid for state of job), aux=0, type=RerunJob, > from root at dccn-l014.dccn.nl > > The log mesaage is basically the same. In the log message on the > execution host I cannot find anything that pertains to the failed > qrerun request. > > Does anyone have an idea on what might be the problem for the regular > user not being allowed to restart the job? I tried the same thing on > a different torque cluster (not managed by me) that I have access > to, and also there it failed. > > > best regards, > Robert > > > > ----------------------------------------------------------- > Robert Oostenveld, PhD > Senior Researcher & MEG Physicist > Donders Institute for Brain, Cognition and Behaviour > Centre for Cognitive Neuroimaging > Radboud University Nijmegen > tel.: +31 (0)24 3619695 > e-mail: r.oostenveld at donders.ru.nl > web: http://www.ru.nl/neuroimaging > skype: r.oostenveld > ----------------------------------------------------------- > > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From scrusan at ur.rochester.edu Fri Nov 18 12:52:04 2011 From: scrusan at ur.rochester.edu (Steve Crusan) Date: Fri, 18 Nov 2011 14:52:04 -0500 Subject: [torqueusers] procs= not working as documented In-Reply-To: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE@quantumbioinc.com> References: 
<1932F66F-B18D-45F0-9BFE-E99EB7613BDE@quantumbioinc.com> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Lance, I am not sure about this one, and I think this might be a bug in Maui. Now, just to be clear, this only happens if there are not enough resources available? Regardless of your fairshare weights, or node access policies (which look fine, btw), this job should NOT start unless the resources are available. What happens if you pause (stop) the scheduler, and just run the job manually with qrun? It would be interesting to see how TORQUE is interpreting the job versus Maui. In the example below, Maui seemed to think the job only required 26 tasks, and scheduled them appropriately ( if 26 procs were the true request). Also check your $PBS_NODES file to see what TORQUE is writing to your environment. I know you can dive deep into the Maui logs by turning up the debug level, but let's start somewhere closer... PS: I cannot reproduce this, but I'm running Moab 6.0.4 + TORQUE 2.5.6. ~Steve On Nov 18, 2011, at 11:12 AM, Lance Westerhoff wrote: > > Hi Steve- > > Here you go. Here is the top few lines of the job script. I have then provided the output you requested long with the maui.cfg. If you need anything further, certainly please let me know. > > Thanks for your help! 
> > =============== > > + head job.pbs > > #!/bin/bash > #PBS -S /bin/bash > #PBS -l procs=100 > #PBS -l pmem=700mb > #PBS -l walltime=744:00:00 > #PBS -j oe > #PBS -q batch > > Report run on Fri Nov 18 10:49:38 EST 2011 > + pbsnodes --version > version: 3.0.2 > + diagnose --version > maui client version 3.2.6p21 > + checkjob 371010 > > > checking job 371010 > > State: Running > Creds: user:josh group:games class:batch qos:DEFAULT > WallTime: 00:02:35 of 31:00:00:00 > SubmitTime: Fri Nov 18 10:46:33 > (Time Queued Total: 00:00:01 Eligible: 00:00:01) > > StartTime: Fri Nov 18 10:46:34 > Total Tasks: 1 > > Req[0] TaskCount: 26 Partition: DEFAULT > Network: [NONE] Memory >= 700M Disk >= 0 Swap >= 0 > Opsys: [NONE] Arch: [NONE] Features: [NONE] > Dedicated Resources Per Task: PROCS: 1 MEM: 700M > NodeCount: 10 > Allocated Nodes: > [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3] > [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2] > [compute-0-13:2][compute-0-14:2] > > > IWD: [NONE] Executable: [NONE] > Bypass: 0 StartCount: 1 > PartitionMask: [ALL] > Flags: RESTARTABLE > > Reservation '371010' (-00:02:09 -> 30:23:57:51 Duration: 31:00:00:00) > PE: 26.00 StartPriority: 4716 > > + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]" > SERVERHOST gondor > ADMIN1 maui root > ADMIN3 ALL > RMCFG[base] TYPE=PBS > AMCFG[bank] TYPE=NONE > RMPOLLINTERVAL 00:01:00 > SERVERPORT 42559 > SERVERMODE NORMAL > LOGFILE maui.log > LOGFILEMAXSIZE 10000000 > LOGLEVEL 3 > QUEUETIMEWEIGHT 1 > FSPOLICY DEDICATEDPS > FSDEPTH 7 > FSINTERVAL 86400 > FSDECAY 0.50 > FSWEIGHT 200 > FSUSERWEIGHT 1 > FSGROUPWEIGHT 1000 > FSQOSWEIGHT 1000 > FSACCOUNTWEIGHT 1 > FSCLASSWEIGHT 1000 > USERWEIGHT 4 > BACKFILLPOLICY FIRSTFIT > RESERVATIONPOLICY CURRENTHIGHEST > NODEALLOCATIONPOLICY MINRESOURCE > RESERVATIONDEPTH 8 > MAXJOBPERUSERPOLICY OFF > MAXJOBPERUSERCOUNT 8 > MAXPROCPERUSERPOLICY OFF > MAXPROCPERUSERCOUNT 256 > MAXPROCSECONDPERUSERPOLICY OFF > MAXPROCSECONDPERUSERCOUNT 
36864000 > MAXJOBQUEUEDPERUSERPOLICY OFF > MAXJOBQUEUEDPERUSERCOUNT 2 > JOBNODEMATCHPOLICY EXACTNODE > NODEACCESSPOLICY SHARED > JOBMAXOVERRUN 99:00:00:00 > DEFERCOUNT 8192 > DEFERTIME 0 > CLASSCFG[developer] FSTARGET=40.00+ > CLASSCFG[lowprio] PRIORITY=-1000 > SRCFG[developer] CLASSLIST=developer > SRCFG[developer] ACCESS=dedicated > SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri > SRCFG[developer] STARTTIME=08:00:00 > SRCFG[developer] ENDTIME=18:00:00 > SRCFG[developer] TIMELIMIT=2:00:00 > SRCFG[developer] RESOURCES=PROCS(8) > USERCFG[DEFAULT] FSTARGET=100.0 > > =============== > > -Lance > > > On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> >> On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: >> >>> The request that is placed is for procs=60. Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway). >> >> Hi Lance, >> >> Can you post the output of checkjob of an incorrectly running job. Let's take a look at what Maui thinks the job is asking for. >> >> Might as well add your maui.cfg file also. >> >> I've found in the past that procs= is troublesome... >> >>> >>> I'm actually beginning to think the problem may be related to maui. Perhaps I'll post this same question to the maui list and see what comes back. >>> >>> This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system). 
>>>> >>>> >>>> >>>> -Lance >>>> >>>> >>>> >>>> >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> ---------------------- >> Steve Crusan >> System Administrator >> Center for Research Computing >> University of Rochester >> https://www.crc.rochester.edu/ >> >> >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG/MacGPG2 v2.0.17 (Darwin) >> Comment: GPGTools - http://gpgtools.org >> >> iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ >> bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 >> cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ >> tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 >> JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv >> Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= >> =AGW7 >> -----END PGP SIGNATURE----- >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJOxrdrAAoJENS19LGOpgqKDuwIAIfU+NDyoUfD3TGcQ0ol2JJV DXjvHd2ci3nJZaX28XEreQfOhrSA9GTTG1/x+wlwj+PdeNXXNedumnkZnSFQ38yp mwArdbSuby3fAjO11qrsqL34u5LV0FYtMpLpA2ibYcHEHS7L6eLvYUNYLp5DPEB6 qbHHDINw086Jcf9qzPfbggjWfxp63mil5Az14JSv3nWm2p+KaEommocN6I5lJycr DOs7V0ejZjys+F5ZRbzc2DDQzHVgEwTf6f0itW5ZNQy0BdnlLXBhOimGijpfaV7T Hqug/ljzZ3Z229uEA8JUvS7/bnP9+QfPRacKEAC5t4xjHS1vnWOXlDbFP4fUskQ= =avg0 -----END PGP SIGNATURE----- From 
Adrian.Sevcenco at cern.ch Sun Nov 20 02:10:35 2011 From: Adrian.Sevcenco at cern.ch (Adrian Sevcenco) Date: Sun, 20 Nov 2011 11:10:35 +0200 Subject: [torqueusers] reporting of max cpus found available (not offline) Message-ID: <4EC8C40B.2050002@cern.ch> Hi! i see that qstat at he end gives a very useful information about the available (online used and unused) cores. there is no reporting command that would report this (only what is configured) Is there a way or can someone point me to relevant portion of API that would give me exactly this? Thanks! Adrian -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 1984 bytes Desc: S/MIME Cryptographic Signature Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20111120/47847fa5/attachment.bin From cholam20 at yahoo.co.in Mon Nov 21 11:30:24 2011 From: cholam20 at yahoo.co.in (revathi ganesh) Date: Tue, 22 Nov 2011 00:00:24 +0530 (IST) Subject: [torqueusers] I AM FREE NOW... Message-ID: <1321900224.61957.androidMobile@web137308.mail.in.yahoo.com>


-------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111122/37245d67/attachment-0001.html From dbeer at adaptivecomputing.com Mon Nov 21 12:06:02 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 21 Nov 2011 12:06:02 -0700 (MST) Subject: [torqueusers] Job Nanny Poll In-Reply-To: Message-ID: <2d4d22d1-4c4b-4846-8bdd-f498466998c4@mail> All, Just a quick poll question - do people use the job delete nanny functionality in TORQUE? If you do, in qmgr you would have the line: set job_nanny = True I'm curious how many people are using it - this seems like very repetitive functionality to me (pbs_mom does pretty much the same thing already) and I personally think job_force_cancel_time is better, but I may be biased. -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From lloyd_brown at byu.edu Mon Nov 21 12:11:02 2011 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Mon, 21 Nov 2011 12:11:02 -0700 Subject: [torqueusers] Job Nanny Poll In-Reply-To: <2d4d22d1-4c4b-4846-8bdd-f498466998c4@mail> References: <2d4d22d1-4c4b-4846-8bdd-f498466998c4@mail> Message-ID: <4ECAA246.3020408@byu.edu> David, We currently have job_nanny set in our qmgr. I don't remember many of the details, but we were struggling a long time ago with some of the issues related to stray jobs not being deleted, and someone at Adaptive (or maybe it was still CRI at the time), recommended we set it, as an additional check. If there's a better way to do it, that's fine, but many of us try really hard not to change things once they're working. That and we're not aware of the new features. I mean according to the changelog, that parameter was added in 2.4.9, and I think we've been using job_nanny much longer than that. I wasn't aware of it until your email. 
Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 11/21/2011 12:06 PM, David Beer wrote: > All, > > Just a quick poll question - do people use the job delete nanny functionality in TORQUE? If you do, in qmgr you would have the line: > > set job_nanny = True > > I'm curious how many people are using it - this seems like very repetitive functionality to me (pbs_mom does pretty much the same thing already) and I personally think job_force_cancel_time is better, but I may be biased. > From rjacobi at email.arizona.edu Sat Nov 19 18:08:13 2011 From: rjacobi at email.arizona.edu (Robert Jacobi) Date: Sat, 19 Nov 2011 18:08:13 -0700 Subject: [torqueusers] Torque deletes all jobs of user on same node Message-ID: Hello All, We've recently run into a curious problem with torque. When a user deletes one of his jobs using "qdel jobid", and that job spans more than one processor on the node, then all other jobs of the same user on the same node are canceled as well. If the deleted job only runs on one processor, then the other jobs of the user on the node are not affected and keep running. Thus it seems to me that whenever the pbs mom on the node has to delete from more than one processor, it somehow indiscriminately tries to delete processes from all processors, and other users' jobs might be unaffected only because of the lack of privileges over other users' processes. At this point I've no clue how to further diagnose or solve this issue. I've tried to google this problem but couldn't find anything, so I hope you have an idea. Thank You, Robert -- Robert Jacobi Research Assistant University of Arizona Department of Aerospace & Mechanical Engineering 1130 N. Mountain Ave. Tucson, AZ, 85721-0119 tel: +1 (520) 621 4369 mail: rjacobi at email.arizona.edu The less time you spent on algebra in life, the more time you have to be a happy person. (Kerschen) Doubt is not a pleasant condition, but certainty is absurd.
(Voltaire) All great truths begin as blasphemies. (Shaw) Thinking is something that follows difficulties and precedes action. (Brecht) From biswas.koushik at gmail.com Sun Nov 20 15:18:54 2011 From: biswas.koushik at gmail.com (Koushik Biswas) Date: Sun, 20 Nov 2011 17:18:54 -0500 Subject: [torqueusers] Jobs stays in queue unless forced using qrun Message-ID: I have installed torque 2.5.5. The server, pbs_sched and moms seem to be running OK. However, after submitting a job the darn thing does not start to run. I am just testing with 2 nodes and my jobscript requests only 1 node. qrun will work. I have checked the usual things: qstat -f, qmgr -c "p s" etc. I don't see anything unusual. The server_logs does say job submitted by user. But the sched_logs does not say anything beyond the first 3 lines that say log opened, accounting opened, and startup id. I am puzzled mostly because I am ignorant about this! There are quite a few posts on this where users say jobs stay in queue, but I didn't see anything that could help my case. Clearly, my pbs_sched scheduler is not "talking" with the server or vice versa. (server_logs does not say anything about that though!) Best, Koushik From glen.beane at gmail.com Mon Nov 21 12:15:14 2011 From: glen.beane at gmail.com (Glen Beane) Date: Mon, 21 Nov 2011 14:15:14 -0500 Subject: [torqueusers] Job Nanny Poll In-Reply-To: <2d4d22d1-4c4b-4846-8bdd-f498466998c4@mail> References: <2d4d22d1-4c4b-4846-8bdd-f498466998c4@mail> Message-ID: On Mon, Nov 21, 2011 at 2:06 PM, David Beer wrote: > All, > > Just a quick poll question - do people use the job delete nanny functionality in TORQUE?
If you do, in qmgr you would have the line: > > set job_nanny = True > > I'm curious how many people are using it - this seems like very repetitive functionality to me (pbs_mom does pretty much the same thing already) and I personally think job_force_cancel_time is better, but I may be biased. I use the job delete nanny, but I am not familiar with job_force_cancel_time. I have been using the job delete nanny for a long time. What exactly does it do? I presume some of the multi-threading in pbs_server in TORQUE 4.0 can clean up some of this code a little since pbs_server can spawn a thread to hang out and manage the job delete (rather than needing to set a work task to check the status of the delete in the future) From stevejones at stanford.edu Mon Nov 21 12:22:00 2011 From: stevejones at stanford.edu (Steve Jones) Date: Mon, 21 Nov 2011 11:22:00 -0800 (PST) Subject: [torqueusers] Job Nanny Poll In-Reply-To: Message-ID: <758213214.389515.1321903320926.JavaMail.root@zm09.stanford.edu> ----- Original Message ----- > On Mon, Nov 21, 2011 at 2:06 PM, David Beer > wrote: > > All, > > > > Just a quick poll question - do people use the job delete nanny > > functionality in TORQUE? If you do, in qmgr you would have the line: > > > > set job_nanny = True > > > > I'm curious how many people are using it - this seems like very > > repetitive functionality to me (pbs_mom does pretty much the same > > thing already) and I personally think job_force_cancel_time is > > better, but I may be biased. > > > I use the job delete nanny, but I am not familiar with > job_force_cancel_time. I have been using the job delete nanny for a > long time. > > What exactly does it do? 
I presume some of the multi-threading in > pbs_server in TORQUE 4.0 can clean up some of this code a little since > pbs_server can spawn a thread to hang out and manage the job delete > (rather than needing to set a work task to check the status of the > delete in the future) I'm also using job_nanny, this is the first I've heard of job_force_cancel_time. Following a quick search it looks like it might have been undocumented for a short period. So it takes an int but there's not a recommended value, maybe 300? Is this the automation of 'qdel -p [jobid]' or similar for jobs *stuck* when a node stops responding? Steve From dbeer at adaptivecomputing.com Mon Nov 21 13:55:12 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 21 Nov 2011 13:55:12 -0700 (MST) Subject: [torqueusers] Job Nanny Poll In-Reply-To: <758213214.389515.1321903320926.JavaMail.root@zm09.stanford.edu> Message-ID: <2eb52e5d-70b7-4321-bae5-3edbb7237288@mail> ----- Original Message ----- > > > ----- Original Message ----- > > On Mon, Nov 21, 2011 at 2:06 PM, David Beer > > wrote: > > > All, > > > > > > Just a quick poll question - do people use the job delete nanny > > > functionality in TORQUE? If you do, in qmgr you would have the > > > line: > > > > > > set job_nanny = True > > > > > > I'm curious how many people are using it - this seems like very > > > repetitive functionality to me (pbs_mom does pretty much the same > > > thing already) and I personally think job_force_cancel_time is > > > better, but I may be biased. > > > > > > I use the job delete nanny, but I am not familiar with > > job_force_cancel_time. I have been using the job delete nanny for a > > long time. > > > > What exactly does it do? 
I presume some of the multi-threading in > > pbs_server in TORQUE 4.0 can clean up some of this code a little > > since > > pbs_server can spawn a thread to hang out and manage the job delete > > (rather than needing to set a work task to check the status of the > > delete in the future) > > I'm also using job_nanny, this is the first I've heard of > job_force_cancel_time. Following a quick search it looks like it > might have been undocumented for a short period. So it takes an int > but there's not a recommended value, maybe 300? Is this the > automation of 'qdel -p [jobid]' or similar for jobs *stuck* when a > node stops responding? > > Steve > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > job_force_cancel_time is different from the nanny (some sites may wish to do both, even). I implemented job_force_cancel_time at the request of a customer for this use case: - Multi-node job is running - mother superior goes down - Moab, pbs_server or an admin tries to delete the job - the job cannot be deleted because pbs_server cannot talk to mother superior. job_force_cancel_time, as you suspected, purges the job (qdel -p) if it still exists after the configured number of seconds. I think preferences for how long that should be will probably be somewhat different from site to site, but if I were setting the parameter, I would think about how long I think it could reasonably take to delete a job and perhaps multiply that by 3 or 4 and set it to that value. Something along those lines. 
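As a rough sketch of the setup discussed above: the commands below enable both the delete nanny and the forced purge through qmgr. The value of 300 seconds is only the figure floated earlier in this thread and is an assumption, not a recommendation; following the rule of thumb above, a site would pick roughly 3 to 4 times its own worst reasonable job-delete time. The commands need manager privileges on pbs_server.

```shell
# Keep the job delete nanny enabled (the "set job_nanny = True" line from the poll).
qmgr -c "set server job_nanny = True"

# Purge a job (as "qdel -p" would) if it still exists this many seconds after
# a delete request. 300 is an assumed example value, not a recommended default.
qmgr -c "set server job_force_cancel_time = 300"

# Check what the server now has configured.
qmgr -c "print server" | grep -E "job_nanny|job_force_cancel_time"
```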
-- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From dbeer at adaptivecomputing.com Mon Nov 21 14:01:48 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 21 Nov 2011 14:01:48 -0700 (MST) Subject: [torqueusers] Job Nanny Poll In-Reply-To: <4ECAA3E4.9020308@calculquebec.ca> Message-ID: <0ce6cf7f-d96f-4cde-8b5e-26d4a960e162@mail> ----- Original Message ----- > Hi David, > > Just a quick poll question - do people use the job delete nanny > > functionality in TORQUE? If you do, in qmgr you would have the > > line: > > > > set job_nanny = True > > > > I'm curious how many people are using it - this seems like very > > repetitive functionality to me (pbs_mom does pretty much the same > > thing already) and I personally think job_force_cancel_time is > > better, but I may be biased. > > > > We are using it on all our clusters to avoid mail flooding. How does > pbs_mom do the same thing? I had never heard of job_force_cancel_time > before. I guess that you want the job to be cancelled before there is > any mail flooding, don't you? > I was mistaken when I was saying pbs_mom does the same thing. I was actually thinking of another server parameter that does a similar thing - kill_delay, and somehow I was thinking the mom does this by default. We actually don't recommend using kill_delay because it is difficult to set up correctly (most people forget to make the shell catch the signal as well, so their job dies unexpectedly). I guess one of the things I'm thinking is that it is almost always communication problem if pbs_server can't delete a job, so I think job_force_cancel_time is better. However, I recognize that this may well not be the case for everyone. 
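The point about kill_delay and the shell catching the signal can be illustrated with a small job-script sketch. It assumes the job script is run by bash (or another POSIX shell) and that the signal to catch is SIGTERM, which is followed by SIGKILL once kill_delay expires. The function names on_term and run_with_term_trap are made up for this example and are not part of TORQUE.

```shell
#!/bin/bash
# Hypothetical helpers for a job script; names are illustrative only.

# Trap handler: note that SIGTERM arrived and pass it on to the payload.
on_term() {
    term_seen=1
    if [ -n "$payload_pid" ]; then
        kill -TERM "$payload_pid" 2>/dev/null
    fi
}

# Run the given command in the background with a SIGTERM trap installed,
# so the shell itself survives the signal and can clean up afterwards.
run_with_term_trap() {
    term_seen=0
    trap on_term TERM
    "$@" &
    payload_pid=$!
    wait "$payload_pid"          # returns early if the trap fires
    status=$?
    if [ "$term_seen" -eq 1 ]; then
        wait "$payload_pid"      # reap the payload and pick up its real status
        status=$?
        echo "SIGTERM received: payload stopped, cleaning up"
        # site-specific cleanup (copying results, removing scratch) would go here
    fi
    trap - TERM
    return "$status"
}
```

In a job script this would be used as, for example, `run_with_term_trap ./my_program` (my_program being a placeholder). Without the trap, the shell exits on the first SIGTERM and no cleanup gets a chance to run before the job is gone.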
-- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From dbeer at adaptivecomputing.com Mon Nov 21 14:04:16 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 21 Nov 2011 14:04:16 -0700 (MST) Subject: [torqueusers] Job Nanny Poll In-Reply-To: <4ECAA246.3020408@byu.edu> Message-ID: <704ce4d5-2919-4249-8657-93648d9adafc@mail> ----- Original Message ----- > David, > > We currently have job_nanny set in our qmgr. I don't remember many > of > the details, but we were struggling a long time ago with some of the > issues related to stray jobs not being deleted, and someone at > Adaptive > (or maybe it was still CRI at the time), recommended we set it, as an > additional check. If there's a better way to do it, that's fine, but > many of us try really hard not to change things once they're working. > > That and we're not aware of the new features. I mean according to > the > changelog, that parameter was added in 2.4.9, and I think we've been > using job_nanny much longer than that. I wasn't aware of it until > your > email. > > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > On 11/21/2011 12:06 PM, David Beer wrote: > > All, > > > > Just a quick poll question - do people use the job delete nanny > > functionality in TORQUE? If you do, in qmgr you would have the > > line: > > > > set job_nanny = True > > > > I'm curious how many people are using it - this seems like very > > repetitive functionality to me (pbs_mom does pretty much the same > > thing already) and I personally think job_force_cancel_time is > > better, but I may be biased. > > > Allow me to re-ask my question in a different way - what is the use case for which you all are using the job_nanny feature? Just to dispel any fears, I'm only asking this for my own curiosity. 
I'm working on something that has me looking at that code and I'm just curious if people use it and what they use it for. This code is not being deleted or removed. -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From stevejones at stanford.edu Mon Nov 21 14:24:57 2011 From: stevejones at stanford.edu (Steve Jones) Date: Mon, 21 Nov 2011 13:24:57 -0800 (PST) Subject: [torqueusers] Job Nanny Poll In-Reply-To: <704ce4d5-2919-4249-8657-93648d9adafc@mail> Message-ID: <2024056238.399663.1321910697253.JavaMail.root@zm09.stanford.edu> > Allow me to re-ask my question in a different way - what is the use > case for which you all are using the job_nanny feature? > > Just to dispel any fears, I'm only asking this for my own curiosity. > I'm working on something that has me looking at that code and I'm just > curious if people use it and what they use it for. This code is not > being deleted or removed. We implemented it in the hope of a better job cleanup and removal strategy, expecting it to keep sending KILL signals. What we've found is that when a node completely stops responding, the job *hangs* in the queue in a cancel-in-progress state. job_force_cancel_time should help with this; I'm thinking we'll implement it with a time of 5 minutes or so to allow temporary node failures to resolve themselves. I'm planning on using both. In a general sense I still have issues with processes left behind on compute nodes here and there. We're also using mom_job_sync, epilogue and epilogue.parallel scripts, all in an effort to kill unassigned (ghost) processes. I'd like to see more examples of how people are dealing with this.
Steve From glen.beane at gmail.com Mon Nov 21 14:32:07 2011 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Mon, 21 Nov 2011 16:32:07 -0500 Subject: [torqueusers] Job Nanny Poll In-Reply-To: <2eb52e5d-70b7-4321-bae5-3edbb7237288@mail> References: <2eb52e5d-70b7-4321-bae5-3edbb7237288@mail> Message-ID: <064CBFA1-33EA-447C-B325-E9E94423E2B6@gmail.com> On Nov 21, 2011, at 3:55 PM, David Beer wrote: > > > ----- Original Message ----- >> >> >> ----- Original Message ----- >>> On Mon, Nov 21, 2011 at 2:06 PM, David Beer >>> wrote: >>>> All, >>>> >>>> Just a quick poll question - do people use the job delete nanny >>>> functionality in TORQUE? If you do, in qmgr you would have the >>>> line: >>>> >>>> set job_nanny = True >>>> >>>> I'm curious how many people are using it - this seems like very >>>> repetitive functionality to me (pbs_mom does pretty much the same >>>> thing already) and I personally think job_force_cancel_time is >>>> better, but I may be biased. >>> >>> >>> I use the job delete nanny, but I am not familiar with >>> job_force_cancel_time. I have been using the job delete nanny for a >>> long time. >>> >>> What exactly does it do? I presume some of the multi-threading in >>> pbs_server in TORQUE 4.0 can clean up some of this code a little >>> since >>> pbs_server can spawn a thread to hang out and manage the job delete >>> (rather than needing to set a work task to check the status of the >>> delete in the future) >> >> I'm also using job_nanny, this is the first I've heard of >> job_force_cancel_time. Following a quick search it looks like it >> might have been undocumented for a short period. So it takes an int >> but there's not a recommended value, maybe 300? Is this the >> automation of 'qdel -p [jobid]' or similar for jobs *stuck* when a >> node stops responding? 
>> >> Steve >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > job_force_cancel_time is different from the nanny (some sites may wish to do both, even). I implemented job_force_cancel_time at the request of a customer for this use case: > > - Multi-node job is running > - mother superior goes down > - Moab, pbs_server or an admin tries to delete the job > - the job cannot be deleted because pbs_server cannot talk to mother superior. > > job_force_cancel_time, as you suspected, purges the job (qdel -p) if it still exists after the configured number of seconds. I think preferences for how long that should be will probably be somewhat different from site to site, but if I were setting the parameter, I would think about how long I think it could reasonably take to delete a job and perhaps multiply that by 3 or 4 and set it to that value. Something along those lines. > I would rather have the job requeued (if it is rerunnable) than just purged > -- > David Beer > Direct Line: 801-717-3386 | Fax: 801-717-3738 > Adaptive Computing > 1712 S East Bay Blvd, Suite 300 > Provo, UT 84606 > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From glen.beane at gmail.com Mon Nov 21 14:39:57 2011 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Mon, 21 Nov 2011 16:39:57 -0500 Subject: [torqueusers] Job Nanny Poll In-Reply-To: <704ce4d5-2919-4249-8657-93648d9adafc@mail> References: <704ce4d5-2919-4249-8657-93648d9adafc@mail> Message-ID: <33C84AB4-0EEB-4BDD-9713-4E4B31776D34@gmail.com> On Nov 21, 2011, at 4:04 PM, David Beer wrote: > > > ----- Original Message ----- >> David, >> >> We currently have job_nanny set in our qmgr. 
I don't remember many >> of >> the details, but we were struggling a long time ago with some of the >> issues related to stray jobs not being deleted, and someone at >> Adaptive >> (or maybe it was still CRI at the time), recommended we set it, as an >> additional check. If there's a better way to do it, that's fine, but >> many of us try really hard not to change things once they're working. >> >> That and we're not aware of the new features. I mean according to >> the >> changelog, that parameter was added in 2.4.9, and I think we've been >> using job_nanny much longer than that. I wasn't aware of it until >> your >> email. >> >> Lloyd Brown >> Systems Administrator >> Fulton Supercomputing Lab >> Brigham Young University >> http://marylou.byu.edu >> >> On 11/21/2011 12:06 PM, David Beer wrote: >>> All, >>> >>> Just a quick poll question - do people use the job delete nanny >>> functionality in TORQUE? If you do, in qmgr you would have the >>> line: >>> >>> set job_nanny = True >>> >>> I'm curious how many people are using it - this seems like very >>> repetitive functionality to me (pbs_mom does pretty much the same >>> thing already) and I personally think job_force_cancel_time is >>> better, but I may be biased. >>> >> > > Allow me to re-ask my question in a different way - what is the use case for which you all are using the job_nanny feature? > > Just to dispel any fears, I'm only asking this for my own curiosity. I'm working on something that has me looking at that code and I'm just curious if people use it and what they use it for. This code is not being deleted or removed. 
> I think the use case is that the mom doesn't respond to the first delete request, but is actually still up. With TCP instead of UDP I don't see that this would be much of a problem. With a temporary failure or a dropped message I think this is better than a purge, since the normal termination/cleanup happens. > -- > David Beer > Direct Line: 801-717-3386 | Fax: 801-717-3738 > Adaptive Computing > 1712 S East Bay Blvd, Suite 300 > Provo, UT 84606 > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From r.oostenveld at donders.ru.nl Mon Nov 21 14:57:41 2011 From: r.oostenveld at donders.ru.nl (Robert Oostenveld) Date: Mon, 21 Nov 2011 22:57:41 +0100 Subject: [torqueusers] qrerun fails due to Unauthorized Request In-Reply-To: References: Message-ID: <27DF65CE-35F7-4747-B2A3-7454E9D863A8@donders.ru.nl> Dear David, No, my regular account is not a manager, but root is. I had expected qrerun to work for a user's own jobs, and I had expected it to work from a submit client, not only on the torque server. I cannot make all ~200 regular users managers (imagine one of them doing "qdel all" on his runaway jobs), and I don't want to give them access to the server. My conclusion at the moment is that qrerun cannot be used to restart a matlab job at a later time in case the licences run out. Is there an overview somewhere of which commands the regular user can use? The man pages don't provide this information, and the error message is not very informative. The qrerun command and many others are installed by the torque-clients package in the bin directory; would it not be more appropriate to install it in sbin and only install it with the torque-server package? best regards, Robert On 18 Nov 2011, at 17:59, David Beer wrote: > Are the super user and or your user at that box managers on pbs_server? You would need manager privileges to qrerun a job.
> > David > > ----- Original Message ----- >> Dear torque users, >> >> I am trying to use qrerun in a shell script to deal with the >> (potential) limit in available MATLAB licenses. Let me shortly >> outline the idea before explaining the problem. >> >> I have a shell script that starts MATLAB with the "-r " >> option for a MATLAB script. In case there is no license available, >> MATLAB returns immediately with a descriptive error about the >> license failure. I would like to catch that error and if it happens, >> issue "qalter -h u JOBID" and "qrerun JOBID" to reschedule the job >> for execution at a later time. Note that I am aware of the ability >> to configure floating resources in moab, but I am using maui. >> Furthermore, the floating resources for the Matlab license don't >> optimally represent the license requirements for scheduling multiple >> jobs by the same user on a multicore machine. Hence I prefer to use >> qrerun instead of making the license a managed resource. >> >> The problem I run into can be summarized in the following snippet >> from the command line. I schedule a simple job that subsequenty >> starts running on one of the execution hosts: >> >> roboos at mentat001> echo sleep 1000 | qsub >> 45254.dccn-l014.dccn.nl >> >> Then I try to use qrerun, first as regular user then as super user >> (which I normally would not do of course): >> >> roboos at mentat001> qrerun 45254 >> qrerun: Unauthorized Request 45254.dccn-l014.dccn.nl >> roboos at mentat001> sudo qrerun 45254 >> qrerun: Unauthorized Request MSG=operation not permitted >> 45254.dccn-l014.dccn.nl >> >> So as root/administrative user I am also not allowed to do it from >> the client machine. I am able to log in directly on the torque >> server, where as regular user I am also not allowed to qrerun. 
It is >> not a general failure of qrerun, since the the root user on the >> torque server is allowed to use it: >> >> roboos at mentat001> ssh torque >> roboos at torque> qrerun 45254 >> qrerun: Unauthorized Request 45254.dccn-l014.dccn.nl >> roboos at torque> sudo qrerun 45254 >> >> after which the job is correctly requeued and starts over again. >> >> To provide some info from the log files: as regular user I get the >> following message in /var/spool/torque/server_logs >> >> 11/16/2011 09:36:55;0080;PBS_Server;Req;req_reject;Reject reply >> code=15018(Request invalid for state of job), aux=0, type=RerunJob, >> from roboos at mentat001.dccn.nl >> >> and as root on the torque server I get >> >> 11/16/2011 09:38:12;0080;PBS_Server;Req;req_reject;Reject reply >> code=15018(Request invalid for state of job), aux=0, type=RerunJob, >> from root at dccn-l014.dccn.nl >> >> The log mesaage is basically the same. In the log message on the >> execution host I cannot find anything that pertains to the failed >> qrerun request. >> >> Does anyone have an idea on what might be the problem for the regular >> user not being allowed to restart the job? I tried the same thing on >> a different torque cluster (not managed by me) that I have access >> to, and also there it failed. 
>> >> >> best regards, >> Robert >> >> >> >> ----------------------------------------------------------- >> Robert Oostenveld, PhD >> Senior Researcher & MEG Physicist >> Donders Institute for Brain, Cognition and Behaviour >> Centre for Cognitive Neuroimaging >> Radboud University Nijmegen >> tel.: +31 (0)24 3619695 >> e-mail: r.oostenveld at donders.ru.nl >> web: http://www.ru.nl/neuroimaging >> skype: r.oostenveld >> ----------------------------------------------------------- >> >> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > -- > David Beer > Direct Line: 801-717-3386 | Fax: 801-717-3738 > Adaptive Computing > 1712 S East Bay Blvd, Suite 300 > Provo, UT 84606 > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From dbeer at adaptivecomputing.com Mon Nov 21 15:52:42 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 21 Nov 2011 15:52:42 -0700 (MST) Subject: [torqueusers] qrerun fails due to Unauthorized Request In-Reply-To: <27DF65CE-35F7-4747-B2A3-7454E9D863A8@donders.ru.nl> Message-ID: <4472098e-f48e-4cd1-8c7c-a244965abdb0@mail> ----- Original Message ----- > Dear David, > > No, my regular account is not a manager, but root is. I had expected > qrerun to work for a users' own jobs, and I had expected it to work > from a submit client, not only on the torque server. I cannot make > all ~200 regular users manager (imagine one of them doing "qdel all" > on his runaway jobs), and I don't want to give them access to the > server. My conclusion at the moment is that qrerun cannot be used to > restart a matlab job at a later time in case the licences run out. > This may not interest you, but there are schedulers (such as Moab) that manage licences for you and would prevent this case. 
> Is there an overview somewhere on what commands the regular user can > use? The man pages don't provide this information, and the error > message is not very informative. > I don't know of an overview that exists explaining what commands a user can run at different permission levels, but that is a good request that can be done. As far as the error message - it seems to be appropriate to me - Unauthorized Request seems to be the correct message for a command that a user doesn't have permission to run. > The qrerun command and many others are installed by the > torque-clients package in the bin directory; would it not be more > appropriate to install it in sbin and only install it with the > torque-server package? > AFAIK, the way it is intended (this was done a long time before I began working on TORQUE, possibly before even Adaptive/CR die) is that the client commands go in bin and the commands that can't ever be run by users go in the sbin directory. qrun also is in the bin directory, even though it also cannot be run unless you are a manager. Cheers, David > best regards, > Robert > > > > On 18 Nov 2011, at 17:59, David Beer wrote: > > > Are the super user and or your user at that box managers on > > pbs_server? You would need manager privileges to qrerun a job. > > > > David > > > > ----- Original Message ----- > >> Dear torque users, > >> > >> I am trying to use qrerun in a shell script to deal with the > >> (potential) limit in available MATLAB licenses. Let me shortly > >> outline the idea before explaining the problem. > >> > >> I have a shell script that starts MATLAB with the "-r " > >> option for a MATLAB script. In case there is no license available, > >> MATLAB returns immediately with a descriptive error about the > >> license failure. I would like to catch that error and if it > >> happens, > >> issue "qalter -h u JOBID" and "qrerun JOBID" to reschedule the job > >> for execution at a later time. 
Note that I am aware of the ability > >> to configure floating resources in moab, but I am using maui. > >> Furthermore, the floating resources for the Matlab license don't > >> optimally represent the license requirements for scheduling > >> multiple > >> jobs by the same user on a multicore machine. Hence I prefer to > >> use > >> qrerun instead of making the license a managed resource. > >> > >> The problem I run into can be summarized in the following snippet > >> from the command line. I schedule a simple job that subsequenty > >> starts running on one of the execution hosts: > >> > >> roboos at mentat001> echo sleep 1000 | qsub > >> 45254.dccn-l014.dccn.nl > >> > >> Then I try to use qrerun, first as regular user then as super user > >> (which I normally would not do of course): > >> > >> roboos at mentat001> qrerun 45254 > >> qrerun: Unauthorized Request 45254.dccn-l014.dccn.nl > >> roboos at mentat001> sudo qrerun 45254 > >> qrerun: Unauthorized Request MSG=operation not permitted > >> 45254.dccn-l014.dccn.nl > >> > >> So as root/administrative user I am also not allowed to do it from > >> the client machine. I am able to log in directly on the torque > >> server, where as regular user I am also not allowed to qrerun. It > >> is > >> not a general failure of qrerun, since the the root user on the > >> torque server is allowed to use it: > >> > >> roboos at mentat001> ssh torque > >> roboos at torque> qrerun 45254 > >> qrerun: Unauthorized Request 45254.dccn-l014.dccn.nl > >> roboos at torque> sudo qrerun 45254 > >> > >> after which the job is correctly requeued and starts over again. 
> >> > >> To provide some info from the log files: as regular user I get the > >> following message in /var/spool/torque/server_logs > >> > >> 11/16/2011 09:36:55;0080;PBS_Server;Req;req_reject;Reject reply > >> code=15018(Request invalid for state of job), aux=0, > >> type=RerunJob, > >> from roboos at mentat001.dccn.nl > >> > >> and as root on the torque server I get > >> > >> 11/16/2011 09:38:12;0080;PBS_Server;Req;req_reject;Reject reply > >> code=15018(Request invalid for state of job), aux=0, > >> type=RerunJob, > >> from root at dccn-l014.dccn.nl > >> > >> The log mesaage is basically the same. In the log message on the > >> execution host I cannot find anything that pertains to the failed > >> qrerun request. > >> > >> Does anyone have an idea on what might be the problem for the > >> regular > >> user not being allowed to restart the job? I tried the same thing > >> on > >> a different torque cluster (not managed by me) that I have access > >> to, and also there it failed. > >> > >> > >> best regards, > >> Robert > >> > >> > >> > >> ----------------------------------------------------------- > >> Robert Oostenveld, PhD > >> Senior Researcher & MEG Physicist > >> Donders Institute for Brain, Cognition and Behaviour > >> Centre for Cognitive Neuroimaging > >> Radboud University Nijmegen > >> tel.: +31 (0)24 3619695 > >> e-mail: r.oostenveld at donders.ru.nl > >> web: http://www.ru.nl/neuroimaging > >> skype: r.oostenveld > >> ----------------------------------------------------------- > >> > >> > >> > >> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > >> > > > > -- > > David Beer > > Direct Line: 801-717-3386 | Fax: 801-717-3738 > > Adaptive Computing > > 1712 S East Bay Blvd, Suite 300 > > Provo, UT 84606 > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org 
> > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From glen.beane at gmail.com Mon Nov 21 17:13:27 2011 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Mon, 21 Nov 2011 19:13:27 -0500 Subject: [torqueusers] qrerun fails due to Unauthorized Request In-Reply-To: <4472098e-f48e-4cd1-8c7c-a244965abdb0@mail> References: <4472098e-f48e-4cd1-8c7c-a244965abdb0@mail> Message-ID: <28D59051-AD11-48E9-B6FC-B915E7E9DA47@gmail.com> On Nov 21, 2011, at 5:52 PM, David Beer wrote: > > > ----- Original Message ----- >> Dear David, >> >> No, my regular account is not a manager, but root is. I had expected >> qrerun to work for a users' own jobs, and I had expected it to work >> from a submit client, not only on the torque server. I cannot make >> all ~200 regular users manager (imagine one of them doing "qdel all" >> on his runaway jobs), and I don't want to give them access to the >> server. My conclusion at the moment is that qrerun cannot be used to >> restart a matlab job at a later time in case the licences run out. >> > > This may not interest you, but there are schedulers (such as Moab) that manage licences for you and would prevent this case. > >> Is there an overview somewhere on what commands the regular user can >> use? The man pages don't provide this information, and the error >> message is not very informative. >> > > I don't know of an overview that exists explaining what commands a user can run at different permission levels, but that is a good request that can be done. As far as the error message - it seems to be appropriate to me - Unauthorized Request seems to be the correct message for a command that a user doesn't have permission to run. 
> >> The qrerun command and many others are installed by the >> torque-clients package in the bin directory; would it not be more >> appropriate to install it in sbin and only install it with the >> torque-server package? >> > > AFAIK, the way it is intended (this was done a long time before I began working on TORQUE, possibly before even Adaptive/CR die) is that the client commands go in bin and the commands that can't ever be run by users go in the sbin directory. qrun also is in the bin directory, even though it also cannot be run unless you are a manager. > /usr/local/bin is appropriate in my opinion. A torque manager/operator does not have to be a super user and as such the commands do not belong in sbin > Cheers, > > David > >> best regards, >> Robert >> >> >> >> On 18 Nov 2011, at 17:59, David Beer wrote: >> >>> Are the super user and or your user at that box managers on >>> pbs_server? You would need manager privileges to qrerun a job. >>> >>> David >>> >>> ----- Original Message ----- >>>> Dear torque users, >>>> >>>> I am trying to use qrerun in a shell script to deal with the >>>> (potential) limit in available MATLAB licenses. Let me shortly >>>> outline the idea before explaining the problem. >>>> >>>> I have a shell script that starts MATLAB with the "-r " >>>> option for a MATLAB script. In case there is no license available, >>>> MATLAB returns immediately with a descriptive error about the >>>> license failure. I would like to catch that error and if it >>>> happens, >>>> issue "qalter -h u JOBID" and "qrerun JOBID" to reschedule the job >>>> for execution at a later time. Note that I am aware of the ability >>>> to configure floating resources in moab, but I am using maui. >>>> Furthermore, the floating resources for the Matlab license don't >>>> optimally represent the license requirements for scheduling >>>> multiple >>>> jobs by the same user on a multicore machine. 
Hence I prefer to use qrerun instead of making the license a managed resource.
>>>>
>>>> The problem I run into can be summarized in the following snippet
>>>> from the command line. I schedule a simple job that subsequently
>>>> starts running on one of the execution hosts:
>>>>
>>>> roboos at mentat001> echo sleep 1000 | qsub
>>>> 45254.dccn-l014.dccn.nl
>>>>
>>>> Then I try to use qrerun, first as regular user then as super user
>>>> (which I normally would not do of course):
>>>>
>>>> roboos at mentat001> qrerun 45254
>>>> qrerun: Unauthorized Request 45254.dccn-l014.dccn.nl
>>>> roboos at mentat001> sudo qrerun 45254
>>>> qrerun: Unauthorized Request MSG=operation not permitted
>>>> 45254.dccn-l014.dccn.nl
>>>>
>>>> So as root/administrative user I am also not allowed to do it from
>>>> the client machine. I am able to log in directly on the torque
>>>> server, where as regular user I am also not allowed to qrerun. It is
>>>> not a general failure of qrerun, since the root user on the
>>>> torque server is allowed to use it:
>>>>
>>>> roboos at mentat001> ssh torque
>>>> roboos at torque> qrerun 45254
>>>> qrerun: Unauthorized Request 45254.dccn-l014.dccn.nl
>>>> roboos at torque> sudo qrerun 45254
>>>>
>>>> after which the job is correctly requeued and starts over again.
>>>>
>>>> To provide some info from the log files: as regular user I get the
>>>> following message in /var/spool/torque/server_logs
>>>>
>>>> 11/16/2011 09:36:55;0080;PBS_Server;Req;req_reject;Reject reply
>>>> code=15018(Request invalid for state of job), aux=0,
>>>> type=RerunJob,
>>>> from roboos at mentat001.dccn.nl
>>>>
>>>> and as root on the torque server I get
>>>>
>>>> 11/16/2011 09:38:12;0080;PBS_Server;Req;req_reject;Reject reply
>>>> code=15018(Request invalid for state of job), aux=0,
>>>> type=RerunJob,
>>>> from root at dccn-l014.dccn.nl
>>>>
>>>> The log message is basically the same.
In the log message on the >>>> execution host I cannot find anything that pertains to the failed >>>> qrerun request. >>>> >>>> Does anyone have an idea on what might be the problem for the >>>> regular >>>> user not being allowed to restart the job? I tried the same thing >>>> on >>>> a different torque cluster (not managed by me) that I have access >>>> to, and also there it failed. >>>> >>>> >>>> best regards, >>>> Robert >>>> >>>> >>>> >>>> ----------------------------------------------------------- >>>> Robert Oostenveld, PhD >>>> Senior Researcher & MEG Physicist >>>> Donders Institute for Brain, Cognition and Behaviour >>>> Centre for Cognitive Neuroimaging >>>> Radboud University Nijmegen >>>> tel.: +31 (0)24 3619695 >>>> e-mail: r.oostenveld at donders.ru.nl >>>> web: http://www.ru.nl/neuroimaging >>>> skype: r.oostenveld >>>> ----------------------------------------------------------- >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> torqueusers mailing list >>>> torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>> >>> >>> -- >>> David Beer >>> Direct Line: 801-717-3386 | Fax: 801-717-3738 >>> Adaptive Computing >>> 1712 S East Bay Blvd, Suite 300 >>> Provo, UT 84606 >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > -- > David Beer > Direct Line: 801-717-3386 | Fax: 801-717-3738 > Adaptive Computing > 1712 S East Bay Blvd, Suite 300 > Provo, UT 84606 > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From JMRUSHTON at qinetiq.com Tue Nov 22 03:49:48 2011 
From: JMRUSHTON at qinetiq.com (Rushton Martin)
Date: Tue, 22 Nov 2011 10:49:48 -0000
Subject: [torqueusers] UC Torque deletes all jobs of user on same node
In-Reply-To:
References:
Message-ID: <20111122104909.8887A83A802C@mail.adaptivecomputing.com>

Have a look to see if you are running epilogue scripts that clean up /dev/shm for you. We had exactly the same issue: the script naively assumes that a user has only one job on the node and removes all the files belonging to the user, which kills any other jobs. When running on just one system the files are not created. We have had to set the Moab configuration to include "NODEACCESSPOLICY UNIQUEUSER"; I don't know if Maui has the equivalent.

Regards,

Martin Rushton
HPC System Manager, Weapons Technologies
Tel: 01959 514777, Mobile: 07939 219057
email: jmrushton at QinetiQ.com
www.QinetiQ.com
QinetiQ - Delivering customer-focused solutions

Please consider the environment before printing this email.

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Robert Jacobi
Sent: 20 November 2011 01:08
To: torqueusers at supercluster.org
Subject: [torqueusers] Torque deletes all jobs of user on same node

Hello All,

We've recently run into a curious problem with torque. When a user deletes one of his jobs using "qdel jobid", and the job to be deleted spans more than one processor on the node, then all other jobs of the same user on the same node are canceled as well. If the deleted job only runs on one processor, then the other jobs of the user on the node are not affected and keep running. Thus it seems to me that whenever the pbs mom on the node has to delete from more than one processor it somehow indiscriminately tries to delete them from all processors, and other users' jobs might only be unaffected due to the lack of privileges over other users' processes.

At this point I've no clue how to further diagnose or solve this issue.
I've tried to google this problem but couldn't find anything, so I hope you have an idea.

Thank You,
Robert

--
Robert Jacobi
Research Assistant
University of Arizona
Department of Aerospace & Mechanical Engineering
1130 N. Mountain Ave.
Tucson, AZ, 85721-0119
tel: +1 (520) 621 4369
mail: rjacobi at email.arizona.edu

The less time you spent on algebra in life, the more time you have to be a happy person. (Kerschen)
Doubt is not a pleasant condition, but certainty is absurd. (Voltaire)
All great truths begin as blasphemies. (Shaw)
Thinking is something that follows difficulties and precedes action. (Brecht)

This email and any attachments to it may be confidential and are intended solely for the use of the individual to whom it is addressed. If you are not the intended recipient of this email, you must neither take any action based upon its contents, nor copy or show it to anyone. Please contact the sender if you believe you have received this email in error. QinetiQ may monitor email traffic data and also the content of email for the purposes of security. QinetiQ Limited (Registered in England & Wales: Company Number: 3796233) Registered office: Cody Technology Park, Ively Road, Farnborough, Hampshire, GU14 0LX http://www.qinetiq.com.

From RB.Ezhilalan at hse.ie Tue Nov 22 10:29:13 2011
From: RB.Ezhilalan at hse.ie (RB. Ezhilalan (Principal Physicist, CUH))
Date: Tue, 22 Nov 2011 17:29:13 -0000
Subject: [torqueusers] Parallel processing for MC code
In-Reply-To:
References:
Message-ID:

Hi Jason,

I am glad that Torque is now able to schedule jobs to two PCs. Thank you for steering me in the right direction. I defined available resources ....ncpus=2 and after that everything worked as I expected.
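[Editorial sketch] Elsewhere in this thread Jason notes that Torque schedules cores, not computers, unless the job's resource requirements say otherwise. The job script below is only an illustration of that idea, not what was actually run here: it asks for one core on each of two nodes with the standard "nodes=2:ppn=1" resource request and then reports how many distinct hosts the job received. The helper name count_hosts is made up for this example, and whether the two slots really land on different hosts depends on the scheduler and on the np values in server_priv/nodes.

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=1
#PBS -j oe

# count_hosts is a hypothetical helper: it counts the distinct host names in a
# Torque node file, which lists one line per allocated core.
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Torque sets PBS_NODEFILE inside a running job; outside a job this is skipped.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "hosts allocated: $(count_hosts "$PBS_NODEFILE")"
fi
```

If the job output reports a single host, both slots were packed onto one machine, which is the behaviour described in the original question.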
Regards,

Ezhilalan Ramalingam M.Sc.,DABR.,
Principal Physicist (Radiotherapy),
Medical Physics Department,
Cork University Hospital,
Wilton, Cork
Ireland
Tel. 00353 21 4922533
Fax.00353 21 4921300
Email: rb.ezhilalan at hse.ie

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of torqueusers-request at supercluster.org
Sent: 18 November 2011 16:45
To: torqueusers at supercluster.org
Subject: torqueusers Digest, Vol 88, Issue 17

Today's Topics:

   1. Re: torqueusers Digest, Vol 88, Issue 16
      (RB. Ezhilalan (Principal Physicist, CUH))

----------------------------------------------------------------------

Message: 1
Date: Fri, 18 Nov 2011 16:44:30 -0000
From: "RB. Ezhilalan (Principal Physicist, CUH)"
Subject: Re: [torqueusers] torqueusers Digest, Vol 88, Issue 16
To: torqueusers at supercluster.org
Message-ID:
Content-Type: TEXT/plain; charset="us-ascii"

Jason,

I had linux-01 np=1, linux-02 np=1 in the nodes file; despite this, the job ran on one core (linux-01) only. Then I removed the 'np' option from the list, on the assumption that the system would 'autodetect' the cores.

Ezhilalan

-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of torqueusers-request at supercluster.org
Sent: 18 November 2011 16:12
To: torqueusers at supercluster.org
Subject: torqueusers Digest, Vol 88, Issue 16

Today's Topics:

   1. Re: Parallel processing for MC code (Jason Bacon)
   2. Re: procs= not working as documented (Lance Westerhoff)
   3. Re: procs= not working as documented (Steve Crusan)
   4. Re: procs= not working as documented (Lance Westerhoff)

----------------------------------------------------------------------

Message: 1
Date: Fri, 18 Nov 2011 07:57:23 -0600
From: Jason Bacon
Subject: Re: [torqueusers] Parallel processing for MC code
To: Torque Users Mailing List
Message-ID: <4EC66443.3080608 at tds.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

I was only wondering if you had "np=2" in the Linux-01 entry, or if Torque was configured to autodetect the number of cores and there were two. That would have explained the scheduling behavior.

Regards,

-J

On 11/18/11 03:48, RB. Ezhilalan (Principal Physicist, CUH) wrote:
>
> Hi Jason,
>
> PC1 (linux-01) is a single core PC like PC2, I defined the
> server_priv/nodes file as;
>
> Linux-01
>
> Linux-02
>
> As you have mentioned may be resource requirement needs to be properly
> set up. Do you have any suggestions?
> > Many thanks, > > Ezhilalan > > -----Original Message----- > From: torqueusers-bounces at supercluster.org > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of > torqueusers-request at supercluster.org > Sent: 17 November 2011 17:20 > To: torqueusers at supercluster.org > Subject: torqueusers Digest, Vol 88, Issue 14 > > Send torqueusers mailing list submissions to > > torqueusers at supercluster.org > > To subscribe or unsubscribe via the World Wide Web, visit > > http://www.supercluster.org/mailman/listinfo/torqueusers > > or, via email, send a message with subject or body 'help' to > > torqueusers-request at supercluster.org > > You can reach the person managing the list at > > torqueusers-owner at supercluster.org > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of torqueusers digest..." > > Today's Topics: > > 1. Re: Random SCP errors when transfering to/from CREAM sandbox > > (Christopher Samuel) > > 2. Re: Random SCP errors when transfering to/from CREAM sandbox > > (Gila Arrondo Miguel Angel) > > 3. Parallel processing for MC code > > (RB. Ezhilalan (Principal Physicist, CUH)) > > 4. Re: Parallel processing for MC code (Jason Bacon) > > 5. Re: File staging syntax (Steve Traylen) > > ---------------------------------------------------------------------- > > Message: 1 > > Date: Thu, 17 Nov 2011 13:29:44 +1100 > > From: Christopher Samuel > > Subject: Re: [torqueusers] Random SCP errors when transfering to/from > > CREAM sandbox > > To: torqueusers at supercluster.org > > Message-ID: <4EC47198.1040709 at unimelb.edu.au> > > Content-Type: text/plain; charset=ISO-8859-1 > > -----BEGIN PGP SIGNED MESSAGE----- > > Hash: SHA1 > > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote: > > > Many thanks for your answer. We've made sure that the > > > keys are okay, as well as disabling hoskeychecking to > > > test it. > > Can you try and scp as that user to see whether it > > complains about anything else ? 
> > It may be that it is prompting the user to accept a > > host key if they don't already have it. > > cheers, > > Chris > > - -- > > Christopher Samuel - Senior Systems Administrator > > VLSCI - Victorian Life Sciences Computation Initiative > > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > > http://www.vlsci.unimelb.edu.au/ > > -----BEGIN PGP SIGNATURE----- > > Version: GnuPG v1.4.11 (GNU/Linux) > > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW > > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS > > =VPqK > > -----END PGP SIGNATURE----- > > ------------------------------ > > Message: 2 > > Date: Thu, 17 Nov 2011 07:55:50 +0000 > > From: "Gila Arrondo Miguel Angel" > > Subject: Re: [torqueusers] Random SCP errors when transfering to/from > > CREAM sandbox > > To: Torque Users Mailing List > > Message-ID: <36DEB2B3-4C2B-4B95-8CE6-DFB1363A71EE at cscs.ch> > > Content-Type: text/plain; charset="us-ascii" > > Hi Chris, > > I've done that in many WNs and with different users, so I don't think > that is be the issue. I've also checked for scheduled tasks that > interact with the ssh keys, but the errors happen at random times, not > when the scheduled tasks run... :-S > > I'm running out of options here. > > Cheers, > > Miguel > > On Nov 17, 2011, at 3:29 AM, Christopher Samuel wrote: > > > -----BEGIN PGP SIGNED MESSAGE----- > > > Hash: SHA1 > > > > > > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote: > > > > > >> Many thanks for your answer. We've made sure that the > > >> keys are okay, as well as disabling hoskeychecking to > > >> test it. > > > > > > Can you try and scp as that user to see whether it > > > complains about anything else ? > > > > > > It may be that it is prompting the user to accept a > > > host key if they don't already have it. 
> > > > > > cheers, > > > Chris > > > - -- > > > Christopher Samuel - Senior Systems Administrator > > > VLSCI - Victorian Life Sciences Computation Initiative > > > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > > > http://www.vlsci.unimelb.edu.au/ > > > > > > -----BEGIN PGP SIGNATURE----- > > > Version: GnuPG v1.4.11 (GNU/Linux) > > > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > > > > > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW > > > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS > > > =VPqK > > > -----END PGP SIGNATURE----- > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > -- > > Miguel Gila > > CSCS Swiss National Supercomputing Centre > > HPC Solutions > > Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland > > miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22 > > -------------- next part -------------- > > A non-text attachment was scrubbed... > > Name: smime.p7s > > Type: application/pkcs7-signature > > Size: 3239 bytes > > Desc: not available > > Url : > http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/2 14ea9d6/attachment-0001.bin > > ------------------------------ > > Message: 3 > > Date: Thu, 17 Nov 2011 10:14:32 -0000 > > From: "RB. Ezhilalan (Principal Physicist, CUH)" > > Subject: [torqueusers] Parallel processing for MC code > > To: torqueusers at supercluster.org > > Message-ID: > > > > Content-Type: text/plain; charset="us-ascii" > > Hi All, > > I've been trying to set up Torque queuing system on two SUSE10.1 linux > > PCs (PIII!). > > Installed the linux on both PCs, exported home directory containing > > BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH password > > less communication. All seems to be working fine. 
> > Downloaded latest version of Torque (number not handy) installed > > PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2. > > PBS 'nodes' file was created as per guidelines, PBS_SERVER and QUEUE > > attributes were set as default. > > Pbsnodes -a command displays- two nodes (PC1 & PC2 and they are free. I > > am not sure whether this confirms PBS/Torque set up correctly. > > I was able to run an executable BEAMnrc user code in batch mode i.e > > using 'exb' command aliased to 'qsub' and sources a built in job script > > file with option p=1 (single job). > > To split the jobs in to two, so that it runs in parallel on the two PCs, > > option p=2 should be issued. However, what I noticed was, the job ran > > twice on the first PC (PC1) but not on both. > > I can't figure out what went wrong, I suspect PBS setup could have some > > issues, May be I can try running the job specifically on PC2 if so what > > command I need to give? > > I would be grateful for any advice! > > Kind Regards, > > Ezhilalan > > -------------- next part -------------- > > An HTML attachment was scrubbed... > > URL: > http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/0 6e4a798/attachment-0001.html > > > ------------------------------ > > Message: 4 > > Date: Thu, 17 Nov 2011 10:18:18 -0600 > > From: Jason Bacon > > Subject: Re: [torqueusers] Parallel processing for MC code > > To: Torque Users Mailing List > > Message-ID: <4EC533CA.2000902 at tds.net> > > Content-Type: text/plain; charset=windows-1252; format=flowed > > How many cores does PC1 have? Note that Torque schedules cores, not > > computers, unless you specifically tell it to with resource requirements. > > Regards, > > -J > > On 11/17/11 04:14, RB. Ezhilalan (Principal Physicist, CUH) wrote: > > > > > > Hi All, > > > > > > I?ve been trying to set up Torque queuing system on two SUSE10.1 linux > > > PCs (PIII!). 
> > >
> > > Installed the linux on both PCs, exported home directory containing
> > > BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH
> > > password less communication. All seems to be working fine.
> > >
> > > Downloaded latest version of Torque (number not handy) installed
> > > PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2.
> > >
> > > PBS 'nodes' file was created as per guidelines, PBS_SERVER and QUEUE
> > > attributes were set as default.
> > >
> > > Pbsnodes -a command displays- two nodes (PC1 & PC2 and they are free.
> > > I am not sure whether this confirms PBS/Torque set up correctly.
> > >
> > > I was able to run an executable BEAMnrc user code in batch mode i.e
> > > using 'exb' command aliased to 'qsub' and sources a built in job
> > > script file with option p=1 (single job).
> > >
> > > To split the jobs in to two, so that it runs in parallel on the two
> > > PCs, option p=2 should be issued. However, what I noticed was, the job
> > > ran twice on the first PC (PC1) but not on both.
> > >
> > > I can't figure out what went wrong, I suspect PBS setup could have
> > > some issues, May be I can try running the job specifically on PC2 if
> > > so what command I need to give?
> > >
> > > I would be grateful for any advice!
> > >
> > > Kind Regards,
> > >
> > > Ezhilalan
> > >
> > >
> > > _______________________________________________
> > > torqueusers mailing list
> > > torqueusers at supercluster.org
> > > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > --
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > Jason W.
Bacon > > jwbacon at tds.net > > http://personalpages.tds.net/~jwbacon > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > ------------------------------ > > Message: 5 > > Date: Thu, 17 Nov 2011 18:19:14 +0100 > > From: Steve Traylen > > Subject: Re: [torqueusers] File staging syntax > > To: Torque Users Mailing List > > Message-ID: > > > > Keywords: CERN SpamKiller Note: -50 > > Content-Type: text/plain; charset="ISO-8859-1" > > On Thu, Sep 29, 2011 at 4:59 PM, Ken Nielson > > wrote> > > > Andr?, > > > > > > I have not yet had time to reproduce this. I did look through the > change log and there are two suspects. One is in 2.5.6, a fix for > Bugzilla 115 and the other is in 2.5.8, a fix for Bugzilla 133. > > > > > > That is as far as I am right now. I will try to get to this as soon > as I can. > > Hi Ken, > > Did you manage to track this down. It's currently making upgrading a > pain. > > Steve. > > -- > > Steve Traylen > > ------------------------------ > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > End of torqueusers Digest, Vol 88, Issue 14 > > ******************************************* > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Jason W. Bacon jwbacon at tds.net http://personalpages.tds.net/~jwbacon ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ------------------------------ Message: 2 Date: Fri, 18 Nov 2011 09:33:12 -0500 From: Lance Westerhoff Subject: Re: [torqueusers] procs= not working as documented To: torqueusers at supercluster.org Message-ID: Content-Type: text/plain; charset=us-ascii The request that is placed is for procs=60. 
Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway). I'm actually beginning to think the problem may be related to maui. Perhaps I'll post this same question to the maui list and see what comes back. This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system). -Lance > > Message: 3 > Date: Thu, 17 Nov 2011 17:29:17 -0800 > From: "Brock Palen" > Subject: Re: [torqueusers] procs= not working as documented > To: "Torque Users Mailing List" > Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> > Content-Type: text/plain; charset="utf-8" > > Does maui only see one cpu or does mpiexec only see one cpu? > > > > Brock Palen > (734)936-1985 > brockp at umich.edu > - Sent from my Palm Pre, please excuse typos > On Nov 17, 2011 3:19 PM, Lance Westerhoff <lance at quantumbioinc.com> wrote: > > > > Hello All- > > > > It appears that when running with the following specs, the procs= option does not actually work as expected. > > > > ========================================== > > > > #PBS -S /bin/bash > > #PBS -l procs=60 > > #PBS -l pmem=700mb > > #PBS -l walltime=744:00:00 > > #PBS -j oe > > #PBS -q batch > > > > torque version: tried 3.0.2. 
in v2.5.4, I think the procs option worked as documented > > maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU) > > > > ========================================== > > > > If there are fewer then 60 processors available in the cluster (in this case there were 53 available) the job will go in an take whatever is left instead of waiting for all 60 processors to free up. Any thoughts as to why this might be happening? Sometimes it doesn't really matter and 53 would be almost as good as 60, however if only 2 processors are available and the user asks for 60, I would hate for him to go in. > > > > Thank you for your time! > > > > -Lance > > > > ------------------------------ Message: 3 Date: Fri, 18 Nov 2011 09:47:24 -0500 From: Steve Crusan Subject: Re: [torqueusers] procs= not working as documented To: Torque Users Mailing List Message-ID: Content-Type: text/plain; charset=us-ascii -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: > The request that is placed is for procs=60. Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway). Hi Lance, Can you post the output of checkjob of an incorrectly running job. Let's take a look at what Maui thinks the job is asking for. Might as well add your maui.cfg file also. I've found in the past that procs= is troublesome... > > I'm actually beginning to think the problem may be related to maui. 
Perhaps I'll post this same question to the maui list and see what comes back.
>
> This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system).

Agreed. HPC cluster job management should normally be set-it-and-forget-it. Anything other than maintenance, break fixes, and new features would be ridiculously time consuming.

>
> -Lance
>
>
>>
>> Message: 3
>> Date: Thu, 17 Nov 2011 17:29:17 -0800
>> From: "Brock Palen"
>> Subject: Re: [torqueusers] procs= not working as documented
>> To: "Torque Users Mailing List"
>> Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Does maui only see one cpu or does mpiexec only see one cpu?
>>
>>
>>
>> Brock Palen
>> (734)936-1985
>> brockp at umich.edu
>> - Sent from my Palm Pre, please excuse typos
>> On Nov 17, 2011 3:19 PM, Lance Westerhoff <lance at quantumbioinc.com> wrote:
>>
>>
>>
>> Hello All-
>>
>>
>>
>> It appears that when running with the following specs, the procs= option does not actually work as expected.
>>
>>
>>
>> ==========================================
>>
>>
>>
>> #PBS -S /bin/bash
>>
>> #PBS -l procs=60
>>
>> #PBS -l pmem=700mb
>>
>> #PBS -l walltime=744:00:00
>>
>> #PBS -j oe
>>
>> #PBS -q batch
>>
>>
>>
>> torque version: tried 3.0.2. in v2.5.4, I think the procs option worked as documented
>>
>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU)
>>
>>
>>
>> ==========================================
>>
>>
>>
>> If there are fewer than 60 processors available in the cluster (in this case there were 53 available) the job will go in and take whatever is left instead of waiting for all 60 processors to free up. Any thoughts as to why this might be happening?
Sometimes it doesn't really matter and 53 would be almost as good as 60, however if only 2 processors are available and the user asks for 60, I would hate for him to go in. >> >> >> >> Thank you for your time! >> >> >> >> -Lance >> >> >> >> > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= =AGW7 -----END PGP SIGNATURE----- ------------------------------ Message: 4 Date: Fri, 18 Nov 2011 11:12:06 -0500 From: Lance Westerhoff Subject: Re: [torqueusers] procs= not working as documented To: Torque Users Mailing List Message-ID: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE at quantumbioinc.com> Content-Type: text/plain; charset=us-ascii Hi Steve- Here you go. Here is the top few lines of the job script. I have then provided the output you requested long with the maui.cfg. If you need anything further, certainly please let me know. Thanks for your help! 
===============
+ head job.pbs
#!/bin/bash
#PBS -S /bin/bash
#PBS -l procs=100
#PBS -l pmem=700mb
#PBS -l walltime=744:00:00
#PBS -j oe
#PBS -q batch

Report run on Fri Nov 18 10:49:38 EST 2011

+ pbsnodes --version
version: 3.0.2

+ diagnose --version
maui client version 3.2.6p21

+ checkjob 371010
checking job 371010

State: Running
Creds: user:josh group:games class:batch qos:DEFAULT
WallTime: 00:02:35 of 31:00:00:00
SubmitTime: Fri Nov 18 10:46:33
(Time Queued Total: 00:00:01 Eligible: 00:00:01)

StartTime: Fri Nov 18 10:46:34
Total Tasks: 1

Req[0] TaskCount: 26 Partition: DEFAULT
Network: [NONE] Memory >= 700M Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Dedicated Resources Per Task: PROCS: 1 MEM: 700M
NodeCount: 10
Allocated Nodes:
[compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3]
[compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2]
[compute-0-13:2][compute-0-14:2]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE

Reservation '371010' (-00:02:09 -> 30:23:57:51 Duration: 31:00:00:00)
PE: 26.00 StartPriority: 4716

+ cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]"
SERVERHOST gondor
ADMIN1 maui root
ADMIN3 ALL
RMCFG[base] TYPE=PBS
AMCFG[bank] TYPE=NONE
RMPOLLINTERVAL 00:01:00
SERVERPORT 42559
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
QUEUETIMEWEIGHT 1
FSPOLICY DEDICATEDPS
FSDEPTH 7
FSINTERVAL 86400
FSDECAY 0.50
FSWEIGHT 200
FSUSERWEIGHT 1
FSGROUPWEIGHT 1000
FSQOSWEIGHT 1000
FSACCOUNTWEIGHT 1
FSCLASSWEIGHT 1000
USERWEIGHT 4
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY MINRESOURCE
RESERVATIONDEPTH 8
MAXJOBPERUSERPOLICY OFF
MAXJOBPERUSERCOUNT 8
MAXPROCPERUSERPOLICY OFF
MAXPROCPERUSERCOUNT 256
MAXPROCSECONDPERUSERPOLICY OFF
MAXPROCSECONDPERUSERCOUNT 36864000
MAXJOBQUEUEDPERUSERPOLICY OFF
MAXJOBQUEUEDPERUSERCOUNT 2
JOBNODEMATCHPOLICY EXACTNODE
NODEACCESSPOLICY SHARED
JOBMAXOVERRUN 99:00:00:00
DEFERCOUNT 8192
DEFERTIME 0
CLASSCFG[developer] FSTARGET=40.00+ CLASSCFG[lowprio] PRIORITY=-1000 SRCFG[developer] CLASSLIST=developer SRCFG[developer] ACCESS=dedicated SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri SRCFG[developer] STARTTIME=08:00:00 SRCFG[developer] ENDTIME=18:00:00 SRCFG[developer] TIMELIMIT=2:00:00 SRCFG[developer] RESOURCES=PROCS(8) USERCFG[DEFAULT] FSTARGET=100.0 =============== -Lance On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > > On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: > >> The request that is placed is for procs=60. Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway). > > Hi Lance, > > Can you post the output of checkjob of an incorrectly running job. Let's take a look at what Maui thinks the job is asking for. > > Might as well add your maui.cfg file also. > > I've found in the past that procs= is troublesome... > >> >> I'm actually beginning to think the problem may be related to maui. Perhaps I'll post this same question to the maui list and see what comes back. >> >> This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system). > > Agreed. HPC cluster job management is normally be set it and forget it. Anything else other than maintenance/break fixes/new features would be ridiculously time consuming. 
> >> >> -Lance >> >> >>> >>> Message: 3 >>> Date: Thu, 17 Nov 2011 17:29:17 -0800 >>> From: "Brock Palen" >>> Subject: Re: [torqueusers] procs= not working as documented >>> To: "Torque Users Mailing List" >>> Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> >>> Content-Type: text/plain; charset="utf-8" >>> >>> Does maui only see one cpu or does mpiexec only see one cpu? >>> >>> >>> >>> Brock Palen >>> (734)936-1985 >>> brockp at umich.edu >>> - Sent from my Palm Pre, please excuse typos >>> On Nov 17, 2011 3:19 PM, Lance Westerhoff <lance at quantumbioinc.com> wrote: >>> >>> >>> >>> Hello All- >>> >>> >>> >>> It appears that when running with the following specs, the procs= option does not actually work as expected. >>> >>> >>> >>> ========================================== >>> >>> >>> >>> #PBS -S /bin/bash >>> >>> #PBS -l procs=60 >>> >>> #PBS -l pmem=700mb >>> >>> #PBS -l walltime=744:00:00 >>> >>> #PBS -j oe >>> >>> #PBS -q batch >>> >>> >>> >>> torque version: tried 3.0.2. in v2.5.4, I think the procs option worked as documented >>> >>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU) >>> >>> >>> >>> ========================================== >>> >>> >>> >>> If there are fewer then 60 processors available in the cluster (in this case there were 53 available) the job will go in an take whatever is left instead of waiting for all 60 processors to free up. Any thoughts as to why this might be happening? Sometimes it doesn't really matter and 53 would be almost as good as 60, however if only 2 processors are available and the user asks for 60, I would hate for him to go in. >>> >>> >>> >>> Thank you for your time! 
>>> >>> >>> >>> -Lance >>> >>> >>> >>> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > ---------------------- > Steve Crusan > System Administrator > Center for Research Computing > University of Rochester > https://www.crc.rochester.edu/ > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG/MacGPG2 v2.0.17 (Darwin) > Comment: GPGTools - http://gpgtools.org > > iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ > bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 > cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ > tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 > JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv > Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= > =AGW7 > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers ------------------------------ _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers End of torqueusers Digest, Vol 88, Issue 16 ******************************************* ------------------------------ _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers End of torqueusers Digest, Vol 88, Issue 17 ******************************************* From torsten at synapse.sri.com Tue Nov 22 14:00:26 2011 From: torsten at synapse.sri.com (Torsten Rohlfing) Date: Tue, 22 Nov 2011 13:00:26 -0800 Subject: [torqueusers] Determine job resources in job script? Message-ID: <4ECC0D6A.8010809@synapse.sri.com> Greeting! 
Using Torque (currently running 3.0.1) Is it possible to determine the resources allocated to (or requested for) a job from inside the job script? I am particularly interested in the "vmem" resource, which I hope to use to limit Java VM heap size automatically depending on the job submission arguments. Thanks! Torsten -- Torsten Rohlfing, PhD SRI International, Neuroscience Program Senior Research Scientist 333 Ravenswood Ave, Menlo Park, CA 94025 Phone: ++1 (650) 859-3379 Fax: ++1 (650) 859-2743 torsten at synapse.sri.com http://www.stanford.edu/~rohlfing/ "Though this be madness, yet there is a method in't" From vanw at sabalcore.com Tue Nov 22 14:11:32 2011 From: vanw at sabalcore.com (Kevin Van Workum) Date: Tue, 22 Nov 2011 16:11:32 -0500 Subject: [torqueusers] Determine job resources in job script? In-Reply-To: <4ECC0D6A.8010809@synapse.sri.com> References: <4ECC0D6A.8010809@synapse.sri.com> Message-ID: On Tue, Nov 22, 2011 at 4:00 PM, Torsten Rohlfing wrote: > Greeting! > > Using Torque (currently running 3.0.1) Is it possible to determine the > resources allocated to (or requested for) a job from inside the job script? > > I am particularly interested in the "vmem" resource, which I hope to use > to limit Java VM heap size automatically depending on the job submission > arguments. > Maybe you could parse the output of "qstat -f $PBS_JOBID | grep Resource_List.vmem". > > Thanks! > Torsten > > -- > Torsten Rohlfing, PhD SRI International, Neuroscience Program > Senior Research Scientist 333 Ravenswood Ave, Menlo Park, CA 94025 > Phone: ++1 (650) 859-3379 Fax: ++1 (650) 859-2743 > torsten at synapse.sri.com http://www.stanford.edu/~rohlfing/ > > "Though this be madness, yet there is a method in't" > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Kevin Van Workum, PhD Sabalcore Computing Inc. Run your code on 500 processors. 
Sign up for a free trial account. www.sabalcore.com 877-492-8027 ext. 11 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111122/ae1b6b15/attachment.html From bdandrus at nps.edu Tue Nov 22 14:15:29 2011 From: bdandrus at nps.edu (Andrus, Brian Contractor) Date: Tue, 22 Nov 2011 21:15:29 +0000 Subject: [torqueusers] Limit max jobs submitted Message-ID: All, Ok, using torque/moab latest versions. Have someone submitting array jobs with t>100000 This causes much grief. How do I limit a user from having X number of total jobs in the queue (to include each individual array element)? I see MAXJOBS, but that doesn't do it. Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111122/90ea7a1e/attachment.html From lloyd_brown at byu.edu Tue Nov 22 14:20:39 2011 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Tue, 22 Nov 2011 14:20:39 -0700 Subject: [torqueusers] Limit max jobs submitted In-Reply-To: References: Message-ID: <4ECC1227.4000505@byu.edu> You could try one of the per-queue limits (http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/4.1queueconfig.php). In particular: - max_queuable - max_running - max_user_queuable - max_user_run Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 11/22/2011 02:15 PM, Andrus, Brian Contractor wrote: > All, > > > > Ok, using torque/moab latest versions. > > > > Have someone submitting array jobs with t>100000 > > This causes much grief. > > > > How do I limit a user from having X number of total jobs in the queue > (to include each individual array element)? > > > > I see MAXJOBS, but that doesn't do it.
> > > > > > Brian Andrus > > ITACS/Research Computing > > Naval Postgraduate School > > Monterey, California > > voice: 831-656-6238 > > > > > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From glen.beane at gmail.com Tue Nov 22 14:23:08 2011 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Tue, 22 Nov 2011 16:23:08 -0500 Subject: [torqueusers] Limit max jobs submitted In-Reply-To: <4ECC1227.4000505@byu.edu> References: <4ECC1227.4000505@byu.edu> Message-ID: On Nov 22, 2011, at 4:20 PM, Lloyd Brown wrote: > You could try one of the per-queue limits > (http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/4.1queueconfig.php). > In particular: > > - max_queuable > - max_running > - max_user_queuable > - max_user_run > As of TORQUE 2.5 these should work correctly with job arrays > > > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > On 11/22/2011 02:15 PM, Andrus, Brian Contractor wrote: >> All, >> >> >> >> Ok, using torque/moab latest versions. >> >> >> >> Have someone submitting array jobs with t>100000 >> >> This causes much grief. >> >> >> >> How to I limit a user from having X number of total jobs in the queue >> (to include each individual array element)? >> >> >> >> I see MAXJOBS, but that doesn?t do it. 
>> >> >> >> >> >> Brian Andrus >> >> ITACS/Research Computing >> >> Naval Postgraduate School >> >> Monterey, California >> >> voice: 831-656-6238 >> >> >> >> >> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From bdandrus at nps.edu Tue Nov 22 14:28:56 2011 From: bdandrus at nps.edu (Andrus, Brian Contractor) Date: Tue, 22 Nov 2011 21:28:56 +0000 Subject: [torqueusers] Limit max jobs submitted In-Reply-To: References: <4ECC1227.4000505@byu.edu> Message-ID: They do work, but they do not do what I need. See when someone submits >100000 array jobs, it fills up the job list that is used to schedule the jobs. That is MAXJOB tells how many jobs to work with within moab to decide priority and who to run. So if MAXJOB is set to 50000, and someone submits an array of 100000, then 1/2 of their jobs get pulled in and the rest are ignored (for now) by moab. Now along comes supersensitive.user who submits his interactive job, which will sit for way too long because moab isn't even going to schedule it. In fact, moab is ignoring it. I could set MAXJOB to 500000, but that still doesn't prevent a user from submitting too many jobs such that the list that is looked at does not over-fill. Is there a setting were if someone were to submit >X jobs (array or otherwise), torque/moab will not even allow it in? 
Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of glen.beane at gmail.com Sent: Tuesday, November 22, 2011 1:23 PM To: Torque Users Mailing List Cc: torqueusers at supercluster.org Subject: Re: [torqueusers] Limit max jobs submitted On Nov 22, 2011, at 4:20 PM, Lloyd Brown wrote: > You could try one of the per-queue limits > (http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/4.1queueconfig.php). > In particular: > > - max_queuable > - max_running > - max_user_queuable > - max_user_run > As of TORQUE 2.5 these should work correctly with job arrays > > > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > On 11/22/2011 02:15 PM, Andrus, Brian Contractor wrote: >> All, >> >> >> >> Ok, using torque/moab latest versions. >> >> >> >> Have someone submitting array jobs with t>100000 >> >> This causes much grief. >> >> >> >> How to I limit a user from having X number of total jobs in the queue >> (to include each individual array element)? >> >> >> >> I see MAXJOBS, but that doesn?t do it. 
>> >> >> >> >> >> Brian Andrus >> >> ITACS/Research Computing >> >> Naval Postgraduate School >> >> Monterey, California >> >> voice: 831-656-6238 >> >> >> >> >> >> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From glen.beane at gmail.com Tue Nov 22 14:37:04 2011 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Tue, 22 Nov 2011 16:37:04 -0500 Subject: [torqueusers] Limit max jobs submitted In-Reply-To: References: <4ECC1227.4000505@byu.edu> Message-ID: <67726FAB-5CCF-4017-A680-CF244EB28FF3@gmail.com> Sent from my iPhone On Nov 22, 2011, at 4:28 PM, "Andrus, Brian Contractor" wrote: > They do work, but they do not do what I need. > > See when someone submits >100000 array jobs, it fills up the job list that is used to schedule the jobs. > > That is MAXJOB tells how many jobs to work with within moab to decide priority and who to run. So if MAXJOB is set to 50000, and someone submits an array of 100000, then 1/2 of their jobs get pulled in and the rest are ignored (for now) by moab. > Now along comes supersensitive.user who submits his interactive job, which will sit for way too long because moab isn't even going to schedule it. In fact, moab is ignoring it. > > I could set MAXJOB to 500000, but that still doesn't prevent a user from submitting too many jobs such that the list that is looked at does not over-fill. > > Is there a setting were if someone were to submit >X jobs (array or otherwise), torque/moab will not even allow it in? 
> Yes The torque parameters Lloyd mentioned and I confirmed > > Brian Andrus > ITACS/Research Computing > Naval Postgraduate School > Monterey, California > voice: 831-656-6238 > > > > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of glen.beane at gmail.com > Sent: Tuesday, November 22, 2011 1:23 PM > To: Torque Users Mailing List > Cc: torqueusers at supercluster.org > Subject: Re: [torqueusers] Limit max jobs submitted > > > > On Nov 22, 2011, at 4:20 PM, Lloyd Brown wrote: > >> You could try one of the per-queue limits >> (http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/4.1queueconfig.php). >> In particular: >> >> - max_queuable >> - max_running >> - max_user_queuable >> - max_user_run >> > > As of TORQUE 2.5 these should work correctly with job arrays > >> >> >> Lloyd Brown >> Systems Administrator >> Fulton Supercomputing Lab >> Brigham Young University >> http://marylou.byu.edu >> >> On 11/22/2011 02:15 PM, Andrus, Brian Contractor wrote: >>> All, >>> >>> >>> >>> Ok, using torque/moab latest versions. >>> >>> >>> >>> Have someone submitting array jobs with t>100000 >>> >>> This causes much grief. >>> >>> >>> >>> How to I limit a user from having X number of total jobs in the queue >>> (to include each individual array element)? >>> >>> >>> >>> I see MAXJOBS, but that doesn?t do it. 
>>> >>> >>> >>> >>> >>> Brian Andrus >>> >>> ITACS/Research Computing >>> >>> Naval Postgraduate School >>> >>> Monterey, California >>> >>> voice: 831-656-6238 >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From lloyd_brown at byu.edu Tue Nov 22 14:38:02 2011 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Tue, 22 Nov 2011 14:38:02 -0700 Subject: [torqueusers] Limit max jobs submitted In-Reply-To: References: <4ECC1227.4000505@byu.edu> Message-ID: <4ECC163A.60505@byu.edu> Brian, There is a subtle but significant difference between the different parameters. MAXJOB is a moab parameter that specifies how many jobs it will pay attention to from its resource manager (torque, in this case). What we're talking about are torque parameters that can limit the number of queued jobs either on a per-queue basis, or a per-user, per-queue basis. For example, if you never want to allow any user to queue more than 1000 jobs, you can do something like this: qmgr -c 'set queue batch max_user_queuable = 1000' Now, I don't know how job arrays would interact with this (especially given the discrepancy between torque arrays and moab arrays), but with traditional jobs, when they have 1000 or more enqueued total (no matter the state), qsub will refuse to queue any more for that user. Sorry that wasn't clear earlier.
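For anyone wanting to apply the per-user limit described above to more than one queue, here is a minimal sketch. It assumes the attribute name max_user_queuable as spelled in the TORQUE queue documentation linked earlier in this thread; the queue name "batch" and the limit of 1000 are just the values from the example, and the helper names limit_directive and apply_limit are made up for illustration. By default it only prints the qmgr commands rather than running them.

```shell
#!/bin/sh
# Sketch: cap the number of jobs each user may have queued in one or more
# TORQUE queues. Queue names and the limit are placeholders.
# DRY_RUN=1 (the default here) only prints the qmgr commands; set DRY_RUN=0
# on the pbs_server host, as a TORQUE manager, to actually apply them.

DRY_RUN=${DRY_RUN:-1}

# Print the qmgr directive for one queue ($1) and one limit ($2).
limit_directive() {
    printf 'set queue %s max_user_queuable = %s' "$1" "$2"
}

# Print (dry run) or apply the directive for one queue.
apply_limit() {
    directive=$(limit_directive "$1" "$2")
    if [ "$DRY_RUN" = "1" ]; then
        printf "qmgr -c '%s'\n" "$directive"
    else
        qmgr -c "$directive"
    fi
}

# Add further queue names to this list as needed.
for queue in batch; do
    apply_limit "$queue" 1000
done
```

After applying it for real, qmgr -c 'list queue batch' should show the new attribute. Whether each element of a job array counts against the limit may depend on the TORQUE version, so it is worth testing with a small array first.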
Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 11/22/2011 02:28 PM, Andrus, Brian Contractor wrote: > They do work, but they do not do what I need. > > See when someone submits >100000 array jobs, it fills up the job list that is used to schedule the jobs. > > That is MAXJOB tells how many jobs to work with within moab to decide priority and who to run. So if MAXJOB is set to 50000, and someone submits an array of 100000, then 1/2 of their jobs get pulled in and the rest are ignored (for now) by moab. > Now along comes supersensitive.user who submits his interactive job, which will sit for way too long because moab isn't even going to schedule it. In fact, moab is ignoring it. > > I could set MAXJOB to 500000, but that still doesn't prevent a user from submitting too many jobs such that the list that is looked at does not over-fill. > > Is there a setting were if someone were to submit >X jobs (array or otherwise), torque/moab will not even allow it in? > > > Brian Andrus > ITACS/Research Computing > Naval Postgraduate School > Monterey, California > voice: 831-656-6238 > From scrusan at ur.rochester.edu Tue Nov 22 14:37:13 2011 From: scrusan at ur.rochester.edu (Steve Crusan) Date: Tue, 22 Nov 2011 16:37:13 -0500 Subject: [torqueusers] Limit max jobs submitted In-Reply-To: References: <4ECC1227.4000505@byu.edu> Message-ID: <1C106683-87A5-4E6B-A381-4C275CAD4B3F@ur.rochester.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Nov 22, 2011, at 4:28 PM, Andrus, Brian Contractor wrote: > They do work, but they do not do what I need. > > See when someone submits >100000 array jobs, it fills up the job list that is used to schedule the jobs. > > That is MAXJOB tells how many jobs to work with within moab to decide priority and who to run. So if MAXJOB is set to 50000, and someone submits an array of 100000, then 1/2 of their jobs get pulled in and the rest are ignored (for now) by moab. 
> Now along comes supersensitive.user who submits his interactive job, which will sit for way too long because moab isn't even going to schedule it. In fact, moab is ignoring it. I think you are looking for MAXIJOB (idle jobs). MAXIJOB was designed to prevent queue stuffing. For instance, on our systems we only allow a maximum of 50 idle jobs, meaning Moab will only actively increase priority and make attempts to schedule those jobs in the idle queue. Any jobs above the 50 are placed in a blocked state, and they are basically ignored. http://www.adaptivecomputing.com/resources/docs/mwm/6-0/6.2throttlingpolicies.php#idle > > I could set MAXJOB to 500000, but that still doesn't prevent a user from submitting too many jobs such that the list that is looked at does not over-fill. > > Is there a setting were if someone were to submit >X jobs (array or otherwise), torque/moab will not even allow it in? > > > Brian Andrus > ITACS/Research Computing > Naval Postgraduate School > Monterey, California > voice: 831-656-6238 > > > > -----Original Message----- > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of glen.beane at gmail.com > Sent: Tuesday, November 22, 2011 1:23 PM > To: Torque Users Mailing List > Cc: torqueusers at supercluster.org > Subject: Re: [torqueusers] Limit max jobs submitted > > > > On Nov 22, 2011, at 4:20 PM, Lloyd Brown wrote: > >> You could try one of the per-queue limits >> (http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/4.1queueconfig.php). >> In particular: >> >> - max_queuable >> - max_running >> - max_user_queuable >> - max_user_run >> > > As of TORQUE 2.5 these should work correctly with job arrays > >> >> >> Lloyd Brown >> Systems Administrator >> Fulton Supercomputing Lab >> Brigham Young University >> http://marylou.byu.edu >> >> On 11/22/2011 02:15 PM, Andrus, Brian Contractor wrote: >>> All, >>> >>> >>> >>> Ok, using torque/moab latest versions.
>>> >>> >>> >>> Have someone submitting array jobs with t>100000 >>> >>> This causes much grief. >>> >>> >>> >>> How to I limit a user from having X number of total jobs in the queue >>> (to include each individual array element)? >>> >>> >>> >>> I see MAXJOBS, but that doesn?t do it. >>> >>> >>> >>> >>> >>> Brian Andrus >>> >>> ITACS/Research Computing >>> >>> Naval Postgraduate School >>> >>> Monterey, California >>> >>> voice: 831-656-6238 >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers ---------------------- Steve Crusan System Administrator Center for Research Computing University of Rochester https://www.crc.rochester.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG/MacGPG2 v2.0.17 (Darwin) Comment: GPGTools - http://gpgtools.org iQEcBAEBAgAGBQJOzBYTAAoJENS19LGOpgqKnqQIAIQxzt3cSjJK5Q8cCgMtlScX YDttvfWlVaEh7dDV/zL2/MsbDiYBuTBPXb3xGOqvNcalR8SVYMJrFtRW7AkrwBpK RG3c8uzL6yqGnIEirWbEFBrdy0FK7+LRILEEg2zJx5OEX2CePECrXbLqvJwka+PX TPdSFNfYayrQ4RUxInUK7/wLI/x1oBoHq39O2yHlvbh+X043epnO2RfmvOqGW3D7 q1/S+BbIyy+ZR4NM+inXJ9FvOeSRus+ytw4ZKd1KsnX6lyxmb7dGsF9xUWmmrh+G pupR/eqOf9YhXzO0vyF2QfAzAAi53qXeziR1jda+gB+Iw8PuKDwtU79MMCMtaAg= =8w9h -----END PGP SIGNATURE----- From glen.beane at gmail.com Tue Nov 22 14:39:58 2011 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Tue, 22 Nov 2011 16:39:58 -0500 Subject: 
[torqueusers] Limit max jobs submitted In-Reply-To: <67726FAB-5CCF-4017-A680-CF244EB28FF3@gmail.com> References: <4ECC1227.4000505@byu.edu> <67726FAB-5CCF-4017-A680-CF244EB28FF3@gmail.com> Message-ID: <66C42F69-0E87-421F-BDE7-36C0A3BE1F0D@gmail.com> On Nov 22, 2011, at 4:37 PM, glen.beane at gmail.com wrote: > > > Sent from my iPhone > > On Nov 22, 2011, at 4:28 PM, "Andrus, Brian Contractor" wrote: > >> They do work, but they do not do what I need. >> >> See when someone submits >100000 array jobs, it fills up the job list that is used to schedule the jobs. >> >> That is MAXJOB tells how many jobs to work with within moab to decide priority and who to run. So if MAXJOB is set to 50000, and someone submits an array of 100000, then 1/2 of their jobs get pulled in and the rest are ignored (for now) by moab. >> Now along comes supersensitive.user who submits his interactive job, which will sit for way too long because moab isn't even going to schedule it. In fact, moab is ignoring it. >> >> I could set MAXJOB to 500000, but that still doesn't prevent a user from submitting too many jobs such that the list that is looked at does not over-fill. >> >> Is there a setting were if someone were to submit >X jobs (array or otherwise), torque/moab will not even allow it in? 
>> > > Yes > The torque parameters Lloyd mentioned and I confirmed > > Specifically you want max_user_queuable > >> >> Brian Andrus >> ITACS/Research Computing >> Naval Postgraduate School >> Monterey, California >> voice: 831-656-6238 >> >> >> >> -----Original Message----- >> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of glen.beane at gmail.com >> Sent: Tuesday, November 22, 2011 1:23 PM >> To: Torque Users Mailing List >> Cc: torqueusers at supercluster.org >> Subject: Re: [torqueusers] Limit max jobs submitted >> >> >> >> On Nov 22, 2011, at 4:20 PM, Lloyd Brown wrote: >> >>> You could try one of the per-queue limits >>> (http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/4.1queueconfig.php). >>> In particular: >>> >>> - max_queuable >>> - max_running >>> - max_user_queuable >>> - max_user_run >>> >> >> As of TORQUE 2.5 these should work correctly with job arrays >> >>> >>> >>> Lloyd Brown >>> Systems Administrator >>> Fulton Supercomputing Lab >>> Brigham Young University >>> http://marylou.byu.edu >>> >>> On 11/22/2011 02:15 PM, Andrus, Brian Contractor wrote: >>>> All, >>>> >>>> >>>> >>>> Ok, using torque/moab latest versions. >>>> >>>> >>>> >>>> Have someone submitting array jobs with t>100000 >>>> >>>> This causes much grief. >>>> >>>> >>>> >>>> How to I limit a user from having X number of total jobs in the queue >>>> (to include each individual array element)? >>>> >>>> >>>> >>>> I see MAXJOBS, but that doesn't do it.
>>>> >>>> >>>> >>>> >>>> >>>> Brian Andrus >>>> >>>> ITACS/Research Computing >>>> >>>> Naval Postgraduate School >>>> >>>> Monterey, California >>>> >>>> voice: 831-656-6238 >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> torqueusers mailing list >>>> torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers From dbeer at adaptivecomputing.com Tue Nov 22 14:41:16 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Tue, 22 Nov 2011 14:41:16 -0700 (MST) Subject: [torqueusers] Limit max jobs submitted In-Reply-To: Message-ID: <15e81119-ddba-4d12-b528-32ff4435e61d@mail> ----- Original Message ----- > They do work, but they do not do what I need. > > See when someone submits >100000 array jobs, it fills up the job list > that is used to schedule the jobs. > > That is MAXJOB tells how many jobs to work with within moab to decide > priority and who to run. So if MAXJOB is set to 50000, and someone > submits an array of 100000, then 1/2 of their jobs get pulled in and > the rest are ignored (for now) by moab. > Now along comes supersensitive.user who submits his interactive job, > which will sit for way too long because moab isn't even going to > schedule it. In fact, moab is ignoring it. > > I could set MAXJOB to 500000, but that still doesn't prevent a user > from submitting too many jobs such that the list that is looked at > does not over-fill. 
>
> Is there a setting where if someone were to submit >X jobs (array or
> otherwise), torque/moab will not even allow it in?
>

Moab will reject jobs that exceed MAXJOB if they are submitted via msub. To govern jobs that are submitted using qsub, you have to use the TORQUE parameters. It can become a little tedious to manage, especially if you are using multiple queues, but that's what is currently available to you.

Does that take care of what you're trying to do?

David

>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of
> glen.beane at gmail.com
> Sent: Tuesday, November 22, 2011 1:23 PM
> To: Torque Users Mailing List
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Limit max jobs submitted
>
> On Nov 22, 2011, at 4:20 PM, Lloyd Brown wrote:
>
> > You could try one of the per-queue limits
> > (http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/4.1queueconfig.php).
> > In particular:
> >
> > - max_queuable
> > - max_running
> > - max_user_queuable
> > - max_user_run
> >
>
> As of TORQUE 2.5 these should work correctly with job arrays
>
> >
> > Lloyd Brown
> > Systems Administrator
> > Fulton Supercomputing Lab
> > Brigham Young University
> > http://marylou.byu.edu
> >
> > On 11/22/2011 02:15 PM, Andrus, Brian Contractor wrote:
> >> All,
> >>
> >> Ok, using torque/moab latest versions.
> >>
> >> Have someone submitting array jobs with t>100000
> >>
> >> This causes much grief.
> >>
> >> How do I limit a user from having X number of total jobs in the queue
> >> (to include each individual array element)?
> >>
> >> I see MAXJOBS, but that doesn't do it.
> >> > >> > >> > >> > >> > >> Brian Andrus > >> > >> ITACS/Research Computing > >> > >> Naval Postgraduate School > >> > >> Monterey, California > >> > >> voice: 831-656-6238 > >> > >> > >> > >> > >> > >> > >> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From glen.beane at gmail.com Tue Nov 22 14:41:45 2011 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Tue, 22 Nov 2011 16:41:45 -0500 Subject: [torqueusers] Limit max jobs submitted In-Reply-To: <4ECC163A.60505@byu.edu> References: <4ECC1227.4000505@byu.edu> <4ECC163A.60505@byu.edu> Message-ID: On Nov 22, 2011, at 4:38 PM, Lloyd Brown wrote: > Brian, > > There is a subtle but significant difference between the different > parameters. > > MAXJOB is a moab parameter that specifies how many jobs it will pay > attention to from it's resource manager (torque, in this case). > > What we're talking about are torque parameters that can limit the number > of queued jobs either on a per-queue basis, or a per-user, per-queue basis. 
>
> For example, if you never want to allow any user to queue more than 1000
> jobs, you can do something like this:
>
> qmgr -c 'set queue batch max_user_queuable = 1000'
>
> Now, I don't know how job arrays would interact with this (especially
> given the discrepancy between torque arrays and moab arrays), but with
> traditional jobs, when they have 1000 or more enqueued total (no matter
> the state), qsub will refuse to queue any more for that user.
>

As of torque 2.5 these attributes are checked during array submission

> Sorry that wasn't clear earlier.
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
>
> On 11/22/2011 02:28 PM, Andrus, Brian Contractor wrote:
>> They do work, but they do not do what I need.
>>
>> See when someone submits >100000 array jobs, it fills up the job list that is used to schedule the jobs.
>>
>> That is MAXJOB tells how many jobs to work with within moab to decide priority and who to run. So if MAXJOB is set to 50000, and someone submits an array of 100000, then 1/2 of their jobs get pulled in and the rest are ignored (for now) by moab.
>> Now along comes supersensitive.user who submits his interactive job, which will sit for way too long because moab isn't even going to schedule it. In fact, moab is ignoring it.
>>
>> I could set MAXJOB to 500000, but that still doesn't prevent a user from submitting too many jobs such that the list that is looked at does not over-fill.
>>
>> Is there a setting where if someone were to submit >X jobs (array or otherwise), torque/moab will not even allow it in?
>>
>> Brian Andrus
>> ITACS/Research Computing
>> Naval Postgraduate School
>> Monterey, California
>> voice: 831-656-6238
>>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From dbeer at adaptivecomputing.com Tue Nov 22 14:54:04 2011
From: dbeer at adaptivecomputing.com (David Beer)
Date: Tue, 22 Nov 2011 14:54:04 -0700 (MST)
Subject: [torqueusers] Limit max jobs submitted
In-Reply-To: <1C106683-87A5-4E6B-A381-4C275CAD4B3F@ur.rochester.edu>
Message-ID:

----- Original Message -----
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Nov 22, 2011, at 4:28 PM, Andrus, Brian Contractor wrote:
>
> > They do work, but they do not do what I need.
> >
> > See when someone submits >100000 array jobs, it fills up the job
> > list that is used to schedule the jobs.
> >
> > That is MAXJOB tells how many jobs to work with within moab to
> > decide priority and who to run. So if MAXJOB is set to 50000, and
> > someone submits an array of 100000, then 1/2 of their jobs get
> > pulled in and the rest are ignored (for now) by moab.
> > Now along comes supersensitive.user who submits his interactive
> > job, which will sit for way too long because moab isn't even going
> > to schedule it. In fact, moab is ignoring it.
>
> I think you are looking for MAXIJOB (idle jobs). MAXIJOB was
> designed to prevent queue stuffing. For instance on our systems, we
> only allow a maximum of 50 idle jobs, meaning Moab will only
> actively increase priority and make attempts to schedule those jobs
> in the idle queue. Any jobs above the 50 are placed in a blocked
> state, and they are basically ignored.
>
> http://www.adaptivecomputing.com/resources/docs/mwm/6-0/6.2throttlingpolicies.php#idle
>

This won't completely reject the jobs, but it will help. It will prevent a user from stuffing the queue with jobs, but it won't get Moab to reject things submitted via qsub.
Generally speaking, it is not possible for Moab to do that.

David

> >
> > I could set MAXJOB to 500000, but that still doesn't prevent a user
> > from submitting too many jobs such that the list that is looked at
> > does not over-fill.
> >
> > Is there a setting where if someone were to submit >X jobs (array or
> > otherwise), torque/moab will not even allow it in?
> >
> > Brian Andrus
> > ITACS/Research Computing
> > Naval Postgraduate School
> > Monterey, California
> > voice: 831-656-6238
> >
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org
> > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of
> > glen.beane at gmail.com
> > Sent: Tuesday, November 22, 2011 1:23 PM
> > To: Torque Users Mailing List
> > Cc: torqueusers at supercluster.org
> > Subject: Re: [torqueusers] Limit max jobs submitted
> >
> > On Nov 22, 2011, at 4:20 PM, Lloyd Brown wrote:
> >
> >> You could try one of the per-queue limits
> >> (http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/4.1queueconfig.php).
> >> In particular:
> >>
> >> - max_queuable
> >> - max_running
> >> - max_user_queuable
> >> - max_user_run
> >>
> >
> > As of TORQUE 2.5 these should work correctly with job arrays
> >
> >>
> >> Lloyd Brown
> >> Systems Administrator
> >> Fulton Supercomputing Lab
> >> Brigham Young University
> >> http://marylou.byu.edu
> >>
> >> On 11/22/2011 02:15 PM, Andrus, Brian Contractor wrote:
> >>> All,
> >>>
> >>> Ok, using torque/moab latest versions.
> >>>
> >>> Have someone submitting array jobs with t>100000
> >>>
> >>> This causes much grief.
> >>>
> >>> How do I limit a user from having X number of total jobs in the queue
> >>> (to include each individual array element)?
> >>>
> >>> I see MAXJOBS, but that doesn't do it.
> >>> > >>> > >>> > >>> > >>> > >>> Brian Andrus > >>> > >>> ITACS/Research Computing > >>> > >>> Naval Postgraduate School > >>> > >>> Monterey, California > >>> > >>> voice: 831-656-6238 > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> torqueusers mailing list > >>> torqueusers at supercluster.org > >>> http://www.supercluster.org/mailman/listinfo/torqueusers > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > ---------------------- > Steve Crusan > System Administrator > Center for Research Computing > University of Rochester > https://www.crc.rochester.edu/ > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG/MacGPG2 v2.0.17 (Darwin) > Comment: GPGTools - http://gpgtools.org > > iQEcBAEBAgAGBQJOzBYTAAoJENS19LGOpgqKnqQIAIQxzt3cSjJK5Q8cCgMtlScX > YDttvfWlVaEh7dDV/zL2/MsbDiYBuTBPXb3xGOqvNcalR8SVYMJrFtRW7AkrwBpK > RG3c8uzL6yqGnIEirWbEFBrdy0FK7+LRILEEg2zJx5OEX2CePECrXbLqvJwka+PX > TPdSFNfYayrQ4RUxInUK7/wLI/x1oBoHq39O2yHlvbh+X043epnO2RfmvOqGW3D7 > q1/S+BbIyy+ZR4NM+inXJ9FvOeSRus+ytw4ZKd1KsnX6lyxmb7dGsF9xUWmmrh+G > pupR/eqOf9YhXzO0vyF2QfAzAAi53qXeziR1jda+gB+Iw8PuKDwtU79MMCMtaAg= > =8w9h > -----END PGP SIGNATURE----- > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer Direct Line: 801-717-3386 | Fax: 801-717-3738 Adaptive Computing 1712 S East Bay Blvd, Suite 300 Provo, UT 84606 From siegert at sfu.ca Tue 
Nov 22 15:45:30 2011
From: siegert at sfu.ca (Martin Siegert)
Date: Tue, 22 Nov 2011 14:45:30 -0800
Subject: [torqueusers] Limit max jobs submitted
In-Reply-To:
References: <4ECC1227.4000505@byu.edu>
Message-ID: <20111122224530.GA13351@stikine.sfu.ca>

Hi Brian,

On Tue, Nov 22, 2011 at 09:28:56PM +0000, Andrus, Brian Contractor wrote:
> They do work, but they do not do what I need.
>
> See when someone submits >100000 array jobs, it fills up the job list that is used to schedule the jobs.
>
> That is MAXJOB tells how many jobs to work with within moab to decide priority and who to run. So if MAXJOB is set to 50000, and someone submits an array of 100000, then 1/2 of their jobs get pulled in and the rest are ignored (for now) by moab.
> Now along comes supersensitive.user who submits his interactive job, which will sit for way too long because moab isn't even going to schedule it. In fact, moab is ignoring it.
>
> I could set MAXJOB to 500000, but that still doesn't prevent a user from submitting too many jobs such that the list that is looked at does not over-fill.
>
> Is there a setting where if someone were to submit >X jobs (array or otherwise), torque/moab will not even allow it in?
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238

I ran into exactly the same problem a few weeks ago.
Currently the only way to prevent a user from overloading moab and thus
preventing it from scheduling jobs in priority order is to

1) set MAXJOB to some value X
2) use

   set queue exec max_user_queuable = Y

for the execution queues AND additionally set

   set queue rte max_user_queuable = Y

for ALL routing queues that route jobs to the relevant execution queues.

Y must be much smaller than X.
Unfortunately, there currently is no limit like

   set server max_user_queuable = Y

which would be a more logical way of preventing this denial-of-service
attack against moab.
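The recipe above can be sketched as a small dry-run helper. This is only an illustration and was not posted in the thread: the function name emit_user_limits and the values 50000 and 1000 are made up, while the queue names exec and rte and the max_user_queuable attribute are the ones used in the message. It prints the qmgr commands instead of running them, and it only checks that Y is below X, which is weaker than the "much smaller" the message asks for.

```shell
#!/bin/sh
# Dry run: print the per-queue qmgr commands rather than executing them.
# emit_user_limits is a hypothetical helper name; the values are placeholders.

# usage: emit_user_limits MAXJOB_X LIMIT_Y QUEUE...
emit_user_limits() {
    maxjob=$1
    limit=$2
    shift 2
    # The message says Y must be much smaller than X; this sketch only
    # rejects the clearly wrong case Y >= X.
    if [ "$limit" -ge "$maxjob" ]; then
        echo "error: per-user limit $limit must be below MAXJOB $maxjob" >&2
        return 1
    fi
    # One directive per queue: execution queues and every routing queue
    # that feeds them.
    for q in "$@"; do
        echo "qmgr -c 'set queue $q max_user_queuable = $limit'"
    done
}

# Example: MAXJOB X = 50000 in Moab, Y = 1000 on queues "exec" and "rte".
emit_user_limits 50000 1000 exec rte
```

Reviewing the printed lines before pasting them into a shell on the server keeps the sketch harmless; MAXJOB itself is a Moab setting and is not touched here.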
Cheers,
Martin

--
Martin Siegert
Simon Fraser University

From bdandrus at nps.edu Tue Nov 22 17:53:13 2011
From: bdandrus at nps.edu (Andrus, Brian Contractor)
Date: Wed, 23 Nov 2011 00:53:13 +0000
Subject: [torqueusers] Limit max jobs submitted
In-Reply-To: <20111122224530.GA13351@stikine.sfu.ca>
References: <4ECC1227.4000505@byu.edu> <20111122224530.GA13351@stikine.sfu.ca>
Message-ID:

Martin,

Thanks! That looks like what I need.
I have a routing queue and folks use qsub rather than msub. So I will limit how many they are allowed to have in the routing queue.
Hopefully this will do the trick.

Brian

On Nov 22, 2011, at 2:45 PM, Martin Siegert wrote:

> Hi Brian,
>
> On Tue, Nov 22, 2011 at 09:28:56PM +0000, Andrus, Brian Contractor wrote:
>> They do work, but they do not do what I need.
>>
>> See when someone submits >100000 array jobs, it fills up the job list that is used to schedule the jobs.
>>
>> That is MAXJOB tells how many jobs to work with within moab to decide priority and who to run. So if MAXJOB is set to 50000, and someone submits an array of 100000, then 1/2 of their jobs get pulled in and the rest are ignored (for now) by moab.
>> Now along comes supersensitive.user who submits his interactive job, which will sit for way too long because moab isn't even going to schedule it. In fact, moab is ignoring it.
>>
>> I could set MAXJOB to 500000, but that still doesn't prevent a user from submitting too many jobs such that the list that is looked at does not over-fill.
>>
>> Is there a setting where if someone were to submit >X jobs (array or otherwise), torque/moab will not even allow it in?
>>
>> Brian Andrus
>> ITACS/Research Computing
>> Naval Postgraduate School
>> Monterey, California
>> voice: 831-656-6238
>
> I ran into exactly the same problem a few weeks ago.
> Currently the only way to prevent a user from overloading moab and thus > preventing it from scheduling jobs in priority order is to > > 1) set MAXJOB to some value X > 2) use > > set queue exec max_user_queuable = Y > > for the execution queues AND additionally set > > set queue rte max_user_queuable = Y > > for ALL routing queues that route jobs to the relevant execution queues. > > Y must be much smaller than X. > Unfortunately, there currently is no limit like > > set server max_user_queuable = Y > > which would be a more logical way of preventing this denial-of-service > attack against moab. > > Cheers, > Martin > > -- > Martin Siegert > Simon Fraser University From glen.beane at gmail.com Tue Nov 22 18:24:55 2011 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Tue, 22 Nov 2011 20:24:55 -0500 Subject: [torqueusers] Limit max jobs submitted In-Reply-To: <20111122224530.GA13351@stikine.sfu.ca> References: <4ECC1227.4000505@byu.edu> <20111122224530.GA13351@stikine.sfu.ca> Message-ID: On Nov 22, 2011, at 5:45 PM, Martin Siegert wrote: > Hi Brian, > > On Tue, Nov 22, 2011 at 09:28:56PM +0000, Andrus, Brian Contractor wrote: >> They do work, but they do not do what I need. >> >> See when someone submits >100000 array jobs, it fills up the job list that is used to schedule the jobs. >> >> That is MAXJOB tells how many jobs to work with within moab to decide priority and who to run. So if MAXJOB is set to 50000, and someone submits an array of 100000, then 1/2 of their jobs get pulled in and the rest are ignored (for now) by moab. >> Now along comes supersensitive.user who submits his interactive job, which will sit for way too long because moab isn't even going to schedule it. In fact, moab is ignoring it. >> >> I could set MAXJOB to 500000, but that still doesn't prevent a user from submitting too many jobs such that the list that is looked at does not over-fill. 
>> >> Is there a setting were if someone were to submit >X jobs (array or otherwise), torque/moab will not even allow it in? >> >> >> Brian Andrus >> ITACS/Research Computing >> Naval Postgraduate School >> Monterey, California >> voice: 831-656-6238 > > I ran into exactly the same problem a few weeks ago. > Currently the only way to prevent a user from overloading moab and thus > preventing it from scheduling jobs in priority order is to > > 1) set MAXJOB to some value X > 2) use > > set queue exec max_user_queuable = Y > > for the execution queues AND additionally set > > set queue rte max_user_queuable = Y > > for ALL routing queues that route jobs to the relevant execution queues. > > Y must be much smaller than X. > Unfortunately, there currently is no limit like > > set server max_user_queuable = Y > This is a reasonable feature request. I have been thinking about implementing it for a few years... > which would be a more logical way of preventing this denial-of-service > attack against moab. > > Cheers, > Martin > > -- > Martin Siegert > Simon Fraser University > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From samuel at unimelb.edu.au Tue Nov 22 20:03:18 2011 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 23 Nov 2011 14:03:18 +1100 Subject: [torqueusers] Limit max jobs submitted In-Reply-To: <1C106683-87A5-4E6B-A381-4C275CAD4B3F@ur.rochester.edu> References: <4ECC1227.4000505@byu.edu> <1C106683-87A5-4E6B-A381-4C275CAD4B3F@ur.rochester.edu> Message-ID: <4ECC6276.9030401@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 23/11/11 08:37, Steve Crusan wrote: > I think your are looking for MAXIJOB (idle jobs). 
Note that if you use MAXIJOB you might want to also set (in the moab.cfg on the login nodes): DISPLAYFLAGS HIDEBLOCKED This will mean users won't see other users blocked jobs with showq (makes them less worried that their job will ever run) but still shows users their own blocked jobs. cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk7MYnYACgkQO2KABBYQAh8KowCglCy6pyN88zKE9k5Nqzyhs+BW hHwAnA7ZtDn7KYyfWKlBUrI4nx/1G0F3 =shFR -----END PGP SIGNATURE----- From Gareth.Williams at csiro.au Tue Nov 22 20:28:35 2011 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Wed, 23 Nov 2011 14:28:35 +1100 Subject: [torqueusers] Limit max jobs submitted In-Reply-To: References: <4ECC1227.4000505@byu.edu> <20111122224530.GA13351@stikine.sfu.ca> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102C4425620@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: glen.beane at gmail.com [mailto:glen.beane at gmail.com] > Sent: Wednesday, 23 November 2011 12:25 PM > To: Torque Users Mailing List > Subject: Re: [torqueusers] Limit max jobs submitted > > On Nov 22, 2011, at 5:45 PM, Martin Siegert wrote: > > > Hi Brian, > > > > On Tue, Nov 22, 2011 at 09:28:56PM +0000, Andrus, Brian Contractor > wrote: > >> They do work, but they do not do what I need. > >> > >> See when someone submits >100000 array jobs, it fills up the job > list that is used to schedule the jobs. > >> > >> That is MAXJOB tells how many jobs to work with within moab to > decide priority and who to run. So if MAXJOB is set to 50000, and > someone submits an array of 100000, then 1/2 of their jobs get pulled > in and the rest are ignored (for now) by moab. 
> >> Now along comes supersensitive.user who submits his interactive job, which will sit for way too long because moab isn't even going to schedule it. In fact, moab is ignoring it.
> >>
> >> I could set MAXJOB to 500000, but that still doesn't prevent a user from submitting too many jobs such that the list that is looked at does not over-fill.
> >>
> >> Is there a setting where if someone were to submit >X jobs (array or otherwise), torque/moab will not even allow it in?
> >>
> >> Brian Andrus
> >> ITACS/Research Computing
> >> Naval Postgraduate School
> >> Monterey, California
> >> voice: 831-656-6238
> >
> > I ran into exactly the same problem a few weeks ago.
> > Currently the only way to prevent a user from overloading moab and thus
> > preventing it from scheduling jobs in priority order is to
> >
> > 1) set MAXJOB to some value X
> > 2) use
> >
> > set queue exec max_user_queuable = Y
> >
> > for the execution queues AND additionally set
> >
> > set queue rte max_user_queuable = Y
> >
> > for ALL routing queues that route jobs to the relevant execution queues.
> >
> > Y must be much smaller than X.
> > Unfortunately, there currently is no limit like
> >
> > set server max_user_queuable = Y
> >
>
> This is a reasonable feature request. I have been thinking about
> implementing it for a few years...

Hi Glen (and all),

I find your statement confusing. Do you mean you have been thinking about adding this to your torque configuration, or are you thinking of implementing a new feature in torque?

On a related note, I posted a while ago that such a routing queue setup with limits does not play nicely with job dependencies. I'd probably move to such a setup if this issue were resolved. There would still be a potential issue of a user wanting to submit higher priority jobs when they already have filled their limit. I guess they have to hold their existing queued jobs if they want to handle such a situation.
It would be nice if the scheduler was more graceful about dealing with many jobs but I can see there are technical issues that make that hard. We recently had to increase limits in moab to get more jobs considered but then ran out of memory which was a worse issue. Gareth > > > > > > which would be a more logical way of preventing this denial-of- > service > > attack against moab. > > > > Cheers, > > Martin > > > > -- > > Martin Siegert > > Simon Fraser University > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers From glen.beane at gmail.com Tue Nov 22 20:49:27 2011 From: glen.beane at gmail.com (glen.beane at gmail.com) Date: Tue, 22 Nov 2011 22:49:27 -0500 Subject: [torqueusers] Limit max jobs submitted In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102C4425620@exvic-mbx04.nexus.csiro.au> References: <4ECC1227.4000505@byu.edu> <20111122224530.GA13351@stikine.sfu.ca> <007DECE986B47F4EABF823C1FBB19C620102C4425620@exvic-mbx04.nexus.csiro.au> Message-ID: <4CA41F2E-21B3-444C-83DE-7585507E9DB3@gmail.com> On Nov 22, 2011, at 10:28 PM, wrote: >> -----Original Message----- >> From: glen.beane at gmail.com [mailto:glen.beane at gmail.com] >> Sent: Wednesday, 23 November 2011 12:25 PM >> To: Torque Users Mailing List >> Subject: Re: [torqueusers] Limit max jobs submitted >> >> On Nov 22, 2011, at 5:45 PM, Martin Siegert wrote: >> >>> Hi Brian, >>> >>> On Tue, Nov 22, 2011 at 09:28:56PM +0000, Andrus, Brian Contractor >> wrote: >>>> They do work, but they do not do what I need. >>>> >>>> See when someone submits >100000 array jobs, it fills up the job >> list that is used to schedule the jobs. >>>> >>>> That is MAXJOB tells how many jobs to work with within moab to >> decide priority and who to run. 
So if MAXJOB is set to 50000, and >> someone submits an array of 100000, then 1/2 of their jobs get pulled >> in and the rest are ignored (for now) by moab. >>>> Now along comes supersensitive.user who submits his interactive job, >> which will sit for way too long because moab isn't even going to >> schedule it. In fact, moab is ignoring it. >>>> >>>> I could set MAXJOB to 500000, but that still doesn't prevent a user >> from submitting too many jobs such that the list that is looked at does >> not over-fill. >>>> >>>> Is there a setting were if someone were to submit >X jobs (array or >> otherwise), torque/moab will not even allow it in? >>>> >>>> >>>> Brian Andrus >>>> ITACS/Research Computing >>>> Naval Postgraduate School >>>> Monterey, California >>>> voice: 831-656-6238 >>> >>> I ran into exactly the same problem a few weeks ago. >>> Currently the only way to prevent a user from overloading moab and >> thus >>> preventing it from scheduling jobs in priority order is to >>> >>> 1) set MAXJOB to some value X >>> 2) use >>> >>> set queue exec max_user_queuable = Y >>> >>> for the execution queues AND additionally set >>> >>> set queue rte max_user_queuable = Y >>> >>> for ALL routing queues that route jobs to the relevant execution >> queues. >>> >>> Y must be much smaller than X. >>> Unfortunately, there currently is no limit like >>> >>> set server max_user_queuable = Y >>> >> >> >> This is a reasonable feature request. I have been thinking about >> implementing it for a few years... > > Hi Glen (and all), > > I find your statement confusing. Do you mean you have been thinking about adding this to your torque configuration or and you thinking of implementing a new feature in torque? > New feature > On a related note, I posted a while ago that such a routing queue setup with limits does not play nicely with job dependencies. I'd probably move to such a setup if this issue were resolved. 
There would still be a potential issue of a user wanting to submit higher priority jobs when they already have filled their limit. I guess they have to hold their existing queued jobs if they want to handle such a situation.
>
> It would be nice if the scheduler was more graceful about dealing with many jobs but I can see there are technical issues that make that hard. We recently had to increase limits in moab to get more jobs considered but then ran out of memory which was a worse issue.
>
> Gareth
>
>>
>>> which would be a more logical way of preventing this denial-of-service
>>> attack against moab.
>>>
>>> Cheers,
>>> Martin
>>>
>>> --
>>> Martin Siegert
>>> Simon Fraser University
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From sm4082 at nyu.edu Wed Nov 23 10:12:18 2011
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Wed, 23 Nov 2011 12:12:18 -0500
Subject: [torqueusers] pbs_nodefile: undefined variable
Message-ID: <475A5A33-146F-4051-9C17-7289517D9D0D@nyu.edu>

Hello Everyone,

Yesterday I manually updated the nodes file in the server_priv directory. We have login nodes at the end of this file. I wanted to add new nodes, so I used the qmgr create command to add them. Pbstop started showing the new nodes after the login nodes. So I used the delete command to take them off the list, and then added the new nodes with the create command, then the login nodes. They still ended up in the same order as before. So I manually edited the file to put the login nodes at the end of the file, i.e., right after the new nodes. Do you think it could break something? Or is it caused by something else?

Since then some users are getting the error "PBS_NODEFILE: undefined variable".
I am thinking something definitely broke because of what I did. Does anyone have any idea how to fix this? Can I install the same version again on just the master node and login nodes without having to install it on all the compute nodes? All mpi jobs seem to be affected. The strange thing is that it works ok for most of the users. Yesterday, I disabled the qsub wrapper and it was ok. Now I realize it has nothing to do with the qsub wrapper, since the same user got the same error again this morning.

I would really appreciate any help.

Thanks,
Sreedhar Manchu.

---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111123/80876673/attachment.html

From ljjlp03 at gmail.com Wed Nov 23 10:18:27 2011
From: ljjlp03 at gmail.com (liu junjun)
Date: Thu, 24 Nov 2011 01:18:27 +0800
Subject: [torqueusers] what is the "Time Use" from the qstat command
Message-ID:

Dear All,

I notice that the "Time Use" field in the output of the qstat command is not the real time used since submission. What is the "Time Use" and how is it calculated?

Thanks!
Junjun
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111124/2b60c26a/attachment.html

From akohlmey at cmm.chem.upenn.edu Wed Nov 23 10:22:15 2011
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Wed, 23 Nov 2011 12:22:15 -0500
Subject: [torqueusers] what is the "Time Use" from the qstat command
In-Reply-To:
References:
Message-ID:

On Wed, Nov 23, 2011 at 12:18 PM, liu junjun wrote:
> Dear All,
> I notice that the "Time Use" field in the output of the qstat command is not
> the real time used since submission. What is the "Time Use" and how is it
> calculated?

it is the value of:

resources_used.cput = 00:02:32

i.e. the combined CPU time of the job.
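As a small illustration of the attribute named above, the sketch below pulls resources_used.* values out of `qstat -f`-style text. It is not from the thread: the helper name used_value is made up, and the sample text is a stand-in for real output. Only the cput line comes from the reply; the walltime line is an invented value, added to contrast CPU time with elapsed time.

```shell
#!/bin/sh
# used_value is a hypothetical helper: it reads `qstat -f`-style text on
# stdin and prints the value of resources_used.<name>.
# usage: used_value cput < qstat_output
used_value() {
    awk -v key="resources_used.$1" '$1 == key && $2 == "=" { print $3 }'
}

# Sample text standing in for real `qstat -f <jobid>` output.
# The walltime value is invented for the example.
sample='    resources_used.cput = 00:02:32
    resources_used.walltime = 00:10:05'

printf '%s\n' "$sample" | used_value cput       # CPU time summed over the job
printf '%s\n' "$sample" | used_value walltime   # elapsed time since start
```

On a live system the same function could presumably be fed from `qstat -f <jobid>`; that has not been tested here.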
The time since the job was started is shown by qstat -a in the Elap Time column.

axel.

> Thanks!
> Junjun
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From ljjlp03 at gmail.com Wed Nov 23 18:52:21 2011
From: ljjlp03 at gmail.com (liu junjun)
Date: Thu, 24 Nov 2011 09:52:21 +0800
Subject: [torqueusers] what is the "Time Use" from the qstat command
In-Reply-To:
References:
Message-ID:

I understand now. Thank you very much!

Regards!
Junjun

On Thu, Nov 24, 2011 at 1:22 AM, Axel Kohlmeyer wrote:

> On Wed, Nov 23, 2011 at 12:18 PM, liu junjun wrote:
> > Dear All,
> > I notice that the "Time Use" field in the output of the qstat command is not
> > the real time used since submission. What is the "Time Use" and how is it
> > calculated?
>
> it is the value of:
>
> resources_used.cput = 00:02:32
>
> i.e. the combined CPU time of the job.
>
> you see the time since the job was started
> from using qstat -a in the Elap Time column.
>
> axel.
>
> > Thanks!
> > Junjun
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
>
> --
> Dr. Axel Kohlmeyer akohlmey at gmail.com
> http://sites.google.com/site/akohlmey/
>
> Institute for Computational Molecular Science
> Temple University, Philadelphia PA, USA.
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111124/e627cc55/attachment-0001.html

From fabien.archambault at univ-provence.fr Thu Nov 24 02:39:30 2011
From: fabien.archambault at univ-provence.fr (Fabien Archambault)
Date: Thu, 24 Nov 2011 10:39:30 +0100
Subject: [torqueusers] Issue when upgrading torque from 2.5.7 to 3.0.3
Message-ID: <4ECE10D2.7070106@univ-provence.fr>

Dear torque list,

Yesterday I tried to update a torque installation from 2.5.7 to 3.0.3 in order, at minimum, to activate cpuset.

I compiled torque on the master with the same options as before (with --enable-cpuset) and did the same on a node (different architecture from the master). I also pushed all packages (torque-package-clients-linux-x86_64.sh torque-package-devel-linux-x86_64.sh torque-package-doc-linux-x86_64.sh torque-package-mom-linux-x86_64.sh) to the nodes. Then I backed up my configuration and prayed for a successful update...

To update, I did the following (CentOS 5):
- set all nodes offline
- stop pbs_server
- stop maui.d (just in case)
- stop pbs_mom on all nodes
- make install on the master
- package --install on all nodes
- start pbs_mom on all nodes
- start maui.d
- start pbs_server
- set all nodes online

First thing, all nodes were still offline. I had some messages in server_logs saying that it receives information from version 1 instead of version 2. I checked, and pbs_server --version on the master and pbs_mom --version on the nodes both reported 3.0.3. What does this message mean?

I also had errors, perhaps related, saying that it was impossible to communicate with port 0. It did not go through the right port. Are there special directives to add in version 3.x.x for the communication port?

Seeing that it could not work well, I went back to 2.5.7...

Do you think it is possible to update torque to 3.x.x without issues? Did I miss something, or is it better to update to 2.5.9?
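
For reference, the update order listed above could be written down as a dry-run script along these lines. This is only a sketch that prints each step instead of executing it; the node names, the `run` and `upgrade_plan` helpers, and the `service` init-script names are assumptions, not taken from the actual cluster, and would need adapting before any real use.

```shell
# Dry-run sketch of the update order described above: every step is only
# printed, nothing is executed. Node names and init-script names are
# placeholders (assumptions), not the real site configuration.
NODES="node01 node02"

run() { echo "+ $*"; }

upgrade_plan() {
    for n in $NODES; do run pbsnodes -o "$n"; done        # mark nodes offline
    run qterm -t quick                                     # stop pbs_server
    run service maui.d stop                                # stop the scheduler
    for n in $NODES; do run ssh "$n" service pbs_mom stop; done
    run make install                                       # on the master
    for n in $NODES; do
        run ssh "$n" ./torque-package-mom-linux-x86_64.sh --install
    done
    for n in $NODES; do run ssh "$n" service pbs_mom start; done
    run service maui.d start
    run service pbs_server start
    for n in $NODES; do run pbsnodes -c "$n"; done        # clear offline flag
}

upgrade_plan
```

Printing the plan first makes it easy to check the order of the steps (and to compare it with what was actually typed) before touching a production system.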

Thank you for any reply,
Fabien Archambault

From knielson at adaptivecomputing.com Fri Nov 25 12:04:49 2011
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 25 Nov 2011 12:04:49 -0700 (MST)
Subject: [torqueusers] torqueusers Digest, Vol 88, Issue 16
In-Reply-To:
Message-ID: <8f257a9c-64b9-4a66-a529-ed23c70b8eff@mail>

----- Original Message -----
> From: "RB. Ezhilalan (Principal Physicist, CUH)"
> To: torqueusers at supercluster.org
> Sent: Friday, November 18, 2011 9:44:30 AM
> Subject: Re: [torqueusers] torqueusers Digest, Vol 88, Issue 16
>
> Jason,
>
> I had linux-01 np=1, linux-02 np=1 in the nodes file, despite this, the
> job ran on one core (linux-01) only. Then I removed the 'np' option from
> the list under the notion, the system will 'autodetect' the cores.
>
> Ezhilalan
>
> Ezhilalan Ramalingam M.Sc.,DABR.,
> Principal Physicist (Radiotherapy),
> Medical Physics Department,
> Cork University Hospital,
> Wilton, Cork
> Ireland
> Tel. 00353 21 4922533
> Fax.00353 21 4921300
> Email: rb.ezhilalan at hse.ie

If you set the server parameter auto_node_np=TRUE, TORQUE will automatically detect core counts.

Ken Nielson
Adaptive Computing

>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of
> torqueusers-request at supercluster.org
> Sent: 18 November 2011 16:12
> To: torqueusers at supercluster.org
> Subject: torqueusers Digest, Vol 88, Issue 16
>
> Send torqueusers mailing list submissions to
> torqueusers at supercluster.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://www.supercluster.org/mailman/listinfo/torqueusers
> or, via email, send a message with subject or body 'help' to
> torqueusers-request at supercluster.org
>
> You can reach the person managing the list at
> torqueusers-owner at supercluster.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of torqueusers digest..."
> > > Today's Topics: > > 1. Re: Parallel processing for MC code (Jason Bacon) > 2. Re: procs= not working as documented (Lance Westerhoff) > 3. Re: procs= not working as documented (Steve Crusan) > 4. Re: procs= not working as documented (Lance Westerhoff) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 18 Nov 2011 07:57:23 -0600 > From: Jason Bacon > Subject: Re: [torqueusers] Parallel processing for MC code > To: Torque Users Mailing List > Message-ID: <4EC66443.3080608 at tds.net> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > > I was only wondering if you had "np=2" in the Linux-01 entry, or if > Torque was configured to autodetect the number of cores and there > were > two. That would have explained the scheduling behavior. > > Regards, > > -J > > On 11/18/11 03:48, RB. Ezhilalan (Principal Physicist, CUH) wrote: > > > > Hi Jason, > > > > PC1 (linux-01) is a single core PC like PC2, I defined the > > server_priv/nodes file as; > > > > Linux-01 > > > > Linux-02 > > > > As you have mentioned may be resource requirement needs to be > > properly > > > set up. Do you have any suggestions? 
> > > > Many thanks, > > > > Ezhilalan > > > > -----Original Message----- > > From: torqueusers-bounces at supercluster.org > > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of > > torqueusers-request at supercluster.org > > Sent: 17 November 2011 17:20 > > To: torqueusers at supercluster.org > > Subject: torqueusers Digest, Vol 88, Issue 14 > > > > Send torqueusers mailing list submissions to > > > > torqueusers at supercluster.org > > > > To subscribe or unsubscribe via the World Wide Web, visit > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > or, via email, send a message with subject or body 'help' to > > > > torqueusers-request at supercluster.org > > > > You can reach the person managing the list at > > > > torqueusers-owner at supercluster.org > > > > When replying, please edit your Subject line so it is more specific > > > > than "Re: Contents of torqueusers digest..." > > > > Today's Topics: > > > > 1. Re: Random SCP errors when transfering to/from CREAM sandbox > > > > (Christopher Samuel) > > > > 2. Re: Random SCP errors when transfering to/from CREAM sandbox > > > > (Gila Arrondo Miguel Angel) > > > > 3. Parallel processing for MC code > > > > (RB. Ezhilalan (Principal Physicist, CUH)) > > > > 4. Re: Parallel processing for MC code (Jason Bacon) > > > > 5. Re: File staging syntax (Steve Traylen) > > > > ---------------------------------------------------------------------- > > > > Message: 1 > > > > Date: Thu, 17 Nov 2011 13:29:44 +1100 > > > > From: Christopher Samuel > > > > Subject: Re: [torqueusers] Random SCP errors when transfering > > to/from > > > > CREAM sandbox > > > > To: torqueusers at supercluster.org > > > > Message-ID: <4EC47198.1040709 at unimelb.edu.au> > > > > Content-Type: text/plain; charset=ISO-8859-1 > > > > -----BEGIN PGP SIGNED MESSAGE----- > > > > Hash: SHA1 > > > > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote: > > > > > Many thanks for your answer. 
We've made sure that the > > > > > keys are okay, as well as disabling hoskeychecking to > > > > > test it. > > > > Can you try and scp as that user to see whether it > > > > complains about anything else ? > > > > It may be that it is prompting the user to accept a > > > > host key if they don't already have it. > > > > cheers, > > > > Chris > > > > - -- > > > > Christopher Samuel - Senior Systems Administrator > > > > VLSCI - Victorian Life Sciences Computation Initiative > > > > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > > > > http://www.vlsci.unimelb.edu.au/ > > > > -----BEGIN PGP SIGNATURE----- > > > > Version: GnuPG v1.4.11 (GNU/Linux) > > > > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > > > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW > > > > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS > > > > =VPqK > > > > -----END PGP SIGNATURE----- > > > > ------------------------------ > > > > Message: 2 > > > > Date: Thu, 17 Nov 2011 07:55:50 +0000 > > > > From: "Gila Arrondo Miguel Angel" > > > > Subject: Re: [torqueusers] Random SCP errors when transfering > > to/from > > > > CREAM sandbox > > > > To: Torque Users Mailing List > > > > Message-ID: <36DEB2B3-4C2B-4B95-8CE6-DFB1363A71EE at cscs.ch> > > > > Content-Type: text/plain; charset="us-ascii" > > > > Hi Chris, > > > > I've done that in many WNs and with different users, so I don't > > think > > that is be the issue. I've also checked for scheduled tasks that > > interact with the ssh keys, but the errors happen at random times, > > not > > > when the scheduled tasks run... :-S > > > > I'm running out of options here. > > > > Cheers, > > > > Miguel > > > > On Nov 17, 2011, at 3:29 AM, Christopher Samuel wrote: > > > > > -----BEGIN PGP SIGNED MESSAGE----- > > > > > Hash: SHA1 > > > > > > > > > > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote: > > > > > > > > > >> Many thanks for your answer. 
We've made sure that the > > > > >> keys are okay, as well as disabling hoskeychecking to > > > > >> test it. > > > > > > > > > > Can you try and scp as that user to see whether it > > > > > complains about anything else ? > > > > > > > > > > It may be that it is prompting the user to accept a > > > > > host key if they don't already have it. > > > > > > > > > > cheers, > > > > > Chris > > > > > - -- > > > > > Christopher Samuel - Senior Systems Administrator > > > > > VLSCI - Victorian Life Sciences Computation Initiative > > > > > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > > > > > http://www.vlsci.unimelb.edu.au/ > > > > > > > > > > -----BEGIN PGP SIGNATURE----- > > > > > Version: GnuPG v1.4.11 (GNU/Linux) > > > > > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > > > > > > > > > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW > > > > > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS > > > > > =VPqK > > > > > -----END PGP SIGNATURE----- > > > > > _______________________________________________ > > > > > torqueusers mailing list > > > > > torqueusers at supercluster.org > > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > -- > > > > Miguel Gila > > > > CSCS Swiss National Supercomputing Centre > > > > HPC Solutions > > > > Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland > > > > miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22 > > > > -------------- next part -------------- > > > > A non-text attachment was scrubbed... > > > > Name: smime.p7s > > > > Type: application/pkcs7-signature > > > > Size: 3239 bytes > > > > Desc: not available > > > > Url : > > > http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/2 > 14ea9d6/attachment-0001.bin > > > > ------------------------------ > > > > Message: 3 > > > > Date: Thu, 17 Nov 2011 10:14:32 -0000 > > > > From: "RB. 
Ezhilalan (Principal Physicist, CUH)" > > > > > > Subject: [torqueusers] Parallel processing for MC code > > > > To: torqueusers at supercluster.org > > > > Message-ID: > > > > > > > > Content-Type: text/plain; charset="us-ascii" > > > > Hi All, > > > > I've been trying to set up Torque queuing system on two SUSE10.1 > > linux > > > > PCs (PIII!). > > > > Installed the linux on both PCs, exported home directory containing > > > > BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH > password > > > > less communication. All seems to be working fine. > > > > Downloaded latest version of Torque (number not handy) installed > > > > PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2. > > > > PBS 'nodes' file was created as per guidelines, PBS_SERVER and > > QUEUE > > > > attributes were set as default. > > > > Pbsnodes -a command displays- two nodes (PC1 & PC2 and they are > > free. > I > > > > am not sure whether this confirms PBS/Torque set up correctly. > > > > I was able to run an executable BEAMnrc user code in batch mode i.e > > > > using 'exb' command aliased to 'qsub' and sources a built in job > script > > > > file with option p=1 (single job). > > > > To split the jobs in to two, so that it runs in parallel on the two > PCs, > > > > option p=2 should be issued. However, what I noticed was, the job > > ran > > > > twice on the first PC (PC1) but not on both. > > > > I can't figure out what went wrong, I suspect PBS setup could have > some > > > > issues, May be I can try running the job specifically on PC2 if so > what > > > > command I need to give? > > > > I would be grateful for any advice! > > > > Kind Regards, > > > > Ezhilalan > > > > -------------- next part -------------- > > > > An HTML attachment was scrubbed... 
> > > > URL: > > > http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/0 > 6e4a798/attachment-0001.html > > > > > > ------------------------------ > > > > Message: 4 > > > > Date: Thu, 17 Nov 2011 10:18:18 -0600 > > > > From: Jason Bacon > > > > Subject: Re: [torqueusers] Parallel processing for MC code > > > > To: Torque Users Mailing List > > > > Message-ID: <4EC533CA.2000902 at tds.net> > > > > Content-Type: text/plain; charset=windows-1252; format=flowed > > > > How many cores does PC1 have? Note that Torque schedules cores, not > > > > computers, unless you specifically tell it to with resource > requirements. > > > > Regards, > > > > -J > > > > On 11/17/11 04:14, RB. Ezhilalan (Principal Physicist, CUH) wrote: > > > > > > > > > > Hi All, > > > > > > > > > > I?ve been trying to set up Torque queuing system on two SUSE10.1 > linux > > > > > PCs (PIII!). > > > > > > > > > > Installed the linux on both PCs, exported home directory > > > containing > > > > > BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH > > > > > password less communication. All seems to be working fine. > > > > > > > > > > Downloaded latest version of Torque (number not handy) installed > > > > > PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2. > > > > > > > > > > PBS ?nodes? file was created as per guidelines, PBS_SERVER and > > > QUEUE > > > > > attributes were set as default. > > > > > > > > > > Pbsnodes ?a command displays- two nodes (PC1 & PC2 and they are > free. > > > > > I am not sure whether this confirms PBS/Torque set up correctly. > > > > > > > > > > I was able to run an executable BEAMnrc user code in batch mode > > > i.e > > > > > using ?exb? command aliased to ?qsub? and sources a built in job > > > > > script file with option p=1 (single job). > > > > > > > > > > To split the jobs in to two, so that it runs in parallel on the > > > two > > > > > PCs, option p=2 should be issued. 
However, what I noticed was, > > > the > job > > > > > ran twice on the first PC (PC1) but not on both. > > > > > > > > > > I can?t figure out what went wrong, I suspect PBS setup could > > > have > > > > > some issues, May be I can try running the job specifically on PC2 > > > if > > > > > so what command I need to give? > > > > > > > > > > I would be grateful for any advice! > > > > > > > > > > Kind Regards, > > > > > > > > > > Ezhilalan > > > > > > > > > > > > > > > _______________________________________________ > > > > > torqueusers mailing list > > > > > torqueusers at supercluster.org > > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > -- > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > Jason W. Bacon > > > > jwbacon at tds.net > > > > http://personalpages.tds.net/~jwbacon > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > ------------------------------ > > > > Message: 5 > > > > Date: Thu, 17 Nov 2011 18:19:14 +0100 > > > > From: Steve Traylen > > > > Subject: Re: [torqueusers] File staging syntax > > > > To: Torque Users Mailing List > > > > Message-ID: > > > > > > > > Keywords: CERN SpamKiller Note: -50 > > > > Content-Type: text/plain; charset="ISO-8859-1" > > > > On Thu, Sep 29, 2011 at 4:59 PM, Ken Nielson > > > > wrote> > > > > > Andr?, > > > > > > > > > > I have not yet had time to reproduce this. I did look through the > > change log and there are two suspects. One is in 2.5.6, a fix for > > Bugzilla 115 and the other is in 2.5.8, a fix for Bugzilla 133. > > > > > > > > > > That is as far as I am right now. I will try to get to this as > > > soon > > as I can. > > > > Hi Ken, > > > > Did you manage to track this down. It's currently making upgrading > > a > > pain. > > > > Steve. 
> > > > -- > > > > Steve Traylen > > > > ------------------------------ > > > > _______________________________________________ > > > > torqueusers mailing list > > > > torqueusers at supercluster.org > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > End of torqueusers Digest, Vol 88, Issue 14 > > > > ******************************************* > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > -- > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Jason W. Bacon > jwbacon at tds.net > http://personalpages.tds.net/~jwbacon > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > ------------------------------ > > Message: 2 > Date: Fri, 18 Nov 2011 09:33:12 -0500 > From: Lance Westerhoff > Subject: Re: [torqueusers] procs= not working as documented > To: torqueusers at supercluster.org > Message-ID: > Content-Type: text/plain; charset=us-ascii > > The request that is placed is for procs=60. Both torque and maui see > that there are only 53 processors available and instead of letting > the > job sit in the queue and wait for all 60 processors to become > available, > it goes ahead and runs the job with what's available. Now if the user > could ask for procs=[50-60] where 50 is the minimum number of > processors > to provide and 60 is the maximum, this would be a feature. But as it > stands, if the user asks for 60 processors and ends up with 2 > processors, the job just won't scale properly and he may as well kill > it > (when it shouldn't have run anyway). > > I'm actually beginning to think the problem may be related to maui. > Perhaps I'll post this same question to the maui list and see what > comes > back. > > This problem is infuriating though since without the functionality > working as it should, using procs=X in torque/maui makes torque/maui > work more like a submission and run system (not a queuing system). 
> > -Lance > > > > > > Message: 3 > > Date: Thu, 17 Nov 2011 17:29:17 -0800 > > From: "Brock Palen" > > Subject: Re: [torqueusers] procs= not working as documented > > To: "Torque Users Mailing List" > > Message-ID: > > <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> > > Content-Type: text/plain; charset="utf-8" > > > > Does maui only see one cpu or does mpiexec only see one cpu? > > > > > > > > Brock Palen > > (734)936-1985 > > brockp at umich.edu > > - Sent from my Palm Pre, please excuse typos > > On Nov 17, 2011 3:19 PM, Lance Westerhoff > <lance at quantumbioinc.com> wrote: > > > > > > > > Hello All- > > > > > > > > It appears that when running with the following specs, the procs= > option does not actually work as expected. > > > > > > > > ========================================== > > > > > > > > #PBS -S /bin/bash > > > > #PBS -l procs=60 > > > > #PBS -l pmem=700mb > > > > #PBS -l walltime=744:00:00 > > > > #PBS -j oe > > > > #PBS -q batch > > > > > > > > torque version: tried 3.0.2. in v2.5.4, I think the procs option > worked as documented > > > > maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete > fail in terms of the procs option and it only asks for a single CPU) > > > > > > > > ========================================== > > > > > > > > If there are fewer then 60 processors available in the cluster (in > this case there were 53 available) the job will go in an take > whatever > is left instead of waiting for all 60 processors to free up. Any > thoughts as to why this might be happening? Sometimes it doesn't > really > matter and 53 would be almost as good as 60, however if only 2 > processors are available and the user asks for 60, I would hate for > him > to go in. > > > > > > > > Thank you for your time! 
> > > > > > > > -Lance > > > > > > > > > > > > ------------------------------ > > Message: 3 > Date: Fri, 18 Nov 2011 09:47:24 -0500 > From: Steve Crusan > Subject: Re: [torqueusers] procs= not working as documented > To: Torque Users Mailing List > Message-ID: > Content-Type: text/plain; charset=us-ascii > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > > On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: > > > The request that is placed is for procs=60. Both torque and maui > > see > that there are only 53 processors available and instead of letting > the > job sit in the queue and wait for all 60 processors to become > available, > it goes ahead and runs the job with what's available. Now if the user > could ask for procs=[50-60] where 50 is the minimum number of > processors > to provide and 60 is the maximum, this would be a feature. But as it > stands, if the user asks for 60 processors and ends up with 2 > processors, the job just won't scale properly and he may as well kill > it > (when it shouldn't have run anyway). > > Hi Lance, > > Can you post the output of checkjob of an incorrectly > running job. Let's take a look at what Maui thinks the job is asking > for. > > Might as well add your maui.cfg file also. > > I've found in the past that procs= is troublesome... > > > > > I'm actually beginning to think the problem may be related to maui. > Perhaps I'll post this same question to the maui list and see what > comes > back. > > > > This problem is infuriating though since without the functionality > working as it should, using procs=X in torque/maui makes torque/maui > work more like a submission and run system (not a queuing system). > > Agreed. HPC cluster job management is normally be set it and forget > it. > Anything else other than maintenance/break fixes/new features would > be > ridiculously time consuming. 
> > > > > -Lance > > > > > >> > >> Message: 3 > >> Date: Thu, 17 Nov 2011 17:29:17 -0800 > >> From: "Brock Palen" > >> Subject: Re: [torqueusers] procs= not working as documented > >> To: "Torque Users Mailing List" > >> Message-ID: > >> <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> > >> Content-Type: text/plain; charset="utf-8" > >> > >> Does maui only see one cpu or does mpiexec only see one cpu? > >> > >> > >> > >> Brock Palen > >> (734)936-1985 > >> brockp at umich.edu > >> - Sent from my Palm Pre, please excuse typos > >> On Nov 17, 2011 3:19 PM, Lance Westerhoff > <lance at quantumbioinc.com> wrote: > >> > >> > >> > >> Hello All- > >> > >> > >> > >> It appears that when running with the following specs, the procs= > option does not actually work as expected. > >> > >> > >> > >> ========================================== > >> > >> > >> > >> #PBS -S /bin/bash > >> > >> #PBS -l procs=60 > >> > >> #PBS -l pmem=700mb > >> > >> #PBS -l walltime=744:00:00 > >> > >> #PBS -j oe > >> > >> #PBS -q batch > >> > >> > >> > >> torque version: tried 3.0.2. in v2.5.4, I think the procs option > worked as documented > >> > >> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete > fail in terms of the procs option and it only asks for a single CPU) > >> > >> > >> > >> ========================================== > >> > >> > >> > >> If there are fewer then 60 processors available in the cluster (in > this case there were 53 available) the job will go in an take > whatever > is left instead of waiting for all 60 processors to free up. Any > thoughts as to why this might be happening? Sometimes it doesn't > really > matter and 53 would be almost as good as 60, however if only 2 > processors are available and the user asks for 60, I would hate for > him > to go in. > >> > >> > >> > >> Thank you for your time! 
> >> > >> > >> > >> -Lance > >> > >> > >> > >> > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > ---------------------- > Steve Crusan > System Administrator > Center for Research Computing > University of Rochester > https://www.crc.rochester.edu/ > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG/MacGPG2 v2.0.17 (Darwin) > Comment: GPGTools - http://gpgtools.org > > iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ > bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 > cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ > tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 > JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv > Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= > =AGW7 > -----END PGP SIGNATURE----- > > > ------------------------------ > > Message: 4 > Date: Fri, 18 Nov 2011 11:12:06 -0500 > From: Lance Westerhoff > Subject: Re: [torqueusers] procs= not working as documented > To: Torque Users Mailing List > Message-ID: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE at quantumbioinc.com> > Content-Type: text/plain; charset=us-ascii > > > Hi Steve- > > Here you go. Here is the top few lines of the job script. I have then > provided the output you requested long with the maui.cfg. If you need > anything further, certainly please let me know. > > Thanks for your help! 
> > =============== > > + head job.pbs > > #!/bin/bash > #PBS -S /bin/bash > #PBS -l procs=100 > #PBS -l pmem=700mb > #PBS -l walltime=744:00:00 > #PBS -j oe > #PBS -q batch > > Report run on Fri Nov 18 10:49:38 EST 2011 > + pbsnodes --version > version: 3.0.2 > + diagnose --version > maui client version 3.2.6p21 > + checkjob 371010 > > > checking job 371010 > > State: Running > Creds: user:josh group:games class:batch qos:DEFAULT > WallTime: 00:02:35 of 31:00:00:00 > SubmitTime: Fri Nov 18 10:46:33 > (Time Queued Total: 00:00:01 Eligible: 00:00:01) > > StartTime: Fri Nov 18 10:46:34 > Total Tasks: 1 > > Req[0] TaskCount: 26 Partition: DEFAULT > Network: [NONE] Memory >= 700M Disk >= 0 Swap >= 0 > Opsys: [NONE] Arch: [NONE] Features: [NONE] > Dedicated Resources Per Task: PROCS: 1 MEM: 700M > NodeCount: 10 > Allocated Nodes: > [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3] > [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2] > [compute-0-13:2][compute-0-14:2] > > > IWD: [NONE] Executable: [NONE] > Bypass: 0 StartCount: 1 > PartitionMask: [ALL] > Flags: RESTARTABLE > > Reservation '371010' (-00:02:09 -> 30:23:57:51 Duration: > 31:00:00:00) > PE: 26.00 StartPriority: 4716 > > + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]" > SERVERHOST gondor > ADMIN1 maui root > ADMIN3 ALL > RMCFG[base] TYPE=PBS > AMCFG[bank] TYPE=NONE > RMPOLLINTERVAL 00:01:00 > SERVERPORT 42559 > SERVERMODE NORMAL > LOGFILE maui.log > LOGFILEMAXSIZE 10000000 > LOGLEVEL 3 > QUEUETIMEWEIGHT 1 > FSPOLICY DEDICATEDPS > FSDEPTH 7 > FSINTERVAL 86400 > FSDECAY 0.50 > FSWEIGHT 200 > FSUSERWEIGHT 1 > FSGROUPWEIGHT 1000 > FSQOSWEIGHT 1000 > FSACCOUNTWEIGHT 1 > FSCLASSWEIGHT 1000 > USERWEIGHT 4 > BACKFILLPOLICY FIRSTFIT > RESERVATIONPOLICY CURRENTHIGHEST > NODEALLOCATIONPOLICY MINRESOURCE > RESERVATIONDEPTH 8 > MAXJOBPERUSERPOLICY OFF > MAXJOBPERUSERCOUNT 8 > MAXPROCPERUSERPOLICY OFF > MAXPROCPERUSERCOUNT 256 > MAXPROCSECONDPERUSERPOLICY OFF > MAXPROCSECONDPERUSERCOUNT 
36864000 > MAXJOBQUEUEDPERUSERPOLICY OFF > MAXJOBQUEUEDPERUSERCOUNT 2 > JOBNODEMATCHPOLICY EXACTNODE > NODEACCESSPOLICY SHARED > JOBMAXOVERRUN 99:00:00:00 > DEFERCOUNT 8192 > DEFERTIME 0 > CLASSCFG[developer] FSTARGET=40.00+ > CLASSCFG[lowprio] PRIORITY=-1000 > SRCFG[developer] CLASSLIST=developer > SRCFG[developer] ACCESS=dedicated > SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri > SRCFG[developer] STARTTIME=08:00:00 > SRCFG[developer] ENDTIME=18:00:00 > SRCFG[developer] TIMELIMIT=2:00:00 > SRCFG[developer] RESOURCES=PROCS(8) > USERCFG[DEFAULT] FSTARGET=100.0 > > =============== > > -Lance > > > On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote: > > > -----BEGIN PGP SIGNED MESSAGE----- > > Hash: SHA1 > > > > > > On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: > > > >> The request that is placed is for procs=60. Both torque and maui > >> see > that there are only 53 processors available and instead of letting > the > job sit in the queue and wait for all 60 processors to become > available, > it goes ahead and runs the job with what's available. Now if the user > could ask for procs=[50-60] where 50 is the minimum number of > processors > to provide and 60 is the maximum, this would be a feature. But as it > stands, if the user asks for 60 processors and ends up with 2 > processors, the job just won't scale properly and he may as well kill > it > (when it shouldn't have run anyway). > > > > Hi Lance, > > > > Can you post the output of checkjob of an incorrectly > running job. Let's take a look at what Maui thinks the job is asking > for. > > > > Might as well add your maui.cfg file also. > > > > I've found in the past that procs= is troublesome... > > > >> > >> I'm actually beginning to think the problem may be related to > >> maui. > Perhaps I'll post this same question to the maui list and see what > comes > back. 
> >> > >> This problem is infuriating though since without the functionality > working as it should, using procs=X in torque/maui makes torque/maui > work more like a submission and run system (not a queuing system). > > > > Agreed. HPC cluster job management is normally be set it and forget > it. Anything else other than maintenance/break fixes/new features > would > be ridiculously time consuming. > > > >> > >> -Lance > >> > >> > >>> > >>> Message: 3 > >>> Date: Thu, 17 Nov 2011 17:29:17 -0800 > >>> From: "Brock Palen" > >>> Subject: Re: [torqueusers] procs= not working as documented > >>> To: "Torque Users Mailing List" > >>> Message-ID: > >>> <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> > >>> Content-Type: text/plain; charset="utf-8" > >>> > >>> Does maui only see one cpu or does mpiexec only see one cpu? > >>> > >>> > >>> > >>> Brock Palen > >>> (734)936-1985 > >>> brockp at umich.edu > >>> - Sent from my Palm Pre, please excuse typos > >>> On Nov 17, 2011 3:19 PM, Lance Westerhoff > <lance at quantumbioinc.com> wrote: > >>> > >>> > >>> > >>> Hello All- > >>> > >>> > >>> > >>> It appears that when running with the following specs, the procs= > option does not actually work as expected. > >>> > >>> > >>> > >>> ========================================== > >>> > >>> > >>> > >>> #PBS -S /bin/bash > >>> > >>> #PBS -l procs=60 > >>> > >>> #PBS -l pmem=700mb > >>> > >>> #PBS -l walltime=744:00:00 > >>> > >>> #PBS -j oe > >>> > >>> #PBS -q batch > >>> > >>> > >>> > >>> torque version: tried 3.0.2. 
in v2.5.4, I think the procs option > worked as documented > >>> > >>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a > >>> complete > fail in terms of the procs option and it only asks for a single CPU) > >>> > >>> > >>> > >>> ========================================== > >>> > >>> > >>> > >>> If there are fewer then 60 processors available in the cluster > >>> (in > this case there were 53 available) the job will go in an take > whatever > is left instead of waiting for all 60 processors to free up. Any > thoughts as to why this might be happening? Sometimes it doesn't > really > matter and 53 would be almost as good as 60, however if only 2 > processors are available and the user asks for 60, I would hate for > him > to go in. > >>> > >>> > >>> > >>> Thank you for your time! > >>> > >>> > >>> > >>> -Lance > >>> > >>> > >>> > >>> > >> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > > > > ---------------------- > > Steve Crusan > > System Administrator > > Center for Research Computing > > University of Rochester > > https://www.crc.rochester.edu/ > > > > > > -----BEGIN PGP SIGNATURE----- > > Version: GnuPG/MacGPG2 v2.0.17 (Darwin) > > Comment: GPGTools - http://gpgtools.org > > > > iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ > > bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 > > cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ > > tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 > > JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv > > Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= > > =AGW7 > > -----END PGP SIGNATURE----- > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > 
------------------------------ > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > End of torqueusers Digest, Vol 88, Issue 16 > ******************************************* > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From RB.Ezhilalan at hse.ie Mon Nov 28 02:11:37 2011 From: RB.Ezhilalan at hse.ie (RB. Ezhilalan (Principal Physicist, CUH)) Date: Mon, 28 Nov 2011 09:11:37 -0000 Subject: [torqueusers] torqueusers Digest, Vol 88, Issue 26 In-Reply-To: References: Message-ID: Hi Ken, Thanks for your suggestion. I currently set ...ncpus=2 and the job is now able to run on the two PCs. However, I'll try the setting you suggested when we set up a small cluster using a number of PCs. Regards, Ezhilalan Ezhilalan Ramalingam M.Sc.,DABR., Principal Physicist (Radiotherapy), Medical Physics Department, Cork University Hospital, Wilton, Cork Ireland Tel. 00353 21 4922533 Fax.00353 21 4921300 Email: rb.ezhilalan at hse.ie -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of torqueusers-request at supercluster.org Sent: 25 November 2011 19:05 To: torqueusers at supercluster.org Subject: torqueusers Digest, Vol 88, Issue 26 Send torqueusers mailing list submissions to torqueusers at supercluster.org To subscribe or unsubscribe via the World Wide Web, visit http://www.supercluster.org/mailman/listinfo/torqueusers or, via email, send a message with subject or body 'help' to torqueusers-request at supercluster.org You can reach the person managing the list at torqueusers-owner at supercluster.org When replying, please edit your Subject line so it is more specific than "Re: Contents of torqueusers digest..." Today's Topics: 1. 
Issue when upgrading torque from 2.5.7 to 3.0.3 (Fabien Archambault) 2. Re: torqueusers Digest, Vol 88, Issue 16 (Ken Nielson) ---------------------------------------------------------------------- Message: 1 Date: Thu, 24 Nov 2011 10:39:30 +0100 From: Fabien Archambault Subject: [torqueusers] Issue when upgrading torque from 2.5.7 to 3.0.3 To: torqueusers at supercluster.org Message-ID: <4ECE10D2.7070106 at univ-provence.fr> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Dear torque list, Yesterday I tried to update a torque installation from 2.5.7 to 3.0.3 in order, at minimum, to activate cpuset. I compiled torque on the master with the same options as before (with --enable-cpuset) and the same on a node (different architecture from the master). I also pushed all packages (torque-package-clients-linux-x86_64.sh torque-package-devel-linux-x86_64.sh torque-package-doc-linux-x86_64.sh torque-package-mom-linux-x86_64.sh) to the nodes. Then I backed up my configuration and prayed for a successful update... To update, I did the following (CentOS 5): - set all nodes offline - stop pbs_server - stop maui.d (just in case) - stop pbs_mom on all nodes - make install on the master - package --install on all nodes - start pbs_mom on all nodes - start maui.d - start pbs_server - set all nodes online First thing, all nodes were still offline. I had some messages in server_logs saying that it was receiving information from version 1 instead of version 2. I checked, and pbs_server --version on the master and pbs_mom --version on the nodes both reported 3.0.3. What does this message mean? I also had errors, perhaps related, saying it was impossible to communicate to port 0. It did not go through the right port. Are there special directives in version 3.x.x to add for the communication port? Seeing that it was not working, I reinstalled 2.5.7... Do you think it is possible to update torque to 3.x.x without issues, did I miss something, or is it better to update to 2.5.9? 
Thank you for any reply, Fabien Archambault ------------------------------ Message: 2 Date: Fri, 25 Nov 2011 12:04:49 -0700 (MST) From: Ken Nielson Subject: Re: [torqueusers] torqueusers Digest, Vol 88, Issue 16 To: Torque Users Mailing List Message-ID: <8f257a9c-64b9-4a66-a529-ed23c70b8eff at mail> Content-Type: text/plain; charset=utf-8 ----- Original Message ----- > From: "RB. Ezhilalan (Principal Physicist, CUH)" > To: torqueusers at supercluster.org > Sent: Friday, November 18, 2011 9:44:30 AM > Subject: Re: [torqueusers] torqueusers Digest, Vol 88, Issue 16 > > Jason, > > I had linux-01 np=1, linux-02 np=1 in the nodes file, despite this, > the > job ran on one core (linux-01) only. Then I removed the 'np' option > from > the list under the notion, the system will 'autodetect' the cores. > > Ezhilalan > > Ezhilalan Ramalingam M.Sc.,DABR., > Principal Physicist (Radiotherapy), > Medical Physics Department, > Cork University Hospital, > Wilton, Cork > Ireland > Tel. 00353 21 4922533 > Fax.00353 21 4921300 > Email: rb.ezhilalan at hse.ie If you set the server parameter auto_node_np=TRUE TORQUE will automatically detect core counts. 
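For reference, a minimal sketch of how the auto_node_np setting above could be applied with qmgr. This assumes qmgr is on the PATH and is run on the server host by a TORQUE manager account; the QMGR_CMD variable and the guard around the call are only illustrative, not part of the original advice.

```shell
#!/bin/bash
# Sketch only: turn on core-count autodetection on the TORQUE server.
# Assumption: run on the pbs_server host by root or a TORQUE manager.
QMGR_CMD='set server auto_node_np = True'

if command -v qmgr >/dev/null 2>&1; then
    # Apply the server parameter suggested in the message above.
    qmgr -c "$QMGR_CMD"
    # Print the server configuration and check that the setting is present.
    qmgr -c 'print server' | grep auto_node_np || echo "auto_node_np not set"
else
    # No TORQUE client tools on this machine: just show what would be run.
    echo "qmgr not found; would run: qmgr -c \"$QMGR_CMD\""
fi
```

With this enabled, the np values in server_priv/nodes should be filled in from what each pbs_mom reports, which can then be checked with pbsnodes -a.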
Ken Nielson Adaptive Computing > > -----Original Message----- > From: torqueusers-bounces at supercluster.org > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of > torqueusers-request at supercluster.org > Sent: 18 November 2011 16:12 > To: torqueusers at supercluster.org > Subject: torqueusers Digest, Vol 88, Issue 16 > > Send torqueusers mailing list submissions to > torqueusers at supercluster.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://www.supercluster.org/mailman/listinfo/torqueusers > or, via email, send a message with subject or body 'help' to > torqueusers-request at supercluster.org > > You can reach the person managing the list at > torqueusers-owner at supercluster.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of torqueusers digest..." > > > Today's Topics: > > 1. Re: Parallel processing for MC code (Jason Bacon) > 2. Re: procs= not working as documented (Lance Westerhoff) > 3. Re: procs= not working as documented (Steve Crusan) > 4. Re: procs= not working as documented (Lance Westerhoff) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 18 Nov 2011 07:57:23 -0600 > From: Jason Bacon > Subject: Re: [torqueusers] Parallel processing for MC code > To: Torque Users Mailing List > Message-ID: <4EC66443.3080608 at tds.net> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > > I was only wondering if you had "np=2" in the Linux-01 entry, or if > Torque was configured to autodetect the number of cores and there > were > two. That would have explained the scheduling behavior. > > Regards, > > -J > > On 11/18/11 03:48, RB. 
Ezhilalan (Principal Physicist, CUH) wrote: > > > > Hi Jason, > > > > PC1 (linux-01) is a single core PC like PC2, I defined the > > server_priv/nodes file as; > > > > Linux-01 > > > > Linux-02 > > > > As you have mentioned may be resource requirement needs to be > > properly > > > set up. Do you have any suggestions? > > > > Many thanks, > > > > Ezhilalan > > > > -----Original Message----- > > From: torqueusers-bounces at supercluster.org > > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of > > torqueusers-request at supercluster.org > > Sent: 17 November 2011 17:20 > > To: torqueusers at supercluster.org > > Subject: torqueusers Digest, Vol 88, Issue 14 > > > > Send torqueusers mailing list submissions to > > > > torqueusers at supercluster.org > > > > To subscribe or unsubscribe via the World Wide Web, visit > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > or, via email, send a message with subject or body 'help' to > > > > torqueusers-request at supercluster.org > > > > You can reach the person managing the list at > > > > torqueusers-owner at supercluster.org > > > > When replying, please edit your Subject line so it is more specific > > > > than "Re: Contents of torqueusers digest..." > > > > Today's Topics: > > > > 1. Re: Random SCP errors when transfering to/from CREAM sandbox > > > > (Christopher Samuel) > > > > 2. Re: Random SCP errors when transfering to/from CREAM sandbox > > > > (Gila Arrondo Miguel Angel) > > > > 3. Parallel processing for MC code > > > > (RB. Ezhilalan (Principal Physicist, CUH)) > > > > 4. Re: Parallel processing for MC code (Jason Bacon) > > > > 5. 
Re: File staging syntax (Steve Traylen) > > > > ---------------------------------------------------------------------- > > > > Message: 1 > > > > Date: Thu, 17 Nov 2011 13:29:44 +1100 > > > > From: Christopher Samuel > > > > Subject: Re: [torqueusers] Random SCP errors when transfering > > to/from > > > > CREAM sandbox > > > > To: torqueusers at supercluster.org > > > > Message-ID: <4EC47198.1040709 at unimelb.edu.au> > > > > Content-Type: text/plain; charset=ISO-8859-1 > > > > -----BEGIN PGP SIGNED MESSAGE----- > > > > Hash: SHA1 > > > > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote: > > > > > Many thanks for your answer. We've made sure that the > > > > > keys are okay, as well as disabling hoskeychecking to > > > > > test it. > > > > Can you try and scp as that user to see whether it > > > > complains about anything else ? > > > > It may be that it is prompting the user to accept a > > > > host key if they don't already have it. > > > > cheers, > > > > Chris > > > > - -- > > > > Christopher Samuel - Senior Systems Administrator > > > > VLSCI - Victorian Life Sciences Computation Initiative > > > > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > > > > http://www.vlsci.unimelb.edu.au/ > > > > -----BEGIN PGP SIGNATURE----- > > > > Version: GnuPG v1.4.11 (GNU/Linux) > > > > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > > > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW > > > > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS > > > > =VPqK > > > > -----END PGP SIGNATURE----- > > > > ------------------------------ > > > > Message: 2 > > > > Date: Thu, 17 Nov 2011 07:55:50 +0000 > > > > From: "Gila Arrondo Miguel Angel" > > > > Subject: Re: [torqueusers] Random SCP errors when transfering > > to/from > > > > CREAM sandbox > > > > To: Torque Users Mailing List > > > > Message-ID: <36DEB2B3-4C2B-4B95-8CE6-DFB1363A71EE at cscs.ch> > > > > Content-Type: text/plain; charset="us-ascii" > > > > Hi Chris, > > > > I've done that in many 
WNs and with different users, so I don't > > think > > that is be the issue. I've also checked for scheduled tasks that > > interact with the ssh keys, but the errors happen at random times, > > not > > > when the scheduled tasks run... :-S > > > > I'm running out of options here. > > > > Cheers, > > > > Miguel > > > > On Nov 17, 2011, at 3:29 AM, Christopher Samuel wrote: > > > > > -----BEGIN PGP SIGNED MESSAGE----- > > > > > Hash: SHA1 > > > > > > > > > > On 17/11/11 03:24, Gila Arrondo Miguel Angel wrote: > > > > > > > > > >> Many thanks for your answer. We've made sure that the > > > > >> keys are okay, as well as disabling hoskeychecking to > > > > >> test it. > > > > > > > > > > Can you try and scp as that user to see whether it > > > > > complains about anything else ? > > > > > > > > > > It may be that it is prompting the user to accept a > > > > > host key if they don't already have it. > > > > > > > > > > cheers, > > > > > Chris > > > > > - -- > > > > > Christopher Samuel - Senior Systems Administrator > > > > > VLSCI - Victorian Life Sciences Computation Initiative > > > > > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > > > > > http://www.vlsci.unimelb.edu.au/ > > > > > > > > > > -----BEGIN PGP SIGNATURE----- > > > > > Version: GnuPG v1.4.11 (GNU/Linux) > > > > > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > > > > > > > > > iEYEARECAAYFAk7EcZgACgkQO2KABBYQAh9K+ACfeFLepTpowIXW9CiK2ECr1IdW > > > > > sgcAn0cIHr3JnJORTY4g2a/PcA/11fNS > > > > > =VPqK > > > > > -----END PGP SIGNATURE----- > > > > > _______________________________________________ > > > > > torqueusers mailing list > > > > > torqueusers at supercluster.org > > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > -- > > > > Miguel Gila > > > > CSCS Swiss National Supercomputing Centre > > > > HPC Solutions > > > > Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland > > > > miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22 > > > 
> -------------- next part -------------- > > > > A non-text attachment was scrubbed... > > > > Name: smime.p7s > > > > Type: application/pkcs7-signature > > > > Size: 3239 bytes > > > > Desc: not available > > > > Url : > > > http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/2 > 14ea9d6/attachment-0001.bin > > > > ------------------------------ > > > > Message: 3 > > > > Date: Thu, 17 Nov 2011 10:14:32 -0000 > > > > From: "RB. Ezhilalan (Principal Physicist, CUH)" > > > > > > Subject: [torqueusers] Parallel processing for MC code > > > > To: torqueusers at supercluster.org > > > > Message-ID: > > > > > > > > Content-Type: text/plain; charset="us-ascii" > > > > Hi All, > > > > I've been trying to set up Torque queuing system on two SUSE10.1 > > linux > > > > PCs (PIII!). > > > > Installed the linux on both PCs, exported home directory containing > > > > BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH > password > > > > less communication. All seems to be working fine. > > > > Downloaded latest version of Torque (number not handy) installed > > > > PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2. > > > > PBS 'nodes' file was created as per guidelines, PBS_SERVER and > > QUEUE > > > > attributes were set as default. > > > > Pbsnodes -a command displays- two nodes (PC1 & PC2 and they are > > free. > I > > > > am not sure whether this confirms PBS/Torque set up correctly. > > > > I was able to run an executable BEAMnrc user code in batch mode i.e > > > > using 'exb' command aliased to 'qsub' and sources a built in job > script > > > > file with option p=1 (single job). > > > > To split the jobs in to two, so that it runs in parallel on the two > PCs, > > > > option p=2 should be issued. However, what I noticed was, the job > > ran > > > > twice on the first PC (PC1) but not on both. 
> > > > I can't figure out what went wrong, I suspect PBS setup could have > some > > > > issues, May be I can try running the job specifically on PC2 if so > what > > > > command I need to give? > > > > I would be grateful for any advice! > > > > Kind Regards, > > > > Ezhilalan > > > > -------------- next part -------------- > > > > An HTML attachment was scrubbed... > > > > URL: > > > http://www.supercluster.org/pipermail/torqueusers/attachments/20111117/0 > 6e4a798/attachment-0001.html > > > > > > ------------------------------ > > > > Message: 4 > > > > Date: Thu, 17 Nov 2011 10:18:18 -0600 > > > > From: Jason Bacon > > > > Subject: Re: [torqueusers] Parallel processing for MC code > > > > To: Torque Users Mailing List > > > > Message-ID: <4EC533CA.2000902 at tds.net> > > > > Content-Type: text/plain; charset=windows-1252; format=flowed > > > > How many cores does PC1 have? Note that Torque schedules cores, not > > > > computers, unless you specifically tell it to with resource > requirements. > > > > Regards, > > > > -J > > > > On 11/17/11 04:14, RB. Ezhilalan (Principal Physicist, CUH) wrote: > > > > > > > > > > Hi All, > > > > > > > > > > I've been trying to set up Torque queuing system on two SUSE10.1 > linux > > > > > PCs (PIII!). > > > > > > > > > > Installed the linux on both PCs, exported home directory > > > containing > > > > > BEAMnrc montecarlo code from PC1 to PC2 via NFS and set up SSH > > > > > password less communication. All seems to be working fine. > > > > > > > > > > Downloaded latest version of Torque (number not handy) installed > > > > > PBS_SERVER, PBS_MOM & PBS_SCHED on PC1 and PBS_MOM on PC2. > > > > > > > > > > PBS 'nodes' file was created as per guidelines, PBS_SERVER and > > > QUEUE > > > > > attributes were set as default. > > > > > > > > > > Pbsnodes -a command displays- two nodes (PC1 & PC2 and they are > free. > > > > > I am not sure whether this confirms PBS/Torque set up correctly. 
> > > > > > > > > > I was able to run an executable BEAMnrc user code in batch mode > > > i.e > > > > > using 'exb' command aliased to 'qsub' and sources a built in job > > > > > script file with option p=1 (single job). > > > > > > > > > > To split the jobs in to two, so that it runs in parallel on the > > > two > > > > > PCs, option p=2 should be issued. However, what I noticed was, > > > the > job > > > > > ran twice on the first PC (PC1) but not on both. > > > > > > > > > > I can't figure out what went wrong, I suspect PBS setup could > > > have > > > > > some issues, May be I can try running the job specifically on PC2 > > > if > > > > > so what command I need to give? > > > > > > > > > > I would be grateful for any advice! > > > > > > > > > > Kind Regards, > > > > > > > > > > Ezhilalan > > > > > > > > > > > > > > > _______________________________________________ > > > > > torqueusers mailing list > > > > > torqueusers at supercluster.org > > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > -- > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > Jason W. Bacon > > > > jwbacon at tds.net > > > > http://personalpages.tds.net/~jwbacon > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > ------------------------------ > > > > Message: 5 > > > > Date: Thu, 17 Nov 2011 18:19:14 +0100 > > > > From: Steve Traylen > > > > Subject: Re: [torqueusers] File staging syntax > > > > To: Torque Users Mailing List > > > > Message-ID: > > > > > > > > Keywords: CERN SpamKiller Note: -50 > > > > Content-Type: text/plain; charset="ISO-8859-1" > > > > On Thu, Sep 29, 2011 at 4:59 PM, Ken Nielson > > > > wrote> > > > > > André, > > > > > > > > > > I have not yet had time to reproduce this. I did look through the > > change log and there are two suspects. One is in 2.5.6, a fix for > > Bugzilla 115 and the other is in 2.5.8, a fix for Bugzilla 133. > > > > > > > > > > That is as far as I am right now. I will try to get to this as > > > soon > > as I can. 
> > > > Hi Ken, > > > > Did you manage to track this down. It's currently making upgrading > > a > > pain. > > > > Steve. > > > > -- > > > > Steve Traylen > > > > ------------------------------ > > > > _______________________________________________ > > > > torqueusers mailing list > > > > torqueusers at supercluster.org > > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > End of torqueusers Digest, Vol 88, Issue 14 > > > > ******************************************* > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > -- > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Jason W. Bacon > jwbacon at tds.net > http://personalpages.tds.net/~jwbacon > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > > ------------------------------ > > Message: 2 > Date: Fri, 18 Nov 2011 09:33:12 -0500 > From: Lance Westerhoff > Subject: Re: [torqueusers] procs= not working as documented > To: torqueusers at supercluster.org > Message-ID: > Content-Type: text/plain; charset=us-ascii > > The request that is placed is for procs=60. Both torque and maui see > that there are only 53 processors available and instead of letting > the > job sit in the queue and wait for all 60 processors to become > available, > it goes ahead and runs the job with what's available. Now if the user > could ask for procs=[50-60] where 50 is the minimum number of > processors > to provide and 60 is the maximum, this would be a feature. But as it > stands, if the user asks for 60 processors and ends up with 2 > processors, the job just won't scale properly and he may as well kill > it > (when it shouldn't have run anyway). > > I'm actually beginning to think the problem may be related to maui. > Perhaps I'll post this same question to the maui list and see what > comes > back. 
> > This problem is infuriating though since without the functionality > working as it should, using procs=X in torque/maui makes torque/maui > work more like a submission and run system (not a queuing system). > > -Lance > > > > > > Message: 3 > > Date: Thu, 17 Nov 2011 17:29:17 -0800 > > From: "Brock Palen" > > Subject: Re: [torqueusers] procs= not working as documented > > To: "Torque Users Mailing List" > > Message-ID: > > <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> > > Content-Type: text/plain; charset="utf-8" > > > > Does maui only see one cpu or does mpiexec only see one cpu? > > > > > > > > Brock Palen > > (734)936-1985 > > brockp at umich.edu > > - Sent from my Palm Pre, please excuse typos > > On Nov 17, 2011 3:19 PM, Lance Westerhoff > <lance at quantumbioinc.com> wrote: > > > > > > > > Hello All- > > > > > > > > It appears that when running with the following specs, the procs= > option does not actually work as expected. > > > > > > > > ========================================== > > > > > > > > #PBS -S /bin/bash > > > > #PBS -l procs=60 > > > > #PBS -l pmem=700mb > > > > #PBS -l walltime=744:00:00 > > > > #PBS -j oe > > > > #PBS -q batch > > > > > > > > torque version: tried 3.0.2. in v2.5.4, I think the procs option > worked as documented > > > > maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete > fail in terms of the procs option and it only asks for a single CPU) > > > > > > > > ========================================== > > > > > > > > If there are fewer then 60 processors available in the cluster (in > this case there were 53 available) the job will go in an take > whatever > is left instead of waiting for all 60 processors to free up. Any > thoughts as to why this might be happening? Sometimes it doesn't > really > matter and 53 would be almost as good as 60, however if only 2 > processors are available and the user asks for 60, I would hate for > him > to go in. > > > > > > > > Thank you for your time! 
> > > > > > > > -Lance > > > > > > > > > > > > ------------------------------ > > Message: 3 > Date: Fri, 18 Nov 2011 09:47:24 -0500 > From: Steve Crusan > Subject: Re: [torqueusers] procs= not working as documented > To: Torque Users Mailing List > Message-ID: > Content-Type: text/plain; charset=us-ascii > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > > On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: > > > The request that is placed is for procs=60. Both torque and maui > > see > that there are only 53 processors available and instead of letting > the > job sit in the queue and wait for all 60 processors to become > available, > it goes ahead and runs the job with what's available. Now if the user > could ask for procs=[50-60] where 50 is the minimum number of > processors > to provide and 60 is the maximum, this would be a feature. But as it > stands, if the user asks for 60 processors and ends up with 2 > processors, the job just won't scale properly and he may as well kill > it > (when it shouldn't have run anyway). > > Hi Lance, > > Can you post the output of checkjob of an incorrectly > running job. Let's take a look at what Maui thinks the job is asking > for. > > Might as well add your maui.cfg file also. > > I've found in the past that procs= is troublesome... > > > > > I'm actually beginning to think the problem may be related to maui. > Perhaps I'll post this same question to the maui list and see what > comes > back. > > > > This problem is infuriating though since without the functionality > working as it should, using procs=X in torque/maui makes torque/maui > work more like a submission and run system (not a queuing system). > > Agreed. HPC cluster job management is normally be set it and forget > it. > Anything else other than maintenance/break fixes/new features would > be > ridiculously time consuming. 
> > > > > -Lance > > > > > >> > >> Message: 3 > >> Date: Thu, 17 Nov 2011 17:29:17 -0800 > >> From: "Brock Palen" > >> Subject: Re: [torqueusers] procs= not working as documented > >> To: "Torque Users Mailing List" > >> Message-ID: > >> <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> > >> Content-Type: text/plain; charset="utf-8" > >> > >> Does maui only see one cpu or does mpiexec only see one cpu? > >> > >> > >> > >> Brock Palen > >> (734)936-1985 > >> brockp at umich.edu > >> - Sent from my Palm Pre, please excuse typos > >> On Nov 17, 2011 3:19 PM, Lance Westerhoff > <lance at quantumbioinc.com> wrote: > >> > >> > >> > >> Hello All- > >> > >> > >> > >> It appears that when running with the following specs, the procs= > option does not actually work as expected. > >> > >> > >> > >> ========================================== > >> > >> > >> > >> #PBS -S /bin/bash > >> > >> #PBS -l procs=60 > >> > >> #PBS -l pmem=700mb > >> > >> #PBS -l walltime=744:00:00 > >> > >> #PBS -j oe > >> > >> #PBS -q batch > >> > >> > >> > >> torque version: tried 3.0.2. in v2.5.4, I think the procs option > worked as documented > >> > >> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete > fail in terms of the procs option and it only asks for a single CPU) > >> > >> > >> > >> ========================================== > >> > >> > >> > >> If there are fewer then 60 processors available in the cluster (in > this case there were 53 available) the job will go in an take > whatever > is left instead of waiting for all 60 processors to free up. Any > thoughts as to why this might be happening? Sometimes it doesn't > really > matter and 53 would be almost as good as 60, however if only 2 > processors are available and the user asks for 60, I would hate for > him > to go in. > >> > >> > >> > >> Thank you for your time! 
> >> > >> > >> > >> -Lance > >> > >> > >> > >> > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > ---------------------- > Steve Crusan > System Administrator > Center for Research Computing > University of Rochester > https://www.crc.rochester.edu/ > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG/MacGPG2 v2.0.17 (Darwin) > Comment: GPGTools - http://gpgtools.org > > iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ > bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 > cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ > tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 > JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv > Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= > =AGW7 > -----END PGP SIGNATURE----- > > > ------------------------------ > > Message: 4 > Date: Fri, 18 Nov 2011 11:12:06 -0500 > From: Lance Westerhoff > Subject: Re: [torqueusers] procs= not working as documented > To: Torque Users Mailing List > Message-ID: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE at quantumbioinc.com> > Content-Type: text/plain; charset=us-ascii > > > Hi Steve- > > Here you go. Here are the top few lines of the job script. I have then > provided the output you requested along with the maui.cfg. If you need > anything further, certainly please let me know. > > Thanks for your help! 
> > =============== > > + head job.pbs > > #!/bin/bash > #PBS -S /bin/bash > #PBS -l procs=100 > #PBS -l pmem=700mb > #PBS -l walltime=744:00:00 > #PBS -j oe > #PBS -q batch > > Report run on Fri Nov 18 10:49:38 EST 2011 > + pbsnodes --version > version: 3.0.2 > + diagnose --version > maui client version 3.2.6p21 > + checkjob 371010 > > > checking job 371010 > > State: Running > Creds: user:josh group:games class:batch qos:DEFAULT > WallTime: 00:02:35 of 31:00:00:00 > SubmitTime: Fri Nov 18 10:46:33 > (Time Queued Total: 00:00:01 Eligible: 00:00:01) > > StartTime: Fri Nov 18 10:46:34 > Total Tasks: 1 > > Req[0] TaskCount: 26 Partition: DEFAULT > Network: [NONE] Memory >= 700M Disk >= 0 Swap >= 0 > Opsys: [NONE] Arch: [NONE] Features: [NONE] > Dedicated Resources Per Task: PROCS: 1 MEM: 700M > NodeCount: 10 > Allocated Nodes: > [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3] > [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2] > [compute-0-13:2][compute-0-14:2] > > > IWD: [NONE] Executable: [NONE] > Bypass: 0 StartCount: 1 > PartitionMask: [ALL] > Flags: RESTARTABLE > > Reservation '371010' (-00:02:09 -> 30:23:57:51 Duration: > 31:00:00:00) > PE: 26.00 StartPriority: 4716 > > + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]" > SERVERHOST gondor > ADMIN1 maui root > ADMIN3 ALL > RMCFG[base] TYPE=PBS > AMCFG[bank] TYPE=NONE > RMPOLLINTERVAL 00:01:00 > SERVERPORT 42559 > SERVERMODE NORMAL > LOGFILE maui.log > LOGFILEMAXSIZE 10000000 > LOGLEVEL 3 > QUEUETIMEWEIGHT 1 > FSPOLICY DEDICATEDPS > FSDEPTH 7 > FSINTERVAL 86400 > FSDECAY 0.50 > FSWEIGHT 200 > FSUSERWEIGHT 1 > FSGROUPWEIGHT 1000 > FSQOSWEIGHT 1000 > FSACCOUNTWEIGHT 1 > FSCLASSWEIGHT 1000 > USERWEIGHT 4 > BACKFILLPOLICY FIRSTFIT > RESERVATIONPOLICY CURRENTHIGHEST > NODEALLOCATIONPOLICY MINRESOURCE > RESERVATIONDEPTH 8 > MAXJOBPERUSERPOLICY OFF > MAXJOBPERUSERCOUNT 8 > MAXPROCPERUSERPOLICY OFF > MAXPROCPERUSERCOUNT 256 > MAXPROCSECONDPERUSERPOLICY OFF > MAXPROCSECONDPERUSERCOUNT 
36864000 > MAXJOBQUEUEDPERUSERPOLICY OFF > MAXJOBQUEUEDPERUSERCOUNT 2 > JOBNODEMATCHPOLICY EXACTNODE > NODEACCESSPOLICY SHARED > JOBMAXOVERRUN 99:00:00:00 > DEFERCOUNT 8192 > DEFERTIME 0 > CLASSCFG[developer] FSTARGET=40.00+ > CLASSCFG[lowprio] PRIORITY=-1000 > SRCFG[developer] CLASSLIST=developer > SRCFG[developer] ACCESS=dedicated > SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri > SRCFG[developer] STARTTIME=08:00:00 > SRCFG[developer] ENDTIME=18:00:00 > SRCFG[developer] TIMELIMIT=2:00:00 > SRCFG[developer] RESOURCES=PROCS(8) > USERCFG[DEFAULT] FSTARGET=100.0 > > =============== > > -Lance > > > On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote: > > > -----BEGIN PGP SIGNED MESSAGE----- > > Hash: SHA1 > > > > > > On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: > > > >> The request that is placed is for procs=60. Both torque and maui > >> see > that there are only 53 processors available and instead of letting > the > job sit in the queue and wait for all 60 processors to become > available, > it goes ahead and runs the job with what's available. Now if the user > could ask for procs=[50-60] where 50 is the minimum number of > processors > to provide and 60 is the maximum, this would be a feature. But as it > stands, if the user asks for 60 processors and ends up with 2 > processors, the job just won't scale properly and he may as well kill > it > (when it shouldn't have run anyway). > > > > Hi Lance, > > > > Can you post the output of checkjob of an incorrectly > running job. Let's take a look at what Maui thinks the job is asking > for. > > > > Might as well add your maui.cfg file also. > > > > I've found in the past that procs= is troublesome... > > > >> > >> I'm actually beginning to think the problem may be related to > >> maui. > Perhaps I'll post this same question to the maui list and see what > comes > back. 
> >> > >> This problem is infuriating though since without the functionality > working as it should, using procs=X in torque/maui makes torque/maui > work more like a submission and run system (not a queuing system). > > > > Agreed. HPC cluster job management is normally be set it and forget > it. Anything else other than maintenance/break fixes/new features > would > be ridiculously time consuming. > > > >> > >> -Lance > >> > >> > >>> > >>> Message: 3 > >>> Date: Thu, 17 Nov 2011 17:29:17 -0800 > >>> From: "Brock Palen" > >>> Subject: Re: [torqueusers] procs= not working as documented > >>> To: "Torque Users Mailing List" > >>> Message-ID: > >>> <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> > >>> Content-Type: text/plain; charset="utf-8" > >>> > >>> Does maui only see one cpu or does mpiexec only see one cpu? > >>> > >>> > >>> > >>> Brock Palen > >>> (734)936-1985 > >>> brockp at umich.edu > >>> - Sent from my Palm Pre, please excuse typos > >>> On Nov 17, 2011 3:19 PM, Lance Westerhoff > <lance at quantumbioinc.com> wrote: > >>> > >>> > >>> > >>> Hello All- > >>> > >>> > >>> > >>> It appears that when running with the following specs, the procs= > option does not actually work as expected. > >>> > >>> > >>> > >>> ========================================== > >>> > >>> > >>> > >>> #PBS -S /bin/bash > >>> > >>> #PBS -l procs=60 > >>> > >>> #PBS -l pmem=700mb > >>> > >>> #PBS -l walltime=744:00:00 > >>> > >>> #PBS -j oe > >>> > >>> #PBS -q batch > >>> > >>> > >>> > >>> torque version: tried 3.0.2. 
> >>> in v2.5.4, I think the procs option worked as documented
> >>>
> >>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete
> >>> fail in terms of the procs option and it only asks for a single CPU)
> >>>
> >>> ==========================================
> >>>
> >>> If there are fewer than 60 processors available in the cluster (in
> >>> this case there were 53 available) the job will go in and take whatever
> >>> is left instead of waiting for all 60 processors to free up. Any
> >>> thoughts as to why this might be happening? Sometimes it doesn't really
> >>> matter and 53 would be almost as good as 60, however if only 2
> >>> processors are available and the user asks for 60, I would hate for him
> >>> to go in.
> >>>
> >>> Thank you for your time!
> >>>
> >>> -Lance
> >>>
> >>
> >> _______________________________________________
> >> torqueusers mailing list
> >> torqueusers at supercluster.org
> >> http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > ----------------------
> > Steve Crusan
> > System Administrator
> > Center for Research Computing
> > University of Rochester
> > https://www.crc.rochester.edu/
> >
> >
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
> > Comment: GPGTools - http://gpgtools.org
> >
> > iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/
> > bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28
> > cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ
> > tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8
> > JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv
> > Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c=
> > =AGW7
> > -----END PGP SIGNATURE-----
> >
From kenneth at sdsc.edu Mon Nov 28 11:04:53 2011
From: kenneth at sdsc.edu (Kenneth Yoshimoto)
Date: Mon, 28 Nov 2011 10:04:53 -0800 (PST)
Subject: [torqueusers] UC prologue.parallel?
In-Reply-To: <20111111142402.786B383A802B@mail.adaptivecomputing.com>
References: <20111111142402.786B383A802B@mail.adaptivecomputing.com>
Message-ID:

I tried changing to mode 500. Also, tried to touch a file in /tmp to verify that the script ran. No dice. Looks like this never gets called. I'm testing on a single node. Since it's both Mother Superior and a sister, does that mean prologue.parallel is not called?

I did look through the Torque source, and I couldn't see where it gets called, unlike prologue.

On Fri, 11 Nov 2011, Rushton Martin wrote:

> Date: Fri, 11 Nov 2011 14:24:36 -0000
> From: Rushton Martin
> Reply-To: Torque Users Mailing List
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] UC prologue.parallel?
>
> I had some fun with this a while back. prologue.parallel runs on a
> sister node, not on the Mother Superior. Its stdout and stderr are NOT
> connected to the job's stdout and stderr, so any output is lost. You
> need to route the output to a file and then look at the file to see what
> is happening.
> Remember also that prologue.parallel is running as root,
> you need prologue.user.parallel if you want to run as the user. Watch
> the permissions also, 500 for the former and 555 for the latter. If any
> root run pro/epi has other:w, then it will not run.
>
>
> Martin Rushton
> HPC System Manager, Weapons Technologies
> Tel: 01959 514777, Mobile: 07939 219057
> email: jmrushton at QinetiQ.com
> www.QinetiQ.com
> QinetiQ - Delivering customer-focused solutions
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Kenneth
> Yoshimoto
> Sent: 10 November 2011 23:56
> To: torqueusers at supercluster.org
> Subject: [torqueusers] prologue.parallel?
>
>
> I'm running Torque 3.0.2
> I'm trying to use a prologue.parallel, but it doesn't seem to ever get
> invoked. Tried looking in Torque src, but I don't see where it gets
> started. Where is the code that starts it?
>
> Thanks,
> Kenneth
From toth at fi.muni.cz Mon Nov 28 11:45:53 2011
From: toth at fi.muni.cz ("Mgr. Šimon Tóth")
Date: Mon, 28 Nov 2011 19:45:53 +0100
Subject: [torqueusers] UC prologue.parallel?
In-Reply-To:
References: <20111111142402.786B383A802B@mail.adaptivecomputing.com>
Message-ID: <4ED3D6E1.7000607@fi.muni.cz>

> I tried changing to mode 500. Also, tried to touch
> a file in /tmp to verify that the script ran. No dice.
> Looks like this never gets called. I'm testing on a
> single node. Since it's both Mother Superior and a sister,
> does that mean prologue.parallel is not called?

No, prologue.parallel is not called on the master mom. And no, the mom is not both superior and sister, it is only superior.

prologue.parallel is only called on sister nodes.

> I did look through the Torque source, and I couldn't see
> where it gets called, unlike prologue.

src/resmom/mom_comm.c, search for IM_JOIN_JOB

--
Mgr. Simon Toth

From kenneth at sdsc.edu Mon Nov 28 12:27:07 2011
From: kenneth at sdsc.edu (Kenneth Yoshimoto)
Date: Mon, 28 Nov 2011 11:27:07 -0800 (PST)
Subject: [torqueusers] UC prologue.parallel?
In-Reply-To: <4ED3D6E1.7000607@fi.muni.cz>
References: <20111111142402.786B383A802B@mail.adaptivecomputing.com> <4ED3D6E1.7000607@fi.muni.cz>
Message-ID:

Thanks for the pointer to the code!

Appendix G of the admin manual says that Mother Superior is also a sister node wrt prologue and epilogue scripts. Maybe an update to the docs is needed?

http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/a.gprologueepilogue.php

Thanks,
Kenneth

On Mon, 28 Nov 2011, "Mgr. Šimon Tóth" wrote:

> Date: Mon, 28 Nov 2011 19:45:53 +0100
> From: "Mgr. Šimon Tóth"
> To: Kenneth Yoshimoto ,
> Torque Users Mailing List
> Subject: Re: [torqueusers] UC prologue.parallel?
>
>> I tried changing to mode 500.
Also, tried to touch >> a file in /tmp to verify that the script ran. No dice. >> Looks like this never gets called. I'm testing on a >> single node. Since it's both Mother Superior and a sister, >> does that mean prologue.parallel is not called? > > No, prologue.parallel is not called on the master mom. And no, the mom is not > both superior and sister, it is only superior. > > prologue.parallel is only called on sister nodes. > >> I did look through the Torque source, and I couldn't see >> where it gets called, unlike prologue. > > src/resmom/mom_comm.c, search for IM_JOIN_JOB > > From rosmond at reachone.com Mon Nov 28 12:25:28 2011 From: rosmond at reachone.com (Tom Rosmond) Date: Mon, 28 Nov 2011 11:25:28 -0800 Subject: [torqueusers] Help with NUMA support Message-ID: <1322508328.4317.72.camel@cedar.reachone.com> A colleague and I are trying to reconfigure a Linux system with TORQUE NUMA support. Here are some details of the system 1. 48 processor : 'lstopo' output gives 8 NUMA nodes, 6 cores/node. 2. Debian linux running 2.6.32-5-amd64 kernel 3. Open_mpi 1.5.3, configured with 'libnuma' support. We previously had TORQUE successfully configured and running without NUMA support, but this wasn't satisfactory for running multiple MPI jobs concurrently. Here are the steps we have taken: 1. Reconfigured TORQUE with --enable-num-support 2. Created 'mom.layout' in /var/spool/torque/mom_priv with: cpus=0-5 mem=0 cpus=6-11 mem=1 cpus=12-17 mem=2 cpus=18-23 mem=3 cpus=24-29 mem=4 cpus=30-35 mem=5 cpus=36-41 mem=6 cpus=42-47 mem=7 based on the 'lstopo' output. 3. created 'nodes' file in /var/spool/torque/server_priv with: notus np=48 num_numa_nodes=8 where 'notus' is the host name. 4. restarted 'pbs_mom', 'pbs_sched', and 'pbs_server'. 5. submitted MPI jobs with, e.g. '-l nodes=4:ppn=6' for PBS resources and 'mpirun -np 24' for MPI. 
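[Editor's aside] The 'mom.layout' contents in step 2 follow a regular pattern: consecutive blocks of 6 cores per NUMA node, with the memory node index equal to the NUMA node index. As a minimal sketch, assuming that pattern really matches the 'lstopo' output, the lines could be generated rather than typed by hand. The function name `mom_layout_lines` is hypothetical and is not part of TORQUE.

```python
def mom_layout_lines(numa_nodes, cores_per_node):
    """Build mom.layout lines, one per NUMA node.

    Assumption: cores are numbered consecutively, so NUMA node k
    owns cores k*cores_per_node .. (k+1)*cores_per_node - 1 and
    memory node k.
    """
    lines = []
    for node in range(numa_nodes):
        first = node * cores_per_node
        last = first + cores_per_node - 1
        lines.append(f"cpus={first}-{last} mem={node}")
    return lines


if __name__ == "__main__":
    # 8 NUMA nodes with 6 cores each, as described in the system details
    print("\n".join(mom_layout_lines(8, 6)))
```

On a machine where 'lstopo' shows a different core numbering (for example interleaved cores), the consecutive-block assumption would not hold and the ranges would have to be written out by hand.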
With this we are getting the following error messages in the 'sched_logs' file: 11/28/2011 12:10:18;0040; pbs_sched;Job;10.notus.nrl.navy.mil;Not enough of the right type of nodes available 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-0;Can not open connection to mom 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-1;Can not open connection to mom 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-2;Can not open connection to mom 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-3;Can not open connection to mom 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-4;Can not open connection to mom 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-5;Can not open connection to mom 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-6;Can not open connection to mom 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-7;Can not open connection to mom What are we missing? Any suggestions or advice? T. Rosmond From cmr9 at leicester.ac.uk Mon Nov 28 14:45:56 2011 From: cmr9 at leicester.ac.uk (Rudge, Chris M. (Dr.)) Date: Mon, 28 Nov 2011 21:45:56 +0000 Subject: [torqueusers] qstat: End of file Message-ID: <743BF94942C0B649BA31DCD560007C2B01496F8869C0@EXC-MBX1.cfs.le.ac.uk> Hi, We had a problem today running v2.5.5 (which has been running without issue for some time). A user submitted a very large number of jobs today, many of which have dependencies. The number of jobs submitted during the day was >40,000. We use Moab for the scheduler and have maxjobs set to 32768, which has previously been OK, and my understanding is that the excess of jobs simply remain in torque - again I've seen this happen in the past. When torque crashed, an attempt was made to clear out the queued jobs by deleting the .SC and .JB files from server_priv/jobs for the jobs which corresponded to those submitted by the user during the day. The pbs_server process starts correctly and reads in the jobs and targeted qstat commands, such as "qstat -B" or "qstat -u " all work correctly. However, any attempt to run a "qstat -u ..." 
for the user who submitted the large number of jobs or to simply run qstat to display all jobs results in the output "qstat: End of file" and the pbs_server process crashes. Also, letting Moab run as scheduler will cause the pbs_server process to crash when it queries torque for all of the job info. I'm guessing that there's some corruption in the serverdb (but could be wrong about this) given that there's obviously still a record of jobs for the "bad" user given the crash when querying jobs for him. Is this a reasonable conclusion to reach or are there other known reasons for encountering this issue. If it is a corrupted serverdb, given that there are many jobs running and others queued, is there a way to recover the db without losing the running/queued jobs? Regards, Chris Dr Chris Rudge - Research Computing Services Manager IT Services, University of Leicester, LE1 7RH Tel: 0116 2522223 Times Higher Education University of the year 2008/9 From dbeer at adaptivecomputing.com Mon Nov 28 14:53:04 2011 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 28 Nov 2011 14:53:04 -0700 (MST) Subject: [torqueusers] Help with NUMA support In-Reply-To: <1322508328.4317.72.camel@cedar.reachone.com> Message-ID: <65c04562-424d-4d15-90d1-ae9cadd7283a@mail> ----- Original Message ----- > A colleague and I are trying to reconfigure a Linux system with > TORQUE > NUMA support. Here are some details of the system > > 1. 48 processor : 'lstopo' output gives 8 NUMA nodes, 6 cores/node. > > 2. Debian linux running 2.6.32-5-amd64 kernel > > 3. Open_mpi 1.5.3, configured with 'libnuma' support. > > We previously had TORQUE successfully configured and running without > NUMA support, but this wasn't satisfactory for running multiple MPI > jobs > concurrently. Here are the steps we have taken: > > 1. Reconfigured TORQUE with --enable-num-support > > 2. 
Created 'mom.layout' in /var/spool/torque/mom_priv with: > > cpus=0-5 mem=0 > cpus=6-11 mem=1 > cpus=12-17 mem=2 > cpus=18-23 mem=3 > cpus=24-29 mem=4 > cpus=30-35 mem=5 > cpus=36-41 mem=6 > cpus=42-47 mem=7 > > based on the 'lstopo' output. > > 3. created 'nodes' file in /var/spool/torque/server_priv with: > > notus np=48 num_numa_nodes=8 > > where 'notus' is the host name. > > > 4. restarted 'pbs_mom', 'pbs_sched', and 'pbs_server'. > > > 5. submitted MPI jobs with, e.g. '-l nodes=4:ppn=6' for PBS resources > and 'mpirun -np 24' for MPI. > > > With this we are getting the following error messages in the > 'sched_logs' file: > > 11/28/2011 12:10:18;0040; pbs_sched;Job;10.notus.nrl.navy.mil;Not > enough > of the right type of nodes available > 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-0;Can not open > connection > to mom > 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-1;Can not open > connection > to mom > 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-2;Can not open > connection > to mom > 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-3;Can not open > connection > to mom > 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-4;Can not open > connection > to mom > 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-5;Can not open > connection > to mom > 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-6;Can not open > connection > to mom > 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-7;Can not open > connection > to mom > > > What are we missing? Any suggestions or advice? > > T. Rosmond > Tom, Are you running pbs_sched? pbs_sched has not been updated to support NUMA scheduling, and there are currently no plans to update it in order to make that happen. This must be disappointing but I'm sure you understand that we cannot do development that competes with the scheduler our company sells. I'm afraid you're going to need to purchase a scheduler that can handle this kind of hardware. 
--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606

From r.oostenveld at donders.ru.nl Tue Nov 29 01:40:24 2011
From: r.oostenveld at donders.ru.nl (Robert Oostenveld)
Date: Tue, 29 Nov 2011 09:40:24 +0100
Subject: [torqueusers] qrerun fails due to Unauthorized Request
In-Reply-To: <4472098e-f48e-4cd1-8c7c-a244965abdb0@mail>
References: <4472098e-f48e-4cd1-8c7c-a244965abdb0@mail>
Message-ID: <0DD9EF59-77E7-4E2F-A39E-7D8CD2795CDA@donders.ru.nl>

Dear David and Glen,

Thanks for your replies. Let me briefly comment on some aspects.

On 21 Nov 2011, at 23:52, David Beer wrote:

> This may not interest you, but there are schedulers (such as Moab) that manage licences for you and would prevent this case.

I looked at the floating resources in Moab, but it does not seem to be smart enough to know that a single user on a single multi-core computer can start multiple MATLAB instances while requiring only a single license. So the resource actually required is one license for all jobs if they all run on a single node, whereas it is one license per job if they run on separate nodes. Since we run many MATLABs in parallel on our 48-core nodes, this distinction is rather important: it is the difference between scheduling for 1 license versus 48 licenses. One of the reasons for having these 48-core nodes is that they allow us to minimize license usage.

I have also attempted to use dynamic consumable generic resources, but http://www.clusterresources.com/bugzilla/show_bug.cgi?id=118 is stopping me from doing so.

On 21 Nov 2011, at 23:52, David Beer wrote:

> AFAIK, the way it is intended (this was done a long time before I began working on TORQUE, possibly before even Adaptive/CR did) is that the client commands go in bin and the commands that can't ever be run by users go in the sbin directory. qrun also is in the bin directory, even though it also cannot be run unless you are a manager.
On 22 Nov 2011, at 1:13, glen.beane at gmail.com wrote:

> /usr/local/bin is appropriate in my opinion. A torque manager/operator does not have to be a super user and as such the commands do not belong in sbin

The bin location makes sense on the torque server. But I still don't see why commands that cannot be run on a client (not even by an administrator) would get installed along with the "torque-package-clients".

best regards,
Robert

From roman.ricardo at gmail.com Tue Nov 29 08:45:08 2011
From: roman.ricardo at gmail.com (Ricardo Román Brenes)
Date: Tue, 29 Nov 2011 09:45:08 -0600
Subject: [torqueusers] Fwd: Cluster Questions
In-Reply-To:
References:
Message-ID:

Hello everyone, thanks for taking the time to read this long post :P

The question is about multiple queues with Torque:

We have different cluster nodes with different architectures here:
4 PS-3
3 CPU+GPU
2 CPU

and I want to be able to send jobs to each of the nodes independently (using Torque). I'm guessing that having several queues, with each node belonging to one queue in particular, and then submitting jobs to that queue will do the trick:

say I have 4 queues
IBMCELL with the 4 PS-3
TESLA with the 3 nodes that have GPUs
XEON with the 5 nodes that have Xeons (3 of which also have Teslas :P)

and when I submit a job:
qsub -q IBMCELL a.pbs
it should run on the PS-3 only, but I'm not able to make it work like that.

As a test I made 2 queues in the PS3 pbs_server ("uno" and "dos"):

> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue uno
> #
> create queue uno
> set queue uno queue_type = Execution
> set queue uno acl_host_enable = False
> set queue uno acl_hosts = zarate-0+zarate-1
> set queue uno enabled = True
> set queue uno started = True
> #
> # Create and define queue dos
> #
> create queue dos
> set queue dos queue_type = Execution
> set queue dos acl_host_enable = False
> set queue dos acl_hosts = zarate-2+zarate-3
> set queue dos enabled = True
> set queue dos started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = zarate-0
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server next_job_number = 22

and I changed the *nodes* file in the server_priv directory so it looks like this (zarate is just the hostname :P):

> zarate-0 np=2 uno
> zarate-1 np=2 uno
> zarate-2 np=2 dos
> zarate-3 np=2 dos

but it's not working... when I launch a job:

> #PBS -N mpi_hello
> /usr/local/bin/mpiexec -n 8 /home/rroman/a.out

with the command:

#PBS -N mpi_hello
/usr/local/bin/mpiexec -n 8 /home/rroman/a.out

the output file is:

> zarate-1: hello world from process 2 of 8
> zarate-2: hello world from process 5 of 8
> zarate-2: hello world from process 6 of 8
> zarate-3: hello world from process 0 of 8
> zarate-3: hello world from process 7 of 8
> zarate-1: hello world from process 3 of 8
> zarate-0: hello world from process 4 of 8
> zarate-3: hello world from process 1 of 8

And there it shows that the job is running on ALL the nodes instead of running only on zarate-0 and zarate-1 as the queue said (according to me :P)

SO! The question is: is it possible to do what I want like this? And if so, what am I doing wrong? :P

Thank you Kay!

-ricardo

From siegert at sfu.ca Wed Nov 30 12:14:23 2011
From: siegert at sfu.ca (Martin Siegert)
Date: Wed, 30 Nov 2011 11:14:23 -0800
Subject: [torqueusers] cpusets
Message-ID: <20111130191423.GD26428@stikine.sfu.ca>

Hi,

we just recently started using cpusets and I do not have much experience
with them. However, by now I noticed several times that MPI jobs
(openmpi with TM) slow down dramatically: apparently two processes
are using the same core (i.e., both only get 50% cpu usage) even though
the number of cores in the cpuset equals the number of processes
of the mpi job on the particular node.

E.g.,

top - 11:05:24 up 42 days, 22:43,  2 users,  load average: 6.99, 6.93, 6.68
Tasks: 468 total,   8 running, 460 sleeping,   0 stopped,   0 zombie
Cpu(s): 24.9%us,  0.2%sy,  0.0%ni, 74.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24675188k total, 12099684k used, 12575504k free,    69968k buffers
Swap: 16777208k total,    29932k used, 16747276k free,  9946292k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+  COMMAND
 3717 user1     25   0  183m  91m  14m R 100.0  0.4  15:43.62 Clark
 4526 user2     25   0  109m  36m 3088 R 100.0  0.2   2:02.43 mdrun
15863 user3     25   0  459m 163m  15m R 100.0  0.7 711:26.30 wrfm_arw.exe
15864 user3     25   0  452m 156m  15m R 100.0  0.6 688:28.80 wrfm_arw.exe
 4562 user2     25   0  109m  36m 3088 R  99.7  0.2   0:23.02 mdrun
15861 user3     25   0  462m 165m  15m R  50.2  0.7 510:02.12 wrfm_arw.exe
15862 user3     25   0  465m 169m  15m R  49.9  0.7 446:21.37 wrfm_arw.exe

root at b311:~> cat /proc/15861/cpuset
/torque/4913985.b0
root at b311:~> cat /proc/15862/cpuset
/torque/4913985.b0

(same for 15863, 15864) and

root at b311:~> ls /dev/cpuset//torque/4913985.b0
68 cpu_exclusive   memory_pressure     notify_on_release
69 cpus            memory_spread_page  sched_relax_domain_level
70 mem_exclusive   memory_spread_slab  tasks
71 memory_migrate  mems
root at b311:~> cat /dev/cpuset/torque/4913985.b0/cpus
0-1,4,8

Do processes within a cpuset get bound to
a particular cpu?
If yes, how do I find out which one?

Anyway, if you have an idea what could be causing this and how to
solve this problem, please let me know.

Thanks!

Cheers,
Martin

--
Martin Siegert
Simon Fraser University

From david at unistra.fr Wed Nov 30 12:31:35 2011
From: david at unistra.fr (R. David)
Date: Wed, 30 Nov 2011 20:31:35 +0100
Subject: [torqueusers] cpusets
In-Reply-To: <20111130191423.GD26428@stikine.sfu.ca>
References: <20111130191423.GD26428@stikine.sfu.ca>
Message-ID:

Hello,

In the openmpi configuration file openmpi-mca-params.conf, just add:

mpi_paffinity_alone = 1

On 30 Nov 2011, at 20:14, Martin Siegert wrote:

> Hi,
>
> we just recently started using cpusets and I do not have much experience
> with them. However, by now I noticed several times that MPI jobs
> (openmpi with TM) slow down dramatically: apparently two processes
> are using the same core (i.e., both only get 50% cpu usage) even though
> the number of cores in the cpuset equals the number of processes
> of the mpi job on the particular node.
> > E.g., > > top - 11:05:24 up 42 days, 22:43, 2 users, load average: 6.99, 6.93, 6.68 > Tasks: 468 total, 8 running, 460 sleeping, 0 stopped, 0 zombie > Cpu(s): 24.9%us, 0.2%sy, 0.0%ni, 74.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Mem: 24675188k total, 12099684k used, 12575504k free, 69968k buffers > Swap: 16777208k total, 29932k used, 16747276k free, 9946292k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 3717 user1 25 0 183m 91m 14m R 100.0 0.4 15:43.62 Clark > 4526 user2 25 0 109m 36m 3088 R 100.0 0.2 2:02.43 mdrun > 15863 user3 25 0 459m 163m 15m R 100.0 0.7 711:26.30 wrfm_arw.exe > 15864 user3 25 0 452m 156m 15m R 100.0 0.6 688:28.80 wrfm_arw.exe > 4562 user2 25 0 109m 36m 3088 R 99.7 0.2 0:23.02 mdrun > 15861 user3 25 0 462m 165m 15m R 50.2 0.7 510:02.12 wrfm_arw.exe > 15862 user3 25 0 465m 169m 15m R 49.9 0.7 446:21.37 wrfm_arw.exe > > root at b311:~> cat /proc/15861/cpuset > /torque/4913985.b0 > root at b311:~> cat /proc/15862/cpuset > /torque/4913985.b0 > > (same for 15863, 15864) and > > root at b311:~> ls /dev/cpuset//torque/4913985.b0 > 68 cpu_exclusive memory_pressure notify_on_release > 69 cpus memory_spread_page sched_relax_domain_level > 70 mem_exclusive memory_spread_slab tasks > 71 memory_migrate mems > root at b311:~> cat /dev/cpuset/torque/4913985.b0/cpus > 0-1,4,8 > > Do processes within a cpuset get bound to a particular cpu? > If yes, how do I find out which one? > > Anyway, if you have na idea what could be causing this and how to > solve this problem, please let me know. > > Thanks! > > Cheers, > Martin > > -- > Martin Siegert > Simon Fraser University > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers --------------------------------------------------------- R. David - david at unistra.fr Responsable du meso-centre UdS / Direction Informatique Tel. 
: 03 68 85 45 48
---------------------------------------------------------

From roman.ricardo at gmail.com Wed Nov 30 12:37:41 2011
From: roman.ricardo at gmail.com (Ricardo Román Brenes)
Date: Wed, 30 Nov 2011 13:37:41 -0600
Subject: [torqueusers] specific nodes
Message-ID:

Hello everyone, thanks for taking the time to read this long post :P

The question is about multiple queues with Torque:

We have different cluster nodes with different architectures here:
4 PS-3
3 CPU+GPU
2 CPU

and I want to be able to send jobs to each of the nodes independently (using Torque). I'm guessing that having several queues, with each node belonging to one queue in particular, and then submitting jobs to that queue will do the trick:

say I have 4 queues
IBMCELL with the 4 PS-3
TESLA with the 3 nodes that have GPUs
XEON with the 5 nodes that have Xeons (3 of which also have Teslas :P)

and when I submit a job:
qsub -q IBMCELL a.pbs
it should run on the PS-3 only, but I'm not able to make it work like that.

As a test I made 2 queues in the PS3 pbs_server ("uno" and "dos"):

> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue uno
> #
> create queue uno
> set queue uno queue_type = Execution
> set queue uno acl_host_enable = False
> set queue uno acl_hosts = zarate-0+zarate-1
> set queue uno enabled = True
> set queue uno started = True
> #
> # Create and define queue dos
> #
> create queue dos
> set queue dos queue_type = Execution
> set queue dos acl_host_enable = False
> set queue dos acl_hosts = zarate-2+zarate-3
> set queue dos enabled = True
> set queue dos started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = zarate-0
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server next_job_number = 22

and I changed the *nodes* file in the server_priv directory so it looks like this (zarate is just the hostname :P):

> zarate-0 np=2 uno
> zarate-1 np=2 uno
> zarate-2 np=2 dos
> zarate-3 np=2 dos

but it's not working... when I launch a job:

> #PBS -N mpi_hello
> /usr/local/bin/mpiexec -n 8 /home/rroman/a.out

with the command:

#PBS -N mpi_hello
/usr/local/bin/mpiexec -n 8 /home/rroman/a.out

the output file is:

> zarate-1: hello world from process 2 of 8
> zarate-2: hello world from process 5 of 8
> zarate-2: hello world from process 6 of 8
> zarate-3: hello world from process 0 of 8
> zarate-3: hello world from process 7 of 8
> zarate-1: hello world from process 3 of 8
> zarate-0: hello world from process 4 of 8
> zarate-3: hello world from process 1 of 8

And there it shows that the job is running on ALL the nodes instead of running only on zarate-0 and zarate-1 as the queue said (according to me :P)

SO! The question is: is it possible to do what I want like this? And if so, what am I doing wrong? :P

Thank you Kay!

-ricardo

From gus at ldeo.columbia.edu Wed Nov 30 12:46:19 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Wed, 30 Nov 2011 14:46:19 -0500
Subject: [torqueusers] cpusets
In-Reply-To: <20111130191423.GD26428@stikine.sfu.ca>
References: <20111130191423.GD26428@stikine.sfu.ca>
Message-ID: <00F1C9D3-733B-476E-95AC-23F075C3983A@ldeo.columbia.edu>

Hi Martin

For what it is worth, we use OpenMPI with Torque (TM) support, where Torque is configured with cpusets.
As far as I know, Torque doesn't bind processes to CPUs.
Hence, we use the OpenMPI '-mca mpi_paffinity_alone 1 ' flag in the mpiexec command to do that. I think '-bind-to-core' does pretty much the same, see mpiexec man page. Whether this speeds up or not the code may depend on the code itself, I guess. Also, I guess it may get tricky if you are sharing the node among several jobs, particularly if processes from different jobs share the different CPU sockets. [We use single job node policy in Maui.] Torque has also a configuration feature '--enable-geometry-requests' that *may* be used to address this issue. Gus Correa On Nov 30, 2011, at 2:14 PM, Martin Siegert wrote: > Hi, > > we just recently started using cpusets and I do not have much experience > with them. However, by now I noticed several times that MPI jobs > (openmpi with TM) slow down dramatically: apparently two processes > are using the same core (i.e., both only get 50% cpu usage) even though > the number of cores in the cpuset equals the number of processes > of the mpi job on the particular node. 
> > E.g., > > top - 11:05:24 up 42 days, 22:43, 2 users, load average: 6.99, 6.93, 6.68 > Tasks: 468 total, 8 running, 460 sleeping, 0 stopped, 0 zombie > Cpu(s): 24.9%us, 0.2%sy, 0.0%ni, 74.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st > Mem: 24675188k total, 12099684k used, 12575504k free, 69968k buffers > Swap: 16777208k total, 29932k used, 16747276k free, 9946292k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 3717 user1 25 0 183m 91m 14m R 100.0 0.4 15:43.62 Clark > 4526 user2 25 0 109m 36m 3088 R 100.0 0.2 2:02.43 mdrun > 15863 user3 25 0 459m 163m 15m R 100.0 0.7 711:26.30 wrfm_arw.exe > 15864 user3 25 0 452m 156m 15m R 100.0 0.6 688:28.80 wrfm_arw.exe > 4562 user2 25 0 109m 36m 3088 R 99.7 0.2 0:23.02 mdrun > 15861 user3 25 0 462m 165m 15m R 50.2 0.7 510:02.12 wrfm_arw.exe > 15862 user3 25 0 465m 169m 15m R 49.9 0.7 446:21.37 wrfm_arw.exe > > root at b311:~> cat /proc/15861/cpuset > /torque/4913985.b0 > root at b311:~> cat /proc/15862/cpuset > /torque/4913985.b0 > > (same for 15863, 15864) and > > root at b311:~> ls /dev/cpuset//torque/4913985.b0 > 68 cpu_exclusive memory_pressure notify_on_release > 69 cpus memory_spread_page sched_relax_domain_level > 70 mem_exclusive memory_spread_slab tasks > 71 memory_migrate mems > root at b311:~> cat /dev/cpuset/torque/4913985.b0/cpus > 0-1,4,8 > > Do processes within a cpuset get bound to a particular cpu? > If yes, how do I find out which one? > > Anyway, if you have na idea what could be causing this and how to > solve this problem, please let me know. > > Thanks! 
> > Cheers, > Martin > > -- > Martin Siegert > Simon Fraser University > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From lloyd_brown at byu.edu Wed Nov 30 12:52:12 2011 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Wed, 30 Nov 2011 12:52:12 -0700 Subject: [torqueusers] specific nodes In-Reply-To: References: Message-ID: <4ED6896C.5060808@byu.edu> Ricardo, Have you seen section 4.1.4 ("Mapping a Queue to a Subset of Resources") in the Torque documentation? It might give you some ideas. However, the short answer to your question, as seen in that section is this: > TORQUE does not currently provide a simple mechanism for mapping queues to nodes. However, schedulers such as Moab and Maui can provide this functionality. Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 11/30/2011 12:37 PM, Ricardo Rom?n Brenes wrote: > Hello everyone thanks for the time of reading and the long post :P > > > The question is about multiple queues with Torque: > > > We have here different clusternodes with difrente architectures: > 4 PS-3 > 3 CPU+GPU > 2 CPU > > and i want to be able to send jobs to each of hte nodes independly > (using torque). Im guessing that having several queues and that each > node belonging to a queue in particular and then submittint jobs to that > queue will do the trick: > > say i got 4 queues > IBMCELL with the 4 PS-3 > TESLA with the 3 nodes that have GPUs > XEON with te 5 nodes that have Xeons (which in turn 3 of them have > teslas :P) > > and when i submit a job: > qsub -q IBMCELL a.pbs > should run on the PS-3 only, but im not being able to make it work like > that. > > As a test i made 2 queues in the PS3 pbs_server ("uno" and "dos"): > > # > # Create queues and set their attributes. 
> # > # > # Create and define queue uno > # > *create queue uno > **set queue uno queue_type = Execution > **set queue uno acl_host_enable = False > **set queue uno acl_hosts = zarate-0+zarate-1 > **set queue uno enabled = True > **set queue uno started = True > *# > # Create and define queue dos > # > *create queue dos > **set queue dos queue_type = Execution > **set queue dos acl_host_enable = **False** > **set queue dos acl_hosts = zarate-2+zarate-3 > **set queue dos enabled = True > **set queue dos started = True > *# > # Set server attributes. > # > set server scheduling = True > set server acl_hosts = zarate-0 > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server next_job_number = 22 > > > and i changed the _nodes_ file in the server_priv directory so it is > like this (zarate are just the hostname :P): > > > zarate-0 np=2 uno > zarate-1 np=2 uno > zarate-2 np=2 dos > zarate-3 np=2 dos > > > > but its not working... when i launch a job: > > #PBS -N mpi_hello > /usr/local/bin/mpiexec -n 8 /home/rroman/a.out > > > with teh command: > > #PBS -N mpi_hello > > /usr/local/bin/mpiexec -n 8 /home/rroman/a.out > > > the output file is: > > zarate-1: hello world from process 2 of 8 > zarate-2: hello world from process 5 of 8 > zarate-2: hello world from process 6 of 8 > zarate-3: hello world from process 0 of 8 > zarate-3: hello world from process 7 of 8 > zarate-1: hello world from process 3 of 8 > zarate-0: hello world from process 4 of 8 > zarate-3: hello world from process 1 of 8 > > > > And there it shows that the job is running in ALL the nodes instead of > running only in zarate-0 and zarate-1 as the queue said (according to me :P) > > > > > SO! the question is: is it possible to do waht i want like this? and if > so, what am i doing wrong! :P > > Thank you Kay! 
> > -ricardo > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From roman.ricardo at gmail.com Wed Nov 30 12:56:17 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Wed, 30 Nov 2011 13:56:17 -0600 Subject: [torqueusers] specific nodes In-Reply-To: <4ED6896C.5060808@byu.edu> References: <4ED6896C.5060808@byu.edu> Message-ID: so wrong mailing list huh? sorry to bother thanks for your time On Wed, Nov 30, 2011 at 1:52 PM, Lloyd Brown wrote: > Ricardo, > > Have you seen section 4.1.4 ("Mapping a Queue to a Subset of Resources") > in the Torque documentation? It might give you some ideas. However, > the short answer to your question, as seen in that section is this: > > > TORQUE does not currently provide a simple mechanism for mapping queues > to nodes. However, schedulers such as Moab and Maui can provide this > functionality. > > > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > > > On 11/30/2011 12:37 PM, Ricardo Rom?n Brenes wrote: > > Hello everyone thanks for the time of reading and the long post :P > > > > > > The question is about multiple queues with Torque: > > > > > > We have here different clusternodes with difrente architectures: > > 4 PS-3 > > 3 CPU+GPU > > 2 CPU > > > > and i want to be able to send jobs to each of hte nodes independly > > (using torque). 
Im guessing that having several queues and that each > > node belonging to a queue in particular and then submittint jobs to that > > queue will do the trick: > > > > say i got 4 queues > > IBMCELL with the 4 PS-3 > > TESLA with the 3 nodes that have GPUs > > XEON with te 5 nodes that have Xeons (which in turn 3 of them have > > teslas :P) > > > > and when i submit a job: > > qsub -q IBMCELL a.pbs > > should run on the PS-3 only, but im not being able to make it work like > > that. > > > > As a test i made 2 queues in the PS3 pbs_server ("uno" and "dos"): > > > > # > > # Create queues and set their attributes. > > # > > # > > # Create and define queue uno > > # > > *create queue uno > > **set queue uno queue_type = Execution > > **set queue uno acl_host_enable = False > > **set queue uno acl_hosts = zarate-0+zarate-1 > > **set queue uno enabled = True > > **set queue uno started = True > > *# > > # Create and define queue dos > > # > > *create queue dos > > **set queue dos queue_type = Execution > > **set queue dos acl_host_enable = **False** > > **set queue dos acl_hosts = zarate-2+zarate-3 > > **set queue dos enabled = True > > **set queue dos started = True > > *# > > # Set server attributes. > > # > > set server scheduling = True > > set server acl_hosts = zarate-0 > > set server log_events = 511 > > set server mail_from = adm > > set server scheduler_iteration = 600 > > set server node_check_rate = 150 > > set server tcp_timeout = 6 > > set server next_job_number = 22 > > > > > > and i changed the _nodes_ file in the server_priv directory so it is > > like this (zarate are just the hostname :P): > > > > > > zarate-0 np=2 uno > > zarate-1 np=2 uno > > zarate-2 np=2 dos > > zarate-3 np=2 dos > > > > > > > > but its not working... 
when i launch a job: > > > > #PBS -N mpi_hello > > /usr/local/bin/mpiexec -n 8 /home/rroman/a.out > > > > > > with teh command: > > > > #PBS -N mpi_hello > > > > /usr/local/bin/mpiexec -n 8 /home/rroman/a.out > > > > > > the output file is: > > > > zarate-1: hello world from process 2 of 8 > > zarate-2: hello world from process 5 of 8 > > zarate-2: hello world from process 6 of 8 > > zarate-3: hello world from process 0 of 8 > > zarate-3: hello world from process 7 of 8 > > zarate-1: hello world from process 3 of 8 > > zarate-0: hello world from process 4 of 8 > > zarate-3: hello world from process 1 of 8 > > > > > > > > And there it shows that the job is running in ALL the nodes instead of > > running only in zarate-0 and zarate-1 as the queue said (according to me > :P) > > > > > > > > > > SO! the question is: is it possible to do waht i want like this? and if > > so, what am i doing wrong! :P > > > > Thank you Kay! > > > > -ricardo > > > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111130/3d8138f6/attachment.html From lloyd_brown at byu.edu Wed Nov 30 13:01:56 2011 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Wed, 30 Nov 2011 13:01:56 -0700 Subject: [torqueusers] specific nodes In-Reply-To: References: <4ED6896C.5060808@byu.edu> Message-ID: <4ED68BB4.5030105@byu.edu> Not so much the wrong mailing list, but the wrong product. In the end Torque is really about resource management, launching jobs, etc., but not the decision making. 
They happen to include a very basic scheduler ("pbs_sched"), but it's very, very basic. If you want anything more, you're going to have to look at Moab or Maui, to use with Torque. Or there are other scheduling systems out there as well, that don't use Torque. For such a small/simple cluster, I'd recommend Torque with Maui, but you'll have to do some investigation. Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu On 11/30/2011 12:56 PM, Ricardo Rom?n Brenes wrote: > so wrong mailing list huh? > > sorry to bother > > thanks for your time > > On Wed, Nov 30, 2011 at 1:52 PM, Lloyd Brown > wrote: > > Ricardo, > > Have you seen section 4.1.4 ("Mapping a Queue to a Subset of Resources") > in the Torque documentation? It might give you some ideas. However, > the short answer to your question, as seen in that section is this: > > > TORQUE does not currently provide a simple mechanism for mapping > queues to nodes. However, schedulers such as Moab and Maui can > provide this functionality. > > > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > > > On 11/30/2011 12:37 PM, Ricardo Rom?n Brenes wrote: > > Hello everyone thanks for the time of reading and the long post :P > > > > > > The question is about multiple queues with Torque: > > > > > > We have here different clusternodes with difrente architectures: > > 4 PS-3 > > 3 CPU+GPU > > 2 CPU > > > > and i want to be able to send jobs to each of hte nodes independly > > (using torque). 
Im guessing that having several queues and that each > > node belonging to a queue in particular and then submittint jobs > to that > > queue will do the trick: > > > > say i got 4 queues > > IBMCELL with the 4 PS-3 > > TESLA with the 3 nodes that have GPUs > > XEON with te 5 nodes that have Xeons (which in turn 3 of them have > > teslas :P) > > > > and when i submit a job: > > qsub -q IBMCELL a.pbs > > should run on the PS-3 only, but im not being able to make it work > like > > that. > > > > As a test i made 2 queues in the PS3 pbs_server ("uno" and "dos"): > > > > # > > # Create queues and set their attributes. > > # > > # > > # Create and define queue uno > > # > > *create queue uno > > **set queue uno queue_type = Execution > > **set queue uno acl_host_enable = False > > **set queue uno acl_hosts = zarate-0+zarate-1 > > **set queue uno enabled = True > > **set queue uno started = True > > *# > > # Create and define queue dos > > # > > *create queue dos > > **set queue dos queue_type = Execution > > **set queue dos acl_host_enable = **False** > > **set queue dos acl_hosts = zarate-2+zarate-3 > > **set queue dos enabled = True > > **set queue dos started = True > > *# > > # Set server attributes. > > # > > set server scheduling = True > > set server acl_hosts = zarate-0 > > set server log_events = 511 > > set server mail_from = adm > > set server scheduler_iteration = 600 > > set server node_check_rate = 150 > > set server tcp_timeout = 6 > > set server next_job_number = 22 > > > > > > and i changed the _nodes_ file in the server_priv directory so it is > > like this (zarate are just the hostname :P): > > > > > > zarate-0 np=2 uno > > zarate-1 np=2 uno > > zarate-2 np=2 dos > > zarate-3 np=2 dos > > > > > > > > but its not working... 
when i launch a job: > > > > #PBS -N mpi_hello > > /usr/local/bin/mpiexec -n 8 /home/rroman/a.out > > > > > > with teh command: > > > > #PBS -N mpi_hello > > > > /usr/local/bin/mpiexec -n 8 /home/rroman/a.out > > > > > > the output file is: > > > > zarate-1: hello world from process 2 of 8 > > zarate-2: hello world from process 5 of 8 > > zarate-2: hello world from process 6 of 8 > > zarate-3: hello world from process 0 of 8 > > zarate-3: hello world from process 7 of 8 > > zarate-1: hello world from process 3 of 8 > > zarate-0: hello world from process 4 of 8 > > zarate-3: hello world from process 1 of 8 > > > > > > > > And there it shows that the job is running in ALL the nodes instead of > > running only in zarate-0 and zarate-1 as the queue said (according > to me :P) > > > > > > > > > > SO! the question is: is it possible to do waht i want like this? > and if > > so, what am i doing wrong! :P > > > > Thank you Kay! > > > > -ricardo > > > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From roman.ricardo at gmail.com Wed Nov 30 13:08:51 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Wed, 30 Nov 2011 14:08:51 -0600 Subject: [torqueusers] specific nodes In-Reply-To: <4ED68BB4.5030105@byu.edu> References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> Message-ID: Well I am using torque+maui but even so i cant get the maui to assign the nodes correctly; a job just runs on all nodes not just the ones i want ... 
On Wed, Nov 30, 2011 at 2:01 PM, Lloyd Brown wrote: > Not so much the wrong mailing list, but the wrong product. In the end > Torque is really about resource management, launching jobs, etc., but > not the decision making. They happen to include a very basic scheduler > ("pbs_sched"), but it's very, very basic. If you want anything more, > you're going to have to look at Moab or Maui, to use with Torque. Or > there are other scheduling systems out there as well, that don't use > Torque. > > For such a small/simple cluster, I'd recommend Torque with Maui, but > you'll have to do some investigation. > > > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > > > On 11/30/2011 12:56 PM, Ricardo Rom?n Brenes wrote: > > so wrong mailing list huh? > > > > sorry to bother > > > > thanks for your time > > > > On Wed, Nov 30, 2011 at 1:52 PM, Lloyd Brown > > wrote: > > > > Ricardo, > > > > Have you seen section 4.1.4 ("Mapping a Queue to a Subset of > Resources") > > in the Torque documentation? It might give you some ideas. However, > > the short answer to your question, as seen in that section is this: > > > > > TORQUE does not currently provide a simple mechanism for mapping > > queues to nodes. However, schedulers such as Moab and Maui can > > provide this functionality. > > > > > > Lloyd Brown > > Systems Administrator > > Fulton Supercomputing Lab > > Brigham Young University > > http://marylou.byu.edu > > > > > > > > On 11/30/2011 12:37 PM, Ricardo Rom?n Brenes wrote: > > > Hello everyone thanks for the time of reading and the long post :P > > > > > > > > > The question is about multiple queues with Torque: > > > > > > > > > We have here different clusternodes with difrente architectures: > > > 4 PS-3 > > > 3 CPU+GPU > > > 2 CPU > > > > > > and i want to be able to send jobs to each of hte nodes independly > > > (using torque). 
Im guessing that having several queues and that > each > > > node belonging to a queue in particular and then submittint jobs > > to that > > > queue will do the trick: > > > > > > say i got 4 queues > > > IBMCELL with the 4 PS-3 > > > TESLA with the 3 nodes that have GPUs > > > XEON with te 5 nodes that have Xeons (which in turn 3 of them have > > > teslas :P) > > > > > > and when i submit a job: > > > qsub -q IBMCELL a.pbs > > > should run on the PS-3 only, but im not being able to make it work > > like > > > that. > > > > > > As a test i made 2 queues in the PS3 pbs_server ("uno" and "dos"): > > > > > > # > > > # Create queues and set their attributes. > > > # > > > # > > > # Create and define queue uno > > > # > > > *create queue uno > > > **set queue uno queue_type = Execution > > > **set queue uno acl_host_enable = False > > > **set queue uno acl_hosts = zarate-0+zarate-1 > > > **set queue uno enabled = True > > > **set queue uno started = True > > > *# > > > # Create and define queue dos > > > # > > > *create queue dos > > > **set queue dos queue_type = Execution > > > **set queue dos acl_host_enable = **False** > > > **set queue dos acl_hosts = zarate-2+zarate-3 > > > **set queue dos enabled = True > > > **set queue dos started = True > > > *# > > > # Set server attributes. > > > # > > > set server scheduling = True > > > set server acl_hosts = zarate-0 > > > set server log_events = 511 > > > set server mail_from = adm > > > set server scheduler_iteration = 600 > > > set server node_check_rate = 150 > > > set server tcp_timeout = 6 > > > set server next_job_number = 22 > > > > > > > > > and i changed the _nodes_ file in the server_priv directory so it > is > > > like this (zarate are just the hostname :P): > > > > > > > > > zarate-0 np=2 uno > > > zarate-1 np=2 uno > > > zarate-2 np=2 dos > > > zarate-3 np=2 dos > > > > > > > > > > > > but its not working... 
when i launch a job: > > > > > > #PBS -N mpi_hello > > > /usr/local/bin/mpiexec -n 8 /home/rroman/a.out > > > > > > > > > with teh command: > > > > > > #PBS -N mpi_hello > > > > > > /usr/local/bin/mpiexec -n 8 /home/rroman/a.out > > > > > > > > > the output file is: > > > > > > zarate-1: hello world from process 2 of 8 > > > zarate-2: hello world from process 5 of 8 > > > zarate-2: hello world from process 6 of 8 > > > zarate-3: hello world from process 0 of 8 > > > zarate-3: hello world from process 7 of 8 > > > zarate-1: hello world from process 3 of 8 > > > zarate-0: hello world from process 4 of 8 > > > zarate-3: hello world from process 1 of 8 > > > > > > > > > > > > And there it shows that the job is running in ALL the nodes > instead of > > > running only in zarate-0 and zarate-1 as the queue said (according > > to me :P) > > > > > > > > > > > > > > > SO! the question is: is it possible to do waht i want like this? > > and if > > > so, what am i doing wrong! :P > > > > > > Thank you Kay! > > > > > > -ricardo > > > > > > > > > > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111130/896926bf/attachment.html From roman.ricardo at gmail.com Wed Nov 30 13:15:30 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Wed, 30 Nov 2011 14:15:30 -0600 Subject: [torqueusers] different config shows in root and users accounts Message-ID: hi guys. So i have this Torque running and when i print the server as root i get: [root at zarate-0 ~]# qmgr -c "p s" # # Create queues and set their attributes. # # # Create and define queue uno # create queue uno set queue uno queue_type = Execution set queue uno acl_host_enable = False set queue uno acl_hosts = zarate-1 set queue uno acl_hosts += zarate-0 set queue uno resources_default.neednodes = uno <---------------- THIS LINE set queue uno resources_default.nodes = 1:uno set queue uno enabled = True set queue uno started = True # # Create and define queue dos # create queue dos set queue dos queue_type = Execution set queue dos acl_host_enable = False set queue dos acl_hosts = zarate-3 set queue dos acl_hosts += zarate-2 set queue dos resources_default.neednodes = dos <---------------- AND THIS LINE set queue dos resources_default.nodes = 1:dos set queue dos enabled = True set queue dos started = True # # Set server attributes. # set server scheduling = True set server acl_hosts = zarate-0 set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 6 set server next_job_number = 37 but when it as a user i get: [rroman at zarate-0:~/outputs]$ qmgr -c "p s" # # Create queues and set their attributes. 
# # # Create and define queue uno # create queue uno set queue uno queue_type = Execution set queue uno acl_host_enable = False set queue uno acl_hosts = zarate-1 set queue uno acl_hosts += zarate-0 set queue uno resources_default.nodes = 1:uno set queue uno enabled = True set queue uno started = True # # Create and define queue dos # create queue dos set queue dos queue_type = Execution set queue dos acl_host_enable = False set queue dos acl_hosts = zarate-3 set queue dos acl_hosts += zarate-2 set queue dos resources_default.nodes = 1:dos set queue dos enabled = True set queue dos started = True # # Set server attributes. # set server scheduling = True set server acl_hosts = zarate-0 set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 6 set server next_job_number = 37 see those 2 lines up there? those 2 dont show up as user... is this normal? could this be messing with my configuration in a way taht Maui cant assign the correct nodes on job submissions? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111130/dde76b83/attachment.html From gus at ldeo.columbia.edu Wed Nov 30 13:37:12 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Wed, 30 Nov 2011 15:37:12 -0500 Subject: [torqueusers] specific nodes In-Reply-To: References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> Message-ID: Hi Ricardo We do something along these lines here with Maui and Torque. In the Torque $Torque/server_priv/nodes file, add a distinctive 'property' to each type of node. Something like this, I call them here PS3, CPUGPU and CPUONLY [make the appropriate changes to your reality]: an_ibm_node np=8 PS3 ... a_gpu_node np=16 CPUGPU ... a_cpu_only_node np=4 CPUONLY [There may be an additional item on the CPUGPU line for Torque gpu control, maybe 'gpus=2' or something like that. 
I don't have GPU nodes, check the Torque Admin guide.] With Torque qmgr create the queues you want (IBMCELL, TESLA, XEON), and set the default nodes on each: set queue IBMCELL resources_default.neednodes = PS3 set queue TESLA resources_default.neednodes = CPUGPU set queue XEON resources_default.neednodes = CPUONLY Add this line to your $Maui/maui.cfg ENABLEMULTIREQJOBS TRUE Restart the pbs_server, and maui. We use this to separate development and production nodes [not for gpu] on a per-queue basis. The user needs only to indicate the queue and the number of nodes and ppn he/she wants [possibly the number of gpus also in your case] in the Torque submission script. No need to mention the node properties in the submission script. It works for me. Documentation is here: http://www.adaptivecomputing.com/resources/docs/ I hope this helps, Gus Correa On Nov 30, 2011, at 3:08 PM, Ricardo Rom?n Brenes wrote: > Well I am using torque+maui but even so i cant get the maui to assign the nodes correctly; a job just runs on all nodes not just the ones i want ... > > On Wed, Nov 30, 2011 at 2:01 PM, Lloyd Brown wrote: > Not so much the wrong mailing list, but the wrong product. In the end > Torque is really about resource management, launching jobs, etc., but > not the decision making. They happen to include a very basic scheduler > ("pbs_sched"), but it's very, very basic. If you want anything more, > you're going to have to look at Moab or Maui, to use with Torque. Or > there are other scheduling systems out there as well, that don't use Torque. > > For such a small/simple cluster, I'd recommend Torque with Maui, but > you'll have to do some investigation. > > > Lloyd Brown > Systems Administrator > Fulton Supercomputing Lab > Brigham Young University > http://marylou.byu.edu > > > > On 11/30/2011 12:56 PM, Ricardo Rom?n Brenes wrote: > > so wrong mailing list huh? 
> > > > sorry to bother > > > > thanks for your time > > > > On Wed, Nov 30, 2011 at 1:52 PM, Lloyd Brown > > wrote: > > > > Ricardo, > > > > Have you seen section 4.1.4 ("Mapping a Queue to a Subset of Resources") > > in the Torque documentation? It might give you some ideas. However, > > the short answer to your question, as seen in that section is this: > > > > > TORQUE does not currently provide a simple mechanism for mapping > > queues to nodes. However, schedulers such as Moab and Maui can > > provide this functionality. > > > > > > Lloyd Brown > > Systems Administrator > > Fulton Supercomputing Lab > > Brigham Young University > > http://marylou.byu.edu > > > > > > > > On 11/30/2011 12:37 PM, Ricardo Rom?n Brenes wrote: > > > Hello everyone thanks for the time of reading and the long post :P > > > > > > > > > The question is about multiple queues with Torque: > > > > > > > > > We have here different clusternodes with difrente architectures: > > > 4 PS-3 > > > 3 CPU+GPU > > > 2 CPU > > > > > > and i want to be able to send jobs to each of hte nodes independly > > > (using torque). Im guessing that having several queues and that each > > > node belonging to a queue in particular and then submittint jobs > > to that > > > queue will do the trick: > > > > > > say i got 4 queues > > > IBMCELL with the 4 PS-3 > > > TESLA with the 3 nodes that have GPUs > > > XEON with te 5 nodes that have Xeons (which in turn 3 of them have > > > teslas :P) > > > > > > and when i submit a job: > > > qsub -q IBMCELL a.pbs > > > should run on the PS-3 only, but im not being able to make it work > > like > > > that. > > > > > > As a test i made 2 queues in the PS3 pbs_server ("uno" and "dos"): > > > > > > # > > > # Create queues and set their attributes. 
> > > # > > > # > > > # Create and define queue uno > > > # > > > *create queue uno > > > **set queue uno queue_type = Execution > > > **set queue uno acl_host_enable = False > > > **set queue uno acl_hosts = zarate-0+zarate-1 > > > **set queue uno enabled = True > > > **set queue uno started = True > > > *# > > > # Create and define queue dos > > > # > > > *create queue dos > > > **set queue dos queue_type = Execution > > > **set queue dos acl_host_enable = **False** > > > **set queue dos acl_hosts = zarate-2+zarate-3 > > > **set queue dos enabled = True > > > **set queue dos started = True > > > *# > > > # Set server attributes. > > > # > > > set server scheduling = True > > > set server acl_hosts = zarate-0 > > > set server log_events = 511 > > > set server mail_from = adm > > > set server scheduler_iteration = 600 > > > set server node_check_rate = 150 > > > set server tcp_timeout = 6 > > > set server next_job_number = 22 > > > > > > > > > and i changed the _nodes_ file in the server_priv directory so it is > > > like this (zarate are just the hostname :P): > > > > > > > > > zarate-0 np=2 uno > > > zarate-1 np=2 uno > > > zarate-2 np=2 dos > > > zarate-3 np=2 dos > > > > > > > > > > > > but its not working... 
when i launch a job: > > > > > > #PBS -N mpi_hello > > > /usr/local/bin/mpiexec -n 8 /home/rroman/a.out > > > > > > > > > with teh command: > > > > > > #PBS -N mpi_hello > > > > > > /usr/local/bin/mpiexec -n 8 /home/rroman/a.out > > > > > > > > > the output file is: > > > > > > zarate-1: hello world from process 2 of 8 > > > zarate-2: hello world from process 5 of 8 > > > zarate-2: hello world from process 6 of 8 > > > zarate-3: hello world from process 0 of 8 > > > zarate-3: hello world from process 7 of 8 > > > zarate-1: hello world from process 3 of 8 > > > zarate-0: hello world from process 4 of 8 > > > zarate-3: hello world from process 1 of 8 > > > > > > > > > > > > And there it shows that the job is running in ALL the nodes instead of > > > running only in zarate-0 and zarate-1 as the queue said (according > > to me :P) > > > > > > > > > > > > > > > SO! the question is: is it possible to do waht i want like this? > > and if > > > so, what am i doing wrong! :P > > > > > > Thank you Kay! 
> > > > > > -ricardo > > > > > > > > > > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From roman.ricardo at gmail.com Wed Nov 30 14:11:58 2011 From: roman.ricardo at gmail.com (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Wed, 30 Nov 2011 15:11:58 -0600 Subject: [torqueusers] specific nodes In-Reply-To: References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> Message-ID: Ill post some more info since im pretty desperate right now :P this is my nodes file: zarate-0 np=2 uno zarate-1 np=2 uno zarate-2 np=2 dos zarate-3 np=2 dos these are my queues: # # Create queues and set their attributes. # # # Create and define queue uno # create queue uno set queue uno queue_type = Execution set queue uno resources_default.neednodes = uno set queue uno enabled = True set queue uno started = True # # Create and define queue dos # create queue dos set queue dos queue_type = Execution set queue dos resources_default.neednodes = dos set queue dos enabled = True set queue dos started = True # # Set server attributes. 
# set server scheduling = True set server acl_hosts = zarate-0 set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 6 set server next_job_number = 44 and my maui.cfg: # maui.cfg 3.3 SERVERHOST zarate-0 ADMIN1 root RMCFG[zarate-0] TYPE=PBS AMCFG[bank] TYPE=NONE RMPOLLINTERVAL 00:00:30 SERVERPORT 42559 SERVERMODE NORMAL LOGFILE maui.log LOGFILEMAXSIZE 10000000 LOGLEVEL 3 QUEUETIMEWEIGHT 1 BACKFILLPOLICY FIRSTFIT RESERVATIONPOLICY CURRENTHIGHEST NODEALLOCATIONPOLICY MINRESOURCE ENABLEMULTIREQJOBS TRUE CLASSCFG[uno] CLASSCFG[dos] *note: running qmgr -c "p s" as a regular user non-root i get a different config display... so Im running this hellow.c mpi example, it just says hi from different nodes: #PBS -N hello_w #PPS -q uno /usr/local/bin/mpiexec -n 8 /home/rroman/a.out the output im expecting is that only the nodes with property "uno" should say hi but., this is the actual output: zarate-0: hello world from process 2 of 8 zarate-2: hello world from process 3 of 8 zarate-3: hello world from process 5 of 8 zarate-1: hello world from process 0 of 8 zarate-3: hello world from process 6 of 8 zarate-1: hello world from process 7 of 8 zarate-2: hello world from process 4 of 8 zarate-1: hello world from process 1 of 8 They all greet me... =( -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20111130/6d287bba/attachment.html From gus at ldeo.columbia.edu Wed Nov 30 15:22:58 2011 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Wed, 30 Nov 2011 17:22:58 -0500 Subject: [torqueusers] specific nodes In-Reply-To: References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> Message-ID: <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu> You don't have 8 CPUs of type 'uno'. This seems to conflict with your mpirun command with -np=8. 
You need to match the number of processors you request from Torque and
the number of processes you launch with mpirun.

Also, you wrote:

#PPS -q uno

Is this a typo in your email or in your Torque submission script?
It should be:

#PBS -q uno

In addition, your PBS script doesn't request nodes, something like
#PBS -l nodes=1:ppn=2
I suppose it will use the default for the queue uno.
However, your qmgr configuration doesn't set a default number of nodes to use,
either for the queues or for the server itself.

You could do:
qmgr -c 'set queue uno resources_default.nodes = 1'
and likewise for queue dos.

More important, is your mpi [and mpiexec] built with Torque support?
For instance, OpenMPI can be built with Torque support, so that it
will use the nodes provided by Torque to run the job.
However, stock packaged MPIs from yum or apt-get are probably not
integrated with Torque.
You would need to build it from source, which is not really hard.

If you use an mpi that is not integrated with Torque, you need to pass to mpirun/mpiexec
the file created by Torque with the node list.
The file name is held by the environment variable $PBS_NODEFILE.
The syntax varies depending on which mpi you are using; check your mpirun man page,
but it should be something like:

mpirun -hostfile $PBS_NODEFILE -np 2 ./a.out

[ The flag may be -machinefile instead of -hostfile, or something else, depending on your MPI.]

On Nov 30, 2011, at 4:11 PM, Ricardo Román Brenes wrote:

> Ill post some more info since im pretty desperate right now :P
>

Oh, yes.
You should always do this, if you want help from the list.
Do you see how much more help you get when you give all the information? :)

I hope this helps,
Gus Correa

>
> this is my nodes file:
> zarate-0 np=2 uno
> zarate-1 np=2 uno
> zarate-2 np=2 dos
> zarate-3 np=2 dos
>
> these are my queues:
> #
> # Create queues and set their attributes.
> #
> #
> #
> # Create and define queue uno
> #
> create queue uno
> set queue uno queue_type = Execution
> set queue uno resources_default.neednodes = uno
> set queue uno enabled = True
> set queue uno started = True
> #
> # Create and define queue dos
> #
> create queue dos
> set queue dos queue_type = Execution
> set queue dos resources_default.neednodes = dos
> set queue dos enabled = True
> set queue dos started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = zarate-0
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server next_job_number = 44
>
>
> and my maui.cfg:
>
> # maui.cfg 3.3
>
> SERVERHOST zarate-0
> ADMIN1 root
>
> RMCFG[zarate-0] TYPE=PBS
>
> AMCFG[bank] TYPE=NONE
>
> RMPOLLINTERVAL 00:00:30
>
> SERVERPORT 42559
> SERVERMODE NORMAL
>
> LOGFILE maui.log
> LOGFILEMAXSIZE 10000000
> LOGLEVEL 3
>
> QUEUETIMEWEIGHT 1
>
> BACKFILLPOLICY FIRSTFIT
> RESERVATIONPOLICY CURRENTHIGHEST
>
> NODEALLOCATIONPOLICY MINRESOURCE
>
> ENABLEMULTIREQJOBS TRUE
>
> CLASSCFG[uno]
> CLASSCFG[dos]
>
>
> *note: running qmgr -c "p s" as a regular user non-root i get a different config display...
>
>
> so Im running this hellow.c mpi example, it just says hi from different nodes:
> #PBS -N hello_w
> #PPS -q uno
> /usr/local/bin/mpiexec -n 8 /home/rroman/a.out
>
>
>
> the output im expecting is that only the nodes with property "uno" should say hi but., this is the actual output:
> zarate-0: hello world from process 2 of 8
> zarate-2: hello world from process 3 of 8
> zarate-3: hello world from process 5 of 8
> zarate-1: hello world from process 0 of 8
> zarate-3: hello world from process 6 of 8
> zarate-1: hello world from process 7 of 8
> zarate-2: hello world from process 4 of 8
> zarate-1: hello world from process 1 of 8
>
>
> They all greet me...
> =(

From roman.ricardo at gmail.com Wed Nov 30 15:38:44 2011
From: roman.ricardo at gmail.com (Ricardo Román Brenes)
Date: Wed, 30 Nov 2011 16:38:44 -0600
Subject: [torqueusers] specific nodes
In-Reply-To: <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu>
References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu>
Message-ID:

Thank you so much for your help =) Yet I still have matters to discuss.

On Wed, Nov 30, 2011 at 4:22 PM, Gustavo Correa wrote:

> You don't have 8 CPUs of type 'uno'.
> This seems to conflict with your mpirun command with -np=8.
> You need to match the number of processors you request from Torque and
> the number of processes you launch with mpirun.
>

1. Why does there have to be a match between processors and processes? I could run 1024 processes on 1 processor (without Torque). Requesting 2 nodes I could spawn 10000 processes...

> Also, you wrote:
>
> #PPS -q uno
>
> Is this a typo in your email or in your Torque submission script?
> It should be:
>
> #PBS -q uno
>
> In addition, your PBS script doesn't request nodes, something like
> #PBS -l nodes=1:ppn=2
> I suppose it will use the default for the queue uno.
> However, your qmgr configuration doesn't set a default number of nodes to
> use,
> either for the queues or for the server itself.
>
> You could do:
> qmgr -c 'set queue uno resources_default.nodes = 1'
> and likewise for queue dos.
>

2. That's in fact a typo. In the script it says #PBS

> More important, is your mpi [and mpiexec] built with Torque support?
> For instance, OpenMPI can be built with Torque support, so that it
> will use the nodes provided by Torque to run the job.
> However, stock packaged MPIs from yum or apt-get are probably not
> integrated with Torque.
> You would need to build it from source, which is not really hard.
>
> If you use an mpi that is not integrated with Torque, you need to pass to
> mpirun/mpiexec
> the file created by Torque with the node list.
> The file name is held by the environment variable $PBS_NODEFILE.
> The syntax varies depending on which mpi you are using; check your mpirun
> man page,
> but it should be something like:
>
> mpirun -hostfile $PBS_NODEFILE -np 2 ./a.out
>

3. My MPICH2 is version 1.2.1p1. I don't recall if I compiled it with Torque support. Even so, I don't have a variable $PBS_NODEFILE (doing an "echo $PBS_NODEFILE" returns an empty line).

4. I don't know if this is my problem or not, but you talk about mpirun and mpiexec as if they were the same, yet I have used mpiexec most of the time and I'm not sure about the similarities (or differences). You asked if my MPIEXEC is built with Torque, but a few lines below you mention MPIRUN.

> [ The flag may be -machinefile instead of -hostfile, or something else,
> depending on your MPI.]
>
>
> On Nov 30, 2011, at 4:11 PM, Ricardo Román Brenes wrote:
>
> > Ill post some more info since im pretty desperate right now :P
> >
>
> Oh, yes.
> You should always do this, if you want help from the list.
> Do you see how much more help you get when you give all the information?
> :)
>
>
> I hope this helps,
> Gus Correa

From lloyd_brown at byu.edu Wed Nov 30 15:51:43 2011
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Wed, 30 Nov 2011 15:51:43 -0700
Subject: [torqueusers] specific nodes
In-Reply-To:
References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu>
Message-ID: <4ED6B37F.3060605@byu.edu>

On 11/30/2011 03:38 PM, Ricardo Román Brenes wrote:
>
> 1.
> Why does there have to be a match between processors and processes? I could
> run 1024 processes on 1 processor (without Torque). Requesting 2 nodes I
> could spawn 10000 processes...
>

Ricardo,

I suspect it was just a general recommendation. You're right. Nothing is keeping you from launching more processes than you have processors. Having said that, though, in general, it's a bad idea. Unless your processes are spending a significant amount of time idle or blocked (e.g. doing I/O), you will see significant slowdowns. Also, if another job is on the same node, the processes might bump into each other, and both will slow down.

> > mpirun -hostfile $PBS_NODEFILE -np 2 ./a.out
> >
>
> 3. My MPICH2 is version 1.2.1p1. I don't recall if I compiled it with
> Torque support. Even so, I don't have a variable $PBS_NODEFILE (doing an
> "echo $PBS_NODEFILE" returns an empty line).
>

The $PBS_NODEFILE variable is only populated within a running job's environment. It contains a path to a file that lists the nodes that your job was assigned. So, inside of a job, "echo $PBS_NODEFILE" should give you the path to that temporary file, and "cat $PBS_NODEFILE" will give you the contents.

>
> 4. I don't know if this is my problem or not, but you talk about mpirun
> and mpiexec as if they were the same, yet I have used mpiexec most of
> the time and I'm not sure about the similarities (or differences). You
> asked if my MPIEXEC is built with Torque, but a few lines below you
> mention MPIRUN.
>

In the early days of MPICH (e.g. MPICH1, not MPICH2), mpirun was provided by MPICH, and mpiexec was something separate. I don't know if the same holds true with MPICH2 or not; like Gustavo, I mostly use OpenMPI, where they're both the same.

Given what you've told us so far, you potentially have two separate problems: the scheduler, and the MPI process launching. It might make the most sense to focus on just the scheduler for the time being.
What happens if the entire body of your job script is just a "cat $PBS_NODEFILE", something like this:

> #!/bin/bash
> #PBS -q uno
> #PBS -l nodes=2:ppn=2,walltime=00:00:30
> echo "Nodes Assigned:"
> cat $PBS_NODEFILE
> echo "done"

From gus at ldeo.columbia.edu Wed Nov 30 17:21:33 2011
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Wed, 30 Nov 2011 19:21:33 -0500
Subject: [torqueusers] specific nodes
In-Reply-To:
References: <4ED6896C.5060808@byu.edu> <4ED68BB4.5030105@byu.edu> <56AF3BA8-7E7F-47EB-862B-F5F68ECBDCC7@ldeo.columbia.edu>
Message-ID: <14AC74A3-3CD3-4681-BD12-6B0225B512B6@ldeo.columbia.edu>

Answers inline

On Nov 30, 2011, at 5:38 PM, Ricardo Román Brenes wrote:

> Thank you so much for your help =) Yet I still have matters to discuss.
>
>
> On Wed, Nov 30, 2011 at 4:22 PM, Gustavo Correa wrote:
> You don't have 8 CPUs of type 'uno'.
> This seems to conflict with your mpirun command with -np=8.
> You need to match the number of processors you request from Torque and
> the number of processes you launch with mpirun.
>
>
>
> 1. Why does there have to be a match between processors and processes? I could run 1024 processes on 1 processor (without Torque). Requesting 2 nodes I could spawn 10000 processes...
>

You can oversubscribe the processors with MPI tasks, if you want.
The MPI distributions brag that you can do it, and in many cases it works alright.
In general, if your MPI tasks are of 'hello world' type, oversubscribing is not a problem;
you can run thousands of processes on a trifle of CPUs.
However, if you are doing real HPC, then it is another story.
Switching context very often doesn't seem to get along very well with MPI.
Paging to disk will most likely be a killer.

As far as I know, Torque is designed *not* to oversubscribe CPUs.
Resource managers like Torque are designed in principle for HPC (but not only),
so they tend to have this underlying assumption of "one processor for each process".
If you want to oversubscribe with Torque, trick it by setting a larger number of processors
in the $Torque/server_priv/nodes file [e.g. np=10000 instead of np=2].
It will probably run hello world right.
Then try heavier algorithms, and deal with the consequences .... :)

>
> Also, you wrote:
>
> #PPS -q uno
>
> Is this a typo in your email or in your Torque submission script?
> It should be:
>
> #PBS -q uno
>
> In addition, your PBS script doesn't request nodes, something like
> #PBS -l nodes=1:ppn=2
> I suppose it will use the default for the queue uno.
> However, your qmgr configuration doesn't set a default number of nodes to use,
> either for the queues or for the server itself.
>
> You could do:
> qmgr -c 'set queue uno resources_default.nodes = 1'
> and likewise for queue dos.
>
>
>
> 2. That's in fact a typo. In the script it says #PBS
>
>
>
> More important, is your mpi [and mpiexec] built with Torque support?
> For instance, OpenMPI can be built with Torque support, so that it
> will use the nodes provided by Torque to run the job.
> However, stock packaged MPIs from yum or apt-get are probably not
> integrated with Torque.
> You would need to build it from source, which is not really hard.
>
> If you use an mpi that is not integrated with Torque, you need to pass to mpirun/mpiexec
> the file created by Torque with the node list.
> The file name is held by the environment variable $PBS_NODEFILE.
> The syntax varies depending on which mpi you are using; check your mpirun man page,
> but it should be something like:
>
> mpirun -hostfile $PBS_NODEFILE -np 2 ./a.out
>
>
> 3. My MPICH2 is version 1.2.1p1. I don't recall if I compiled it with Torque support. Even so, I don't have a variable $PBS_NODEFILE (doing an "echo $PBS_NODEFILE" returns an empty line).
>

It is a Torque variable.
You will have it inside the Torque submission script only, not on your Linux shell per se.
Try "echo $PBS_NODEFILE" and "cat $PBS_NODEFILE" inside the Torque script.
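
A minimal sketch of that check, assuming a bash job script and reusing the queue name and resource request already shown in this thread. The function name report_nodes is only illustrative, and the commented-out launcher line assumes an mpiexec that accepts -machinefile (some MPIs use -hostfile instead, so check the man page). The guard just keeps the script harmless when it is run outside a job, where the variable is empty:

```shell
#!/bin/bash
#PBS -q uno
#PBS -l nodes=2:ppn=2,walltime=00:00:30

# Print the node list Torque assigned to this job and the number of
# entries in it. Torque writes one line per assigned processor slot.
report_nodes() {
    if [ -z "${PBS_NODEFILE:-}" ] || [ ! -r "$PBS_NODEFILE" ]; then
        echo "PBS_NODEFILE is unset or unreadable: probably not inside a Torque job"
        return 1
    fi
    echo "Node file: $PBS_NODEFILE"
    cat "$PBS_NODEFILE"
    echo "Slots assigned: $(wc -l < "$PBS_NODEFILE" | tr -d ' ')"
}

report_nodes || true

# Assumption: for an MPI without Torque support, pass the node file
# explicitly and launch one process per slot, for example:
# mpiexec -machinefile "$PBS_NODEFILE" -n "$(wc -l < "$PBS_NODEFILE")" ./a.out
```

If the node names printed inside the job are only the 'uno' nodes but the MPI processes still land elsewhere, the scheduler side is fine and the launcher is the part ignoring the node file.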
If you don't have Torque support in your MPICH2 you definitely need to pass
the -machinefile or -hostfile $PBS_NODEFILE.

In fact, if you already set some default machinefile/hostfile in your MPICH2 directories,
you may be using that file inadvertently, instead of the nodes that Torque gives to your job.
Did you set a default machine file in MPICH2?
Does it contain all of your cluster?
This may perhaps explain why your job executes on nodes that you didn't expect it to run on.

>
> 4. I don't know if this is my problem or not, but you talk about mpirun and mpiexec as if they were the same, yet I have used mpiexec most of the time and I'm not sure about the similarities (or differences). You asked if my MPIEXEC is built with Torque, but a few lines below you mention MPIRUN.

The traditional name is mpirun, most MPIs changed to mpiexec, many have both,
sometimes just a soft link or alias to each other.
Check what you have.
Be careful if you installed several different MPIs: make sure you know exactly which one you are using
to compile [mpicc, mpif90] and use the same one to run [mpirun/mpiexec].
They *don't* mix well.

Gus Correa

>
> [ The flag may be -machinefile instead of -hostfile, or something else, depending on your MPI.]
>
>
> On Nov 30, 2011, at 4:11 PM, Ricardo Román Brenes wrote:
>
> > Ill post some more info since im pretty desperate right now :P
> >
>
> Oh, yes.
> You should always do this, if you want help from the list.
> Do you see how much more help you get when you give all the information? :)
>
>
> I hope this helps,
> Gus Correa

From cholam20 at yahoo.co.in Wed Nov 30 17:33:50 2011
From: cholam20 at yahoo.co.in (revathi ganesh)
Date: Thu, 1 Dec 2011 06:03:50 +0530 (IST)
Subject: [torqueusers] Hey there How are you!!
Message-ID: <1322699630.51396.androidMobile@web137306.mail.in.yahoo.com>


From zhaohscas at yahoo.com.cn Wed Nov 30 23:46:46 2011
From: zhaohscas at yahoo.com.cn (Hongsheng Zhao)
Date: Thu, 01 Dec 2011 14:46:46 +0800
Subject: [torqueusers] Cannot connect to default server host 'node32' - check pbs_server daemon.
Message-ID: <4ED722D6.10404@yahoo.com.cn>

Hi all,

When I issue the qstat -a command on my HPC cluster, I get the following errors:

----------
zhaohongsheng at node32:~> qstat -a
Cannot connect to default server host 'node32' - check pbs_server daemon.
qstat: cannot connect to server node32 (errno=111) Connection refused

But the ping echo reply from node32 is as follows:

zhaohongsheng at node32:~> ping node32
PING node32.nxu.edu.cn (202.201.128.36) 56(84) bytes of data.
64 bytes from node32.nxu.edu.cn (202.201.128.36): icmp_seq=1 ttl=64 time=0.016 ms
64 bytes from node32.nxu.edu.cn (202.201.128.36): icmp_seq=2 ttl=64 time=0.010 ms
64 bytes from node32.nxu.edu.cn (202.201.128.36): icmp_seq=3 ttl=64 time=0.010 ms
64 bytes from node32.nxu.edu.cn (202.201.128.36): icmp_seq=4 ttl=64 time=0.015 ms
64 bytes from node32.nxu.edu.cn (202.201.128.36): icmp_seq=5 ttl=64 time=0.012 ms
64 bytes from node32.nxu.edu.cn (202.201.128.36): icmp_seq=6 ttl=64 time=0.013 ms

--- node32.nxu.edu.cn ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5000ms
rtt min/avg/max/mdev = 0.010/0.012/0.016/0.004 ms
zhaohongsheng at node32:~>
----------

Any hints on this issue?

Regards
--
Hongsheng Zhao
School of Physics and Electrical Information Science,
Ningxia University, Yinchuan 750021, China