From fabien.archambault at univ-amu.fr Wed Feb 1 01:15:41 2012
From: fabien.archambault at univ-amu.fr (Fabien Archambault)
Date: Wed, 1 Feb 2012 09:15:41 +0100
Subject: [torqueusers] error opening pipe in PBSD_munge_authenticate: errno = 24
Message-ID:

Dear list,

I updated torque to version 2.5.9 and also updated CentOS. Since this
update I have had some issues with munge. I do not know what it is useful
for, but it is a dependency of python-pbs, which is required by jobmonarch.

I searched online for information about this message but did not find
anything.

This is the munge version:
munge-0.5.8-8.el5

I created the key using create-munge-key on the master and then copied it
to all nodes (all nodes have the same version).

In /var/log/messages I do not have any reference to munge.

# service munge status
munged (pid 9524) is running... (also on nodes)

# tail /var/log/munge/munged.log
2012-01-31 11:24:47 Notice: Exiting on signal=15
2012-01-31 11:24:47 Info: Wrote 1024 bytes to PRNG seed "/var/lib/munge/munge.seed"
2012-01-31 11:24:47 Notice: Stopping munge-0.5.8 daemon (pid 1704)
2012-01-31 11:24:47 Notice: Running on "slater.up.univ-mrs.fr" (147.94.185.151)
2012-01-31 11:24:47 Info: PRNG seeded with 1024 bytes from "/var/lib/munge/munge.seed"
2012-01-31 11:24:47 Info: Updating supplementary group mapping every 3600 seconds
2012-01-31 11:24:47 Info: Enabled supplementary group mtime check of "/etc/group"
2012-01-31 11:24:47 Notice: Starting munge-0.5.8 daemon (pid 9524)
2012-01-31 11:24:47 Info: Created 2 work threads
2012-01-31 11:24:47 Info: Found 35 users with supplementary groups in 0.001 seconds

Torque is configured with:
./configure --disable-mom --disable-cpuset --disable-gui --with-rcp=scp

If someone has some clue.
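[Editor's note] On Linux, errno 24 is EMFILE ("Too many open files"), so one
thing that may be worth checking is whether the process that reports the
error has run out of file descriptors. The sketch below is only an
illustration under that assumption: it relies on /proc being available, and
the helper names count_open_fds and soft_nofile_limit are made up for this
note, not part of Torque or munge.

```shell
#!/bin/sh
# Sketch: compare a process's open file descriptors with its soft limit.
# Assumes Linux with /proc mounted; inspecting another user's process
# needs root (or the same user).

# Print the number of file descriptors currently open by PID $1.
count_open_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Print the soft "Max open files" limit of PID $1
# (fourth field of that line in /proc/PID/limits).
soft_nofile_limit() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Example: inspect the current shell. Substitute the PID of whichever
# process logs the "errno = 24" message.
pid=$$
echo "pid $pid: $(count_open_fds "$pid") open fds, soft limit $(soft_nofile_limit "$pid")"
```

If the open count sits at or near the soft limit, a descriptor leak in that
process or a low "ulimit -n" would be a plausible place to look next.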
Thanks,
Fabien
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120201/c49d4932/attachment.html

From listsarnau at gmail.com Wed Feb 1 03:32:21 2012
From: listsarnau at gmail.com (Arnau Bria)
Date: Wed, 1 Feb 2012 11:32:21 +0100
Subject: [torqueusers] momctl -h $node issue
Message-ID: <20120201113221.2483b218@amarrosa.pic.es>

Hi all,

the first time I run momctl -d3 -h $node, I always get this error:

# momctl -d3 -h $workernode
ERROR: query[0] 'diag3' failed on $workernode (errno=0-Success: 5-Input/output error)

The second time it works fine:

# momctl -d3 -h $workernode

Host: $workernode/$workernode Version: 2.5.9 PID: 13822
Init Msgs Received: 0 hellos/1 cluster-addrs
[...]

I'm wondering whether we have a network issue or whether this is normal
command behaviour.

Has anyone had this problem before? What is the source of this error message?

TIA,
Arnau

As it could be relevant, here is our network configuration:

Our server is not in the same network/VLAN as our clients:

server: 193.109.174
clients: 192.168.101

tracepath from server to client:

# tracepath $workernode
 1: $server (193.109.174.13) 0.218ms pmtu 1500
 1: CISCO ROUTER (193.109.174.1) 0.868ms
 2: ARISTA SWITCH/ROUTER (192.168.50.) 0.307ms
 3: $workernode (192.168.101.108) 0.192ms reached
 Resume: pmtu 1500 hops 3 back 3

MTU on server is 1500, on nodes 9000.
Nodes and client have active-backup bonding.

From Gareth.Williams at csiro.au Thu Feb 2 20:17:36 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Fri, 3 Feb 2012 14:17:36 +1100
Subject: [torqueusers] requesting gpus
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74DC9@exvic-mbx04.nexus.csiro.au>

Hi All,

I added basic GPU count information to one of our compute nodes with:

qmgr -c 's n n121 gpus = 2'

and it seems fine:

> pbsnodes -a n121
n121
     state = free
     np = 12
     ntype = cluster
     status = rectime=1328238593,varattr=,jobs=,state=free,size=133709780kb:144492840kb,netload=156768229618,gres=,loadave=2.00,ncpus=24,physmem=99195396kb,availmem=95103784kb,totmem=101299868kb,idletime=173222,nusers=0,nsessions=0,uname=Linux n121 2.6.32.49-0.3-default #1 SMP 2011-12-02 11:28:04 +0100 x86_64,opsys=sles11,arch=x86_64
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 2

However, when I run a job with the recommended syntax:
http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/3.7schedulinggpus.php

I get:

> qsub -I -q viz -l nodes=1:ppn=1:gpus=1
qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes

The torque version is 3.0.3-snap.201108261653

Note that this is _not_ the --enable-nvidia-gpus functionality.
Also note that the server has not been restarted.
The scheduler is moab but I'm pretty sure the job gets rejected well
before moab comes into the picture.

Does anyone have such a setup working or can anyone see what is wrong
(or have an idea of where to look)?

Regards,

Gareth
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120203/b5961f5e/attachment.html

From cwest at vpac.org Thu Feb 2 22:55:18 2012
From: cwest at vpac.org (Craig West)
Date: Fri, 03 Feb 2012 16:55:18 +1100
Subject: [torqueusers] requesting gpus
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74DC9@exvic-mbx04.nexus.csiro.au>
References: <007DECE986B47F4EABF823C1FBB19C620102CDD74DC9@exvic-mbx04.nexus.csiro.au>
Message-ID: <4F2B76C6.2080104@vpac.org>

Hi Gareth,

> However when I run a job with the recommended syntax:
> http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/3.7schedulinggpus.php
>
> I get:
>> qsub -I -q viz -l nodes=1:ppn=1:gpus=1
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
>
> The torque version is 3.0.3-snap.201108261653
>
> Note that this is _/not/_ the --enable-nvidia-gpus functionality.
> Also note that the server has not been restarted.
> The scheduler is moab but I'm pretty sure the job gets rejected well
> before moab comes into the picture.
>
> Does anyone have such a setup working or can anyone see what is wrong
> (or have an idea of where to look)?

Your pbsnodes output looks correct and similar to our systems.

Few questions for you:
1. What version of Moab are you running?
2. Does the Viz queue have the ability to schedule to that node?
3. What is in the "Configured Resources" line of "checknode n121"?
   It should have a "GPUS: 2" parameter.

Cheers,
Craig.

--
Craig West                   Systems Manager
Victorian Partnership for Advanced Computing
110 Victoria Street, Carlton South VIC 3053
P: +61 3 9925 4751
E: cwest at vpac.org
http://www.vpac.org

From jonathan.michalon at etu.unistra.fr Fri Feb 3 01:58:10 2012
From: jonathan.michalon at etu.unistra.fr (Jonathan Michalon)
Date: Fri, 3 Feb 2012 09:58:10 +0100
Subject: [torqueusers] [Patch] GPUs by the way of GRES
Message-ID: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net>

Hi Maui folks,

GPUs in Maui are a long standing problem.
Last year a patch was sent by Mariusz Mamoński [1], which works based on
GRES parameters.
I've just made GPUs kind of working, by enhancing that patch. Please find
attached the resulting patch, which works well for Maui 3.3.1.
It defines a special GRES named "gpu" which works as expected on my test cases.

Note that GRES behaviour seems quite confused, as GRES are sometimes described
as consumable. This patch removes that behaviour, for the needs of GPUs.

To use the patch:
get the sources of maui-3.3.1 and patch them:
  patch -p1 < ../Patch-for-gpu-GRES.patch
then compile as usual.

You have to configure the GPUs in maui.cfg:
  NODECFG[nodename] GRES=gpu:2

Then when queuing jobs you can request GPUs with (Torque syntax):
  qsub -W x=GRES:gpu@1

I hope this helps, please test this and enhance to your needs!

[1]
http://www.supercluster.org/pipermail/mauiusers/2011-April/004622.html

Regards,

PS. This is the second attempt to send the mail?

--
Jonathan Michalon
IT student in Strasbourg
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Patch-for-gpu-GRES.patch
Type: text/x-patch
Size: 4803 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120203/e6847559/attachment-0001.bin

From Gareth.Williams at csiro.au Fri Feb 3 02:08:18 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Fri, 3 Feb 2012 20:08:18 +1100
Subject: [torqueusers] requesting gpus
In-Reply-To: <4F2B76C6.2080104@vpac.org>
References: <007DECE986B47F4EABF823C1FBB19C620102CDD74DC9@exvic-mbx04.nexus.csiro.au> <4F2B76C6.2080104@vpac.org>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74DD2@exvic-mbx04.nexus.csiro.au>

> -----Original Message-----
> From: Craig West [mailto:cwest at vpac.org]
> Sent: Friday, 3 February 2012 4:55 PM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] requesting gpus
>
>
> Hi Gareth,
>
> > However when I run a job with the recommended syntax:
> > http://www.adaptivecomputing.com/resources/docs/torque/3-0-3/3.7schedulinggpus.php
> >
> > I get:
> >> qsub -I -q viz -l nodes=1:ppn=1:gpus=1
> > qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
> >
> > The torque version is 3.0.3-snap.201108261653
> >
> > Note that this is _/not/_ the --enable-nvidia-gpus functionality.
> > Also note that the server has not been restarted.
> > The scheduler is moab but I'm pretty sure the job gets rejected well
> > before moab comes into the picture.
> >
> > Does anyone have such a setup working or can anyone see what is wrong
> > (or have an idea of where to look)?
>
> Your pbsnodes output looks correct and similar to our systems.
>
> Few questions for you:
> 1. What version of Moab are you running?
> 2. Does the Viz queue have the ability to schedule to that node?
> 3. What is in the "Configured Resources" line of "checknode n121"?
>    It should have a "GPUS: 2" parameter.
>
> Cheers,
> Craig.
-snip-

1) Moab Version: 6.0.2 - due for an upgrade anytime
2) yes - and I can get jobs there with gpus as a gres, but that doesn't count them right
3) > checknode n121 | grep Configu
   Configured Resources: PROCS: 12 MEM: 94G SWAP: 96G DISK: 137G GPUS: 2

But I think moab is not getting to play a role. I've looked at logs but
confess that I've not turned up the logging level yet.

Gareth

From david at unistra.fr Fri Feb 3 02:20:24 2012
From: david at unistra.fr (R. David)
Date: Fri, 3 Feb 2012 10:20:24 +0100
Subject: [torqueusers] [Patch] GPUs by the way of GRES
In-Reply-To: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net>
References: <20120203095810.6ba1833b@RunningPenguin.chalmion.homelinux.net>
Message-ID: <9DB98485-EECB-48D1-8AEC-5F0877E6704D@unistra.fr>

Hello,

Here at the Computing center of the University of Strasbourg, we have been
using this patch with great success.
It makes GPU access much easier for our users, and our batch configuration
is now fully operational for GPUs.

Regards,

On 3 Feb 2012, at 09:58, Jonathan Michalon wrote:

> Hi Maui folks,
>
> GPUs in Maui are a long standing problem. Last year a patch was sent by Mariusz
> Mamoński [1], which works based on GRES parameters.
> I've just made GPUs kind of working, by enhancing that patch. Please find
> attached the resulting patch, which works well for Maui 3.3.1.
> It defines a special GRES named "gpu" which works as expected on my test cases.
>
> Note that GRES behaviour seems quite confused as sometimes they are mentioned
> as consumable. This patch annihilates this behaviour, for the needs of GPUs.
>
> To use the patch:
> get the sources of maui-3.3.1 and patch them:
> patch -p1 < ../Patch-for-gpu-GRES.patch
> then compile as usual.
>
> You have to configure the GPUs in maui.cfg:
> NODECFG[nodename] GRES=gpu:2
>
> Then when queuing jobs you can request GPUs with (Torque syntax):
> qsub -W x=GRES:gpu@1
>
> I hope this helps, please test this and enhance to your needs!
>
> [1]
> http://www.supercluster.org/pipermail/mauiusers/2011-April/004622.html
>
> Regards,
>
> PS. This is the second attempt to send the mail?
>
> --
> Jonathan Michalon
> IT student in Strasbourg
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

---------------------------------------------------------
R. David - david at unistra.fr
Head of the UdS meso-centre / IT Department (Direction Informatique)
Tel. : 03 68 85 45 48
---------------------------------------------------------

From leggett at mcs.anl.gov Fri Feb 3 08:27:11 2012
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Fri, 3 Feb 2012 09:27:11 -0600
Subject: [torqueusers] Torque not honoring max_user_queuable
Message-ID:

We've set queue limits that don't seem to be honored:

sdb:~ # qstat | grep linpyl | grep batch | wc
    945    5670   82215

sdb:~ # qmgr -c "print queue batch"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_user_queuable = 500
set queue batch resources_min.mppwidth = 1
set queue batch resources_default.mppwidth = 24
set queue batch resources_default.walltime = 00:10:00
set queue batch acl_group_enable = False
set queue batch resources_available.nodes = 726
set queue batch enabled = True
set queue batch started = True

How would it be possible for a user to have 945 jobs in the queue when the
limit should be 500?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 203 bytes
Desc: Message signed with OpenPGP using GPGMail
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120203/eedf3610/attachment.bin

From leggett at mcs.anl.gov Fri Feb 3 08:29:41 2012
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Fri, 3 Feb 2012 09:29:41 -0600
Subject: [torqueusers] Torque 2.5.9 MOMs keep segfaulting
In-Reply-To:
References: <1ebb959b-2dde-4ef0-9f8e-089d2ffb5d29@mail>
Message-ID: <37E90BE2-E4D5-4345-BD8E-510A5A03BC97@mcs.anl.gov>

Some more information on this problem. The issue is triggered by one user
who is using the Intel MPI implementation and using MPDs instead of hydra.
My guess is the MPDs are trying to communicate outside of the MOM and this
is confusing the MOMs and causing them to bail. I've asked the user to
switch to hydra instead but haven't heard back yet.

On Jan 16, 2012, at 10:44 AM, Ti Leggett wrote:

> They seem to die immediately. I can't really run them in gdb since it's
> randomly on nodes and I haven't found a way to trigger the failure.
>
> On Jan 11, 2012, at 2:52 PM, David Beer wrote:
>
>> Do they segfault right away? If you can't find a core file, would it be
>> possible to run the mom in gdb and get a backtrace of the crash when it
>> happens?
>>
>> David
>>
>> ----- Original Message -----
>>> torque was configured with --with-debug, "ulimit -c unlimited" is in
>>> the init script right before the moms are started like
>>> "/usr/sbin/pbs_mom -p -d /var/spool/torque" but I'm still not seeing
>>> a core file anywhere.
>>>
>>> On Jan 11, 2012, at 10:26 AM, David Beer wrote:
>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> I finally got around to doing this, but I don't see a core file in
>>>>> /var/spool/torque or in /usr/sbin. Where would the core get
>>>>> dumped?
>>>>>
>>>>
>>>> A mom's core file would be in /var/spool/torque/mom_priv. You need
>>>> to make sure ulimit -c is unlimited or set to a very large number.
>>>>
>>>> David
>>>>
>>>>> On Dec 20, 2011, at 3:03 PM, Ken Nielson wrote:
>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Troy Baer"
>>>>>>> To: "Torque Users Mailing List"
>>>>>>> Sent: Tuesday, December 20, 2011 8:59:56 AM
>>>>>>> Subject: Re: [torqueusers] Torque 2.5.9 MOMs keep segfaulting
>>>>>>>
>>>>>>> On Thu, 2011-12-08 at 10:36 -0600, Ti Leggett wrote:
>>>>>>>> I just upgraded from 2.5.7 to 2.5.9 on Tuesday and since then,
>>>>>>>> MOMs keep randomly segfaulting and dying. I see this in the MOM
>>>>>>>> log right before dying:
>>>>>>>>
>>>>>>>> 12/08/2011 10:09:14;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, comm failed Protocol failure in commit
>>>>>>>>
>>>>>>>>
>>>>>>>> And something similar to this in dmesg:
>>>>>>>>
>>>>>>>> pbs_mom[22354]: segfault at 0000000000000008 rip 00002b585249ed6f rsp 00007fff19e96df0 error 4
>>>>>>>
>>>>>>> We've also seen this on one of our systems and had to fall back
>>>>>>> to 2.5.8 on it.
>>>>>>>
>>>>>>> --Troy
>>>>>>> --
>>>>>>> Troy Baer, HPC System Administrator
>>>>>>> National Institute for Computational Sciences, University of Tennessee
>>>>>>> http://www.nics.tennessee.edu/
>>>>>>> Phone: 865-241-4233
>>>>>>
>>>>>> Could someone configure TORQUE using --with-debug and then send a
>>>>>> stack trace of the crash?
>>>>>>
>>>>>> Ken
>>>>>
>>>>
>>>> --
>>>> David Beer
>>>> Direct Line: 801-717-3386 | Fax: 801-717-3738
>>>> Adaptive Computing
>>>> 1712 S East Bay Blvd, Suite 300
>>>> Provo, UT 84606
>>>
>>
>> --
>> David Beer
>> Direct Line: 801-717-3386 | Fax: 801-717-3738
>> Adaptive Computing
>> 1712 S East Bay Blvd, Suite 300
>> Provo, UT 84606
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 203 bytes
Desc: Message signed with OpenPGP using GPGMail
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120203/3c72600c/attachment.bin

From dbeer at adaptivecomputing.com Fri Feb 3 09:06:45 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 3 Feb 2012 09:06:45 -0700
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To:
References:
Message-ID:

Ti,

How are you submitting the jobs? I assume this is TORQUE 2.5.9?

David

On Fri, Feb 3, 2012 at 8:27 AM, Ti Leggett wrote:

> We've set queue limits that don't seem to be honored:
>
> sdb:~ # qstat | grep linpyl | grep batch | wc
>     945    5670   82215
>
> sdb:~ # qmgr -c "print queue batch"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch max_user_queuable = 500
> set queue batch resources_min.mppwidth = 1
> set queue batch resources_default.mppwidth = 24
> set queue batch resources_default.walltime = 00:10:00
> set queue batch acl_group_enable = False
> set queue batch resources_available.nodes = 726
> set queue batch enabled = True
> set queue batch started = True
>
> How would it be possible for a user to have 945 jobs in the queue when the
> limit should be 500?

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120203/663ab9c8/attachment.html

From leggett at mcs.anl.gov Fri Feb 3 09:21:01 2012
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Fri, 3 Feb 2012 10:21:01 -0600
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To:
References:
Message-ID: <22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov>

I'm assuming using qsub, but it's other users doing this so I'm not 100%
sure. Is there a way to find out from logs or other tools?

On Feb 3, 2012, at 10:06 AM, David Beer wrote:

> Ti,
>
> How are you submitting the jobs? I assume this is TORQUE 2.5.9?
>
> David

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 203 bytes
Desc: Message signed with OpenPGP using GPGMail
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120203/ae9acfe5/attachment-0001.bin

From dbeer at adaptivecomputing.com Fri Feb 3 10:03:17 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 3 Feb 2012 10:03:17 -0700
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To: <22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov>
References: <22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov>
Message-ID:

If you qstat -f a few of the jobs you can see the submit arguments. At
higher log levels the entire job submission is there, but I don't know if
your log levels would be that high.

David

On Fri, Feb 3, 2012 at 9:21 AM, Ti Leggett wrote:

> I'm assuming using qsub, but it's other users doing this so I'm not 100%
> sure. Is there a way to find out from logs or other tools?
--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120203/a6d38b71/attachment.html

From leggett at mcs.anl.gov Fri Feb 3 10:15:14 2012
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Fri, 3 Feb 2012 11:15:14 -0600
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To:
References: <22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov>
Message-ID: <93149F70-EFCC-44F6-9CE3-022C475BEE70@mcs.anl.gov>

submit_args = -A CI-MCB000083 -l walltime=48:00:00, mppwidth=48 /lustre/beagle/linpyl/project.qsub

On Feb 3, 2012, at 11:03 AM, David Beer wrote:

> If you qstat -f a few of the jobs you can see the submit arguments. At
> higher log levels the entire job submission is there, but I don't know if
> your log levels would be that high.
>
> David

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 203 bytes
Desc: Message signed with OpenPGP using GPGMail
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120203/d6fd7730/attachment.bin

From jjc at iastate.edu Fri Feb 3 10:36:05 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Fri, 3 Feb 2012 17:36:05 +0000
Subject: [torqueusers] Torque not honoring max_user_queuable : Two commands to check
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E22101825614@ITSDAG1D.its.iastate.edu>

Ti Leggett,

I'd suggest checking two commands to confirm that there is a problem:

1) Really simple issue: Make sure your count is correct:

Issue:

qstat -u linpyl | awk '$3 == "batch" {print}' | wc -l

to see if this exceeds 500.

The command that you displayed would also count jobs whose name contains
"batch" (batchjob, say) and jobs submitted by any user whose name includes
linpyl as part of the name. So user linpylon could be adding to the total,
or linpyl could have jobs called batchjob in two different queues.
I encountered these issues because I have users who have similar names and
I have users who use the same name for every job. The command above should
avoid these issues and give a reliable count.

2) Did a torque admin change max_user_queuable before/after these jobs
were submitted?

Check the pbs_server logs to see if max_user_queuable was changed after
these jobs were submitted.

I am a torque admin, so I could get around max_user_queuable by changing it
and changing it back, as could any other torque admin, and as could someone
who has root privileges (knows the root password or has sudo capability).
The evidence should be in the logs then, though.

grep max_user_queuable /var/spool/torque/server_logs/2012*

should answer this question.

I have two backups, and a user could call them to ask them to up the count
temporarily. If you see evidence of this, I'd ask the other torque admins
first.

 James Coyle, PhD
 High Performance Computing Group
 115 Durham Center
 Iowa State Univ.           phone: (515)-294-2099
 Ames, Iowa 50011           web: http://jjc.public.iastate.edu/

>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ti Leggett
>Sent: Friday, February 03, 2012 9:27 AM
>To: Torque Users Mailing List
>Subject: [torqueusers] Torque not honoring max_user_queuable
>
>We've set queue limits that don't seem to be honored:
>
>sdb:~ # qstat | grep linpyl | grep batch | wc
>    945    5670   82215
>
>sdb:~ # qmgr -c "print queue batch"
>#
># Create queues and set their attributes.
>#
>#
># Create and define queue batch
>#
>create queue batch
>set queue batch queue_type = Execution
>set queue batch max_user_queuable = 500
>set queue batch resources_min.mppwidth = 1
>set queue batch resources_default.mppwidth = 24
>set queue batch resources_default.walltime = 00:10:00
>set queue batch acl_group_enable = False
>set queue batch resources_available.nodes = 726
>set queue batch enabled = True
>set queue batch started = True
>
>How would it be possible for a user to have 945 jobs in the queue
>when the limit should be 500?

From dbeer at adaptivecomputing.com Fri Feb 3 10:44:55 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 3 Feb 2012 10:44:55 -0700
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To: <93149F70-EFCC-44F6-9CE3-022C475BEE70@mcs.anl.gov>
References: <22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov> <93149F70-EFCC-44F6-9CE3-022C475BEE70@mcs.anl.gov>
Message-ID:

I'm also curious - is this done through a routing queue or routing queues?
Is it class remapping in Moab? It looks like it isn't qsub -q

David

On Fri, Feb 3, 2012 at 10:15 AM, Ti Leggett wrote:

> submit_args = -A CI-MCB000083 -l walltime=48:00:00,
> mppwidth=48 /lustre/beagle/linpyl/project.qsub

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120203/61c6331a/attachment-0001.html

From gadre at wisc.edu Fri Feb 3 10:57:06 2012
From: gadre at wisc.edu (Milind)
Date: Fri, 03 Feb 2012 11:57:06 -0600
Subject: [torqueusers] Maui does not know queue to node map? - queue system is failing, please HELP !
In-Reply-To: <76f0eb901311f0.4f2c1fcc@wiscmail.wisc.edu>
References: <777090be12ac6d.4f2c1036@wiscmail.wisc.edu>
 <7560af4b12f902.4f2c1074@wiscmail.wisc.edu>
 <7620b4a21297e8.4f2c10b0@wiscmail.wisc.edu>
 <75b0f66212cca8.4f2c10ed@wiscmail.wisc.edu>
 <75b09046129462.4f2c1129@wiscmail.wisc.edu>
 <75d0d3ee129b08.4f2c1166@wiscmail.wisc.edu>
 <7730b28512af6d.4f2c1259@wiscmail.wisc.edu>
 <7780ba0b12ab65.4f2c1295@wiscmail.wisc.edu>
 <75d0eec212aeb7.4f2c12d2@wiscmail.wisc.edu>
 <778092d912f8fe.4f2c130e@wiscmail.wisc.edu>
 <76f0c0d5129241.4f2c138b@wiscmail.wisc.edu>
 <75e097c412b192.4f2c13ca@wiscmail.wisc.edu>
 <76208747134a35.4f2c1715@wiscmail.wisc.edu>
 <7770a2ea137481.4f2c1752@wiscmail.wisc.edu>
 <75b088e8133cd6.4f2c178e@wiscmail.wisc.edu>
 <773097d6136bf7.4f2c1cf5@wiscmail.wisc.edu>
 <7770ba80131d91.4f2c1ed8@wiscmail.wisc.edu>
 <7560dc5c134556.4f2c1f53@wiscmail.wisc.edu>
 <75d0f8dd135f34.4f2c1f90@wiscmail.wisc.edu>
 <76f0eb901311f0.4f2c1fcc@wiscmail.wisc.edu>
Message-ID: <7730cf1e133768.4f2bcb92@wiscmail.wisc.edu>

Hello,

I am a cluster administrator at the University of Wisconsin-Madison. On our
cluster we have Maui (3.2.5) and OpenPBS 2.3 on the ROCKS 5.3 system.

For the last few days, our queue system has been going haywire: PBS accepts
jobs and puts them in the right queues, but the scheduler somehow does
something in the middle, and the job ends up on a 'wrong' compute node (one
which is not supposed to be in that queue), all the while PBS still lists
that job as running under the right queue.
example, PBS shows this:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
60606.bardeen             Cu1_a60_mov                      00:52:05 R fast

but the job is on a compute node which is not at all in the queue "fast" ! The pbs nodelist (/opt/torque/server_priv/nodes ) is all fine, no errors in maui logs.
In pbs logs, I get this message

10:54:19;0008;PBS_Server;Job;60606.bardeen.msae.wisc.edu;Job Modified at request of maui at bardeen.msae.wisc.edu

My guess is that maui is doing something wrong / it does not know the correct queue - to - node mapping.

Can someone suggest what is going on or guide me to solve this issue ??

thanks !!

Milind

From jjc at iastate.edu Fri Feb 3 11:39:01 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Fri, 3 Feb 2012 18:39:01 +0000
Subject: [torqueusers] Maui does not know queue to node map? - queue system is failing, please HELP !
In-Reply-To: <7730cf1e133768.4f2bcb92@wiscmail.wisc.edu>
References: <777090be12ac6d.4f2c1036@wiscmail.wisc.edu> <7560af4b12f902.4f2c1074@wiscmail.wisc.edu> <7620b4a21297e8.4f2c10b0@wiscmail.wisc.edu> <75b0f66212cca8.4f2c10ed@wiscmail.wisc.edu> <75b09046129462.4f2c1129@wiscmail.wisc.edu> <75d0d3ee129b08.4f2c1166@wiscmail.wisc.edu> <7730b28512af6d.4f2c1259@wiscmail.wisc.edu> <7780ba0b12ab65.4f2c1295@wiscmail.wisc.edu> <75d0eec212aeb7.4f2c12d2@wiscmail.wisc.edu> <778092d912f8fe.4f2c130e@wiscmail.wisc.edu> <76f0c0d5129241.4f2c138b@wiscmail.wisc.edu> <75e097c412b192.4f2c13ca@wiscmail.wisc.edu> <76208747134a35.4f2c1715@wiscmail.wisc.edu> <7770a2ea137481.4f2c1752@wiscmail.wisc.edu> <75b088e8133cd6.4f2c178e@wiscmail.wisc.edu> <773097d6136bf7.4f2c1cf5@wiscmail.wisc.edu> <7770ba80131d91.4f2c1ed8@wiscmail.wisc.edu> <7560dc5c134556.4f2c1f53@wiscmail.wisc.edu> <75d0f8dd135f34.4f2c1f90@wiscmail.wisc.edu> <76f0eb901311f0.4f2c1fcc@wiscmail.wisc.edu> <7730cf1e133768.4f2bcb92@wiscmail.wisc.edu>
Message-ID: <242421BFAF465844BE24EB90BB97E22101825674@ITSDAG1D.its.iastate.edu>

This is the Torque mailing list, OpenPBS has not been maintained in a long time. I upgraded to Torque when OpenPBS stopped being supported about 2004 if I recall correctly.

Torque is available from http://www.adaptivecomputing.com/products/torque.php

>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>bounces at supercluster.org] On Behalf Of Milind
>Sent: Friday, February 03, 2012 11:57 AM
>To: torqueusers at supercluster.org
>Subject: [torqueusers] Maui does not know queue to node map? - queue
>system is failing, please HELP !
>
>Hello,
>
>I am a cluster administrator at
>the University of Wisconsin-Madison. At our cluster we have Maui
>(3.2.5), OpenPBS 2.3 on the ROCKS 5.3 system.
>For last few days, our queue system has been haywire : the PBS
>accepts jobs and puts them in right queues, but the scheduler
>somehow does something in the middle, and the job ends up on a
>'wrong' compute node (which is not supposed to be in that queue),
>all the while PBS still lists that job as running under the right
>queue.
>
>example, PBS shows this:
>
>Job id                    Name             User            Time Use
>S Queue
>------------------------- ---------------- --------------- --------
>- -----
>60606.bardeen             Cu1_a60_mov                      00:52:05 R
>fast
>
>but the job is on a compute node which is not at all in the queue
>"fast" ! The pbs nodelist (/opt/torque/server_priv/nodes ) is all
>fine, no errors in maui logs.
>In pbs logs, I get this message
>
>10:54:19;0008;PBS_Server;Job;60606.bardeen.msae.wisc.edu;Job
>Modified at request of maui at bardeen.msae.wisc.edu
>
>My guess is that maui is doing something wrong / it does not know
>the correct queue - to - node mapping.
>
>Can someone suggest what is going on or guide me to solve this issue
>??
>
>thanks !!
>
>Milind
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers

From gadre at wisc.edu Fri Feb 3 12:13:32 2012
From: gadre at wisc.edu (Milind)
Date: Fri, 03 Feb 2012 13:13:32 -0600
Subject: [torqueusers] Maui does not know queue to node map? - queue system is failing, please HELP !
In-Reply-To: <7730b26c1324af.4f2c31cf@wiscmail.wisc.edu>
References: <7730b26c1324af.4f2c31cf@wiscmail.wisc.edu>
Message-ID: <76208faa13491a.4f2bdd7c@wiscmail.wisc.edu>

Hello,

I am a cluster administrator at the University of Wisconsin-Madison. At our cluster we have Maui (3.2.5), PBS 2.4.6 on the ROCKS 5.3 system. (sorry I wrote OpenPBS last email)
For last few days, our queue system has been haywire : the PBS accepts jobs and puts them in right queues, but the scheduler somehow does something in the middle, and the job ends up on a 'wrong' compute node (which is not supposed to be in that queue), all the while PBS still lists that job as running under the right queue.

example, PBS shows this:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
60606.bardeen             Cu1_a60_mov                      00:52:05 R fast

but the job is on a compute node which is not at all in the queue "fast" ! The pbs nodelist (/opt/torque/server_priv/nodes ) is all fine, no errors in maui logs.
In pbs logs, I get this message

10:54:19;0008;PBS_Server;Job;60606.bardeen.msae.wisc.edu;Job Modified at request of maui at bardeen.msae.wisc.edu

My guess is that maui is doing something wrong / it does not know the correct queue - to - node mapping.

Can someone suggest what is going on or guide me to solve this issue ??

thanks !!
Milind From Gareth.Williams at csiro.au Fri Feb 3 14:45:24 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Sat, 4 Feb 2012 08:45:24 +1100 Subject: [torqueusers] requesting gpus In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74DD2@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C620102CDD74DC9@exvic-mbx04.nexus.csiro.au> <4F2B76C6.2080104@vpac.org> <007DECE986B47F4EABF823C1FBB19C620102CDD74DD2@exvic-mbx04.nexus.csiro.au> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74DD4@exvic-mbx04.nexus.csiro.au> Matt Ismail at Warwick in the UK knew the problem/solution. > I reported this issue to Adaptive in August last year and it got fixed in torque-3.0.3-snap.201111071556. From the CHANGELOG: "Fixed a problem in qsub where you could not submit a job in interactive mode with gpus in the resource list." > If it is the same issue you're seeing it'll only be affecting interactive job submissions, i.e. qsub -I. I can confirm that in our current setup non-interactive jobs are OK - and we'll upgrade to make interactive jobs work too. 
Thanks, Gareth > -----Original Message----- > From: Gareth.Williams at csiro.au [mailto:Gareth.Williams at csiro.au] > Sent: Friday, 3 February 2012 8:08 PM > To: torqueusers at supercluster.org > Subject: Re: [torqueusers] requesting gpus > > > -----Original Message----- > > From: Craig West [mailto:cwest at vpac.org] > > Sent: Friday, 3 February 2012 4:55 PM > > To: torqueusers at supercluster.org > > Subject: Re: [torqueusers] requesting gpus > > > > > > Hi Gareth, > > > > > However when I run a job with the recommended syntax: > > > http://www.adaptivecomputing.com/resources/docs/torque/3-0- > > 3/3.7schedulinggpus.php > > > > > > I get: > > >> qsub -I -q viz -l nodes=1:ppn=1:gpus=1 > > > qsub: Job exceeds queue resource limits MSG=cannot locate feasible > > nodes > > > > > > The torque version is 3.0.3-snap.201108261653 > > > > > > Note that this is _/not/_ the --enable-nvidia-gpus functionality. > > > Also note that the server has not been restarted. > > > The scheduler is moab but I'm pretty sure the job gets rejected > well > > > before moab comes into the picture. > > > > > > Does anyone have such a setup working or can anyone see what is > wrong > > > (or have an idea of where to look)? > > > > Your pbsnodes output looks correct and similar to our systems. > > > > Few questions for you: > > 1. What version of Moab are you running? > > 2. Does the Viz queue have the ability to schedule to that node? > > 3. What is in the "Configured Resources" line of "checknode n121"? > > It should have a "GPUS: 2" parameter. > > > > Cheers, > > Craig. > -snip- > > 1) Moab Version: 6.0.2 - due for an upgrade anytime > 2) yes - and I can get jobs there with gpus as a gres but that doesn't > count them right > 3) > checknode n121 | grep Configu > Configured Resources: PROCS: 12 MEM: 94G SWAP: 96G DISK: 137G GPUS: > 2 > > But I think moab is not getting to play a role. I've looked at logs but > confess that I've not turned up the logging level yet. 
> > Gareth From nt_mahmood at yahoo.com Sat Feb 4 03:58:49 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Sat, 4 Feb 2012 02:58:49 -0800 (PST) Subject: [torqueusers] changing the column width of "qstat" Message-ID: <1328353129.37528.YahooMailNeo@web111704.mail.gq1.yahoo.com> Dear all, Is it possible to change the column width on "qstat"? Currently the terminal width is large enough but the job names are shown like "tiger-st-5000-64". The complete job name is "tiger-st-5000-64-64". So it truncate the last three characters. // Naderan *Mahmood; From leggett at mcs.anl.gov Sat Feb 4 10:44:40 2012 From: leggett at mcs.anl.gov (Ti Leggett) Date: Sat, 4 Feb 2012 11:44:40 -0600 Subject: [torqueusers] Torque not honoring max_user_queuable In-Reply-To: References: <22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov> <93149F70-EFCC-44F6-9CE3-022C475BEE70@mcs.anl.gov> Message-ID: All jobs go through a routing queue. On Feb 3, 2012, at 11:44 AM, David Beer wrote: > I'm also curious - is this done through a routing queue or routing queues? Is it class remapping in Moab? It looks like it isn't qsub -q > > David > > On Fri, Feb 3, 2012 at 10:15 AM, Ti Leggett wrote: > submit_args = -A CI-MCB000083 -l walltime=48:00:00, > mppwidth=48 /lustre/beagle/linpyl/project.qsub > > On Feb 3, 2012, at 11:03 AM, David Beer wrote: > > > If you qstat -f a few of the jobs you can see the submit arguments. At higher log levels the entire job submission is there, but I don't known if your log levels would be that high. > > > > David > > > > On Fri, Feb 3, 2012 at 9:21 AM, Ti Leggett wrote: > > I'm assuming using qsub, but it's other users doing this so I'm not 100% sure. Is there a way to find out from logs or other tools? > > > > On Feb 3, 2012, at 10:06 AM, David Beer wrote: > > > > > Ti, > > > > > > How are you submitting the jobs? I assume this is TORQUE 2.5.9? 
> > > > > > David > > > > > > On Fri, Feb 3, 2012 at 8:27 AM, Ti Leggett wrote: > > > We've set queue limits that don't seem to be honored: > > > > > > sdb:~ # qstat | grep linpyl | grep batch | wc > > > 945 5670 82215 > > > > > > sdb:~ # qmgr -c "print queue batch" > > > # > > > # Create queues and set their attributes. > > > # > > > # > > > # Create and define queue batch > > > # > > > create queue batch > > > set queue batch queue_type = Execution > > > set queue batch max_user_queuable = 500 > > > set queue batch resources_min.mppwidth = 1 > > > set queue batch resources_default.mppwidth = 24 > > > set queue batch resources_default.walltime = 00:10:00 > > > set queue batch acl_group_enable = False > > > set queue batch resources_available.nodes = 726 > > > set queue batch enabled = True > > > set queue batch started = True > > > > > > How would it be possible for a user to have 945 jobs in the queue when the limit should be 500? > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > > > > > > > > > > -- > > > David Beer | Software Engineer > > > Adaptive Computing > > > > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > > > > > -- > > David Beer | Software Engineer > > Adaptive Computing > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > 
http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > -- > David Beer | Software Engineer > Adaptive Computing > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 203 bytes Desc: Message signed with OpenPGP using GPGMail Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120204/12096961/attachment-0001.bin
From Gareth.Williams at csiro.au Sun Feb 5 21:50:16 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Mon, 6 Feb 2012 15:50:16 +1100
Subject: [torqueusers] node table is corrupt at index 0
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74DE1@exvic-mbx04.nexus.csiro.au>

Hi,

Is anybody familiar with the following server log entry or should I go look at source code :-(
We are having some issues with bad client connections and I went looking at logs. I don't think it is actually related but can't be sure. There's only been a few recently except one day a few weeks ago there were 11545!

Gareth

PBS_Server;Svr;WARNING;ALERT: node table is corrupt at index 0

From pat.callahan at gd-ais.com Sun Feb 5 09:15:56 2012
From: pat.callahan at gd-ais.com (Callahan, Patrick M.)
Date: Sun, 5 Feb 2012 11:15:56 -0500
Subject: [torqueusers] changing column width
Message-ID: <6DDE2978C880C64EB5323275164A30886335144CF2@EADC-E-MABPRD01.ad.gd-ais.com>

You may change the column width (job name) in the source code and recompile. I have done that with the versions we use. Look at either MAXNAMELEN in pbs_ifl.h or PBS_JOBBASE in server_limits.h. I don't have the source code in front of me so ymmv.

Patrick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120205/e3f50827/attachment.html

From s.breedveld at erasmusmc.nl Sun Feb 5 13:03:12 2012
From: s.breedveld at erasmusmc.nl (Sebastiaan Breedveld)
Date: Sun, 05 Feb 2012 21:03:12 +0100
Subject: [torqueusers] unsubscribe
Message-ID: <4F2EE080.8030008@erasmusmc.nl>

unsubscribe

--
Sebastiaan Breedveld, MSc. Ph.D.
student Erasmus MC - Daniel den Hoed Cancer Center Department of Radiation Oncology Groene Hilledijk 301 3075 EA Rotterdam The Netherlands Phone: +31 10 7042693 Room: Gs-20 -------------- next part -------------- A non-text attachment was scrubbed... Name: s_breedveld.vcf Type: text/x-vcard Size: 365 bytes Desc: not available Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120205/cc9dd349/attachment.vcf From dbeer at adaptivecomputing.com Mon Feb 6 10:19:43 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 6 Feb 2012 10:19:43 -0700 Subject: [torqueusers] node table is corrupt at index 0 In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74DE1@exvic-mbx04.nexus.csiro.au> References: <007DECE986B47F4EABF823C1FBB19C620102CDD74DE1@exvic-mbx04.nexus.csiro.au> Message-ID: To me this looks like a non-issue, but the message comes from a function logging bad client connections, so perhaps that is your relation. The error is logged if a node slot or that nodes addresses are NULL, but this condition is permitted elsewhere in the code. I would work on solving the client connections and then see if this causes you issues. David On Sun, Feb 5, 2012 at 9:50 PM, wrote: > Hi, > > Is anybody familiar with the following server log entry or should I go > look at source code :-( > We are having some issues with bad client connections and I went looking > at logs. I don't think it is actually related but can't be sure. There's > only been a few recently except one day a few weeks ago there were 11545! > > Gareth > > PBS_Server;Svr;WARNING;ALERT: node table is corrupt at index 0 > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120206/b113becc/attachment.html From jjc at iastate.edu Mon Feb 6 10:56:54 2012 From: jjc at iastate.edu (Coyle, James J [ITACD]) Date: Mon, 6 Feb 2012 17:56:54 +0000 Subject: [torqueusers] Using Torque as a meta-scheduler over several clusters; any advice? Message-ID: <242421BFAF465844BE24EB90BB97E22101948F66@ITSDAG1D.its.iastate.edu> Currently, we use a different head node for each cluster we deploy. The head node runs the server and scheduler. I've been asked if it possible to schedule several clusters from a single login node. From a hardware standpoint, one could use virtual machines to accomplish this, but the idea was to somehow direct all the jobs from a single login node to somehow make it easier for users. What is being described sounds a lot like a grid. Has anyone done this? Can schedulers run on the "head node" of each clusters with a single pbs_server running on the master head node to interact with the users? Is the way to do this to set specific properties on the nodes from specific clusters (cluster1, cluster2, ...) to use MAUI and to have need_nodes set for different queues small_cluster1 ... ? Has someone "rolled their own" meta-scheduler? Thanks, - Jim James Coyle, PhD High Performance Computing Group 115 Durham Center Iowa State Univ. phone: (515)-294-2099 Ames, Iowa 50011 web: http://jjc.public.iastate.edu/ -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120206/e835ac4b/attachment-0001.html From dbeer at adaptivecomputing.com Mon Feb 6 10:59:42 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 6 Feb 2012 10:59:42 -0700 Subject: [torqueusers] Using Torque as a meta-scheduler over several clusters; any advice? 
In-Reply-To: <242421BFAF465844BE24EB90BB97E22101948F66@ITSDAG1D.its.iastate.edu>
References: <242421BFAF465844BE24EB90BB97E22101948F66@ITSDAG1D.its.iastate.edu>
Message-ID:

James,

I don't know about a free version, but Moab has done grid scheduling for years.

David

On Mon, Feb 6, 2012 at 10:56 AM, Coyle, James J [ITACD] wrote:

> Currently, we use a different head node for each cluster we deploy.
> The head node runs the server and scheduler. I've been asked if it possible to
> schedule several clusters from a single login node.
> From a hardware standpoint, one could use virtual machines to accomplish
> this, but the idea was to somehow direct all the jobs from a single login node
> to somehow make it easier for users.
>
> What is being described sounds a lot like a grid.
>
> Has anyone done this?
>
> Can schedulers run on the "head node" of each clusters with a single pbs_server
> running on the master head node to interact with the users?
>
> Is the way to do this to set specific properties on the nodes from specific clusters
> (cluster1, cluster2, ...) to use MAUI and to have need_nodes set for different queues
> small_cluster1 ... ?
>
> Has someone "rolled their own" meta-scheduler?
>
> Thanks,
> - Jim
>
> James Coyle, PhD
> High Performance Computing Group
> 115 Durham Center
> Iowa State Univ. phone: (515)-294-2099
> Ames, Iowa 50011 web: http://jjc.public.iastate.edu/
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>

--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120206/4f9b76e0/attachment.html From mej at lbl.gov Mon Feb 6 13:21:10 2012 From: mej at lbl.gov (Michael Jennings) Date: Mon, 6 Feb 2012 12:21:10 -0800 Subject: [torqueusers] Using Torque as a meta-scheduler over several clusters; any advice? In-Reply-To: <242421BFAF465844BE24EB90BB97E22101948F66@ITSDAG1D.its.iastate.edu> References: <242421BFAF465844BE24EB90BB97E22101948F66@ITSDAG1D.its.iastate.edu> Message-ID: <20120206202109.GI2104@lbl.gov> On Monday, 06 February 2012, at 17:56:54 (+0000), Coyle, James J [ITACD] wrote: > Currently, we use a different head node for each cluster we deploy. > The head node runs the server and scheduler. I've been asked if it possible to > schedule several clusters from a single login node. > From a hardware standpoint, one could use virtual machines to accomplish > this, but the idea was to somehow direct all the jobs from a single login node > to somehow make it easier for users. > > What is being described sounds a lot like a grid. > > Has anyone done this? We have a single instance of TORQUE and Moab governing roughly 20 clusters, but you can also use Moab in a grid scenario as master/slaves or equal peers. All our "supercluster" clusters all share common interactive nodes, login gateway, NFS- and Lustre-based storage, and master node. > Can schedulers run on the "head node" of each clusters with a single pbs_server > running on the master head node to interact with the users? > > Is the way to do this to set specific properties on the nodes from specific clusters > (cluster1, cluster2, ...) to use MAUI and to have need_nodes set for different queues > small_cluster1 ... ? That's certainly one way to do it. We have one or more queues for each cluster, and the ACLs are set up within TORQUE and Moab to restrict access to those users/groups who own each cluster. > Has someone "rolled their own" meta-scheduler? 
You're likely to spend a lot more money doing this than it would cost for a Moab license. I don't know of any meta-scheduling packages which currently exist, but Moab's architecture supports a wide variety of functionality in this vein. HTH, Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From dbeer at adaptivecomputing.com Mon Feb 6 16:48:30 2012 From: dbeer at adaptivecomputing.com (David Beer) Date: Mon, 6 Feb 2012 16:48:30 -0700 Subject: [torqueusers] Updated Beta Build Message-ID: All, More fixes have been made for the beta, most importantly this build has fixes to the gpu reporting process (for nvidia-enabled gpus). http://www.adaptivecomputing.com/resources/downloads/torque/4.0-beta/torque-4.0.0-snap.201202061640.tar.gz -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120206/5379f81d/attachment.html
From jascha.wang at gmail.com Tue Feb 7 02:33:12 2012
From: jascha.wang at gmail.com (Xiangqian Wang)
Date: Tue, 7 Feb 2012 17:33:12 +0800
Subject: [torqueusers] queue to node mapping is wrong when use '-l procs' option
Message-ID:

My test of the queue-to-node mapping feature of the torque/maui system failed. I use torque 2.5.8 and maui 3.2.6p21. The simple job script contains a procs option:

#!/bin/sh
#PBS -N simple-job
#PBS -l procs=3
#PBS -q fluent
#PBS -d /opt/share/job
cd $PBS_O_WORKDIR
date
sleep 30
date

The 'fluent' queue is mapped to a node 'cnode01' with 4 processors. The settings are shown below:

# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Create and define queue fluent
#
create queue fluent
set queue fluent queue_type = Execution
set queue fluent acl_host_enable = False
set queue fluent acl_hosts = cnode01
set queue fluent enabled = True
set queue fluent started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = snode01
set server acl_roots = root@*
set server managers = root at snode01
set server operators = root at snode01
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server auto_node_np = True
set server next_job_number = 94
set server display_job_server_suffix = False

The job should use only the node 'cnode01', but the allocated nodes include another node. See part of the 'qstat -f' output:

exec_host = snode01/1+snode01/0+cnode01/0
...
Resource_List.neednodes = cnode01 Resource_List.procs = 3 Can anyone give me some suggestion, it'll be greatly appreciated. Xiangqian -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120207/a8f33cb5/attachment-0001.html From sm4082 at nyu.edu Tue Feb 7 06:40:45 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Tue, 7 Feb 2012 08:40:45 -0500 Subject: [torqueusers] queue to node mapping is wrong when use '-l procs' option In-Reply-To: References: Message-ID: <434A0810-1C69-408F-9372-484586F7CF4D@nyu.edu> Hi, Instead of using > set queue fluent acl_host_enable = False > set queue fluent acl_hosts = cnode01 I set a feature to the node I wanted my jobs to run or wanted it to be under a special queue, I gave a certain feature to the nodes and put it in the pbs script like this: #PBS -l feature= Moab can put the jobs on the nodes with those features. I'm not sure how maui does it. I have a qsub wrapper that adds this feature line depending on users' requests. To give features to nodes, I used qmgr -c 'set node properties += ' For example, our p48 nodes have features like chassis0, chassis1, etc to indicate the chassis they belong to. Since we are asking for a specific queue with specific features, jobs always go onto right nodes with right feature. Sreedhar. On Feb 7, 2012, at 4:33 AM, Xiangqian Wang wrote: > I failed to test queue to node mapping feature of torque/maui system, I use torque 2.5.8 and maui 3.2.6p21. 
the simple job script contains a procs option: > > #!/bin/sh > #PBS -N simple-job > #PBS -l procs=3 > #PBS -q fluent > #PBS -d /opt/share/job > cd $PBS_O_WORKDIR > date > sleep 30 > date > > The 'fluent' queue is mapped to a node 'cnode01' with 4 processors, the setting is shown below: > > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 01:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Create and define queue fluent > # > create queue fluent > set queue fluent queue_type = Execution > set queue fluent acl_host_enable = False > set queue fluent acl_hosts = cnode01 > set queue fluent enabled = True > set queue fluent started = True > # > # Set server attributes. > # > set server scheduling = True > set server acl_hosts = snode01 > set server acl_roots = root@* > set server managers = root at snode01 > set server operators = root at snode01 > set server default_queue = batch > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server mom_job_sync = True > set server keep_completed = 300 > set server auto_node_np = True > set server next_job_number = 94 > set server display_job_server_suffix = False > > The job should use a single node 'cnode01' , while the allocated node contains another node. see part of 'qstat -f' output: > > exec_host = snode01/1+snode01/0+cnode01/0 > ... > Resource_List.neednodes = cnode01 > Resource_List.procs = 3 > > Can anyone give me some suggestion, it'll be greatly appreciated. 
> > Xiangqian > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers --- Sreedhar Manchu HPC Support Specialist New York University 251 Mercer Street New York, NY 10012-1110 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120207/56ef80c5/attachment.html From david at unistra.fr Tue Feb 7 14:27:29 2012 From: david at unistra.fr (R. David) Date: Tue, 7 Feb 2012 22:27:29 +0100 Subject: [torqueusers] queue to node mapping is wrong when use '-l procs' option In-Reply-To: References: Message-ID: <5692CC52-F8C0-4002-A129-07EF722F6B74@unistra.fr> On 7 Feb 2012, at 10:33, Xiangqian Wang wrote: > I failed to test queue to node mapping feature of torque/maui system, I use torque 2.5.8 and maui 3.2.6p21. the simple job script contains a procs option: > > #!/bin/sh > #PBS -N simple-job > #PBS -l procs=3 > #PBS -q fluent > #PBS -d /opt/share/job > cd $PBS_O_WORKDIR > date > sleep 30 > date > > The 'fluent' queue is mapped to a node 'cnode01' with 4 processors, the setting is shown below: > [...] > The job should use a single node 'cnode01' , while the allocated node contains another node. see part of 'qstat -f' output: > > exec_host = snode01/1+snode01/0+cnode01/0 > ... > Resource_List.neednodes = cnode01 > Resource_List.procs = 3 > > Can anyone give me some suggestion, it'll be greatly appreciated. You should probably use the -l nodes=XX rather than procs=XX Depending on how you configured maui, you will have to write nodes=XX:ppn=YY or just nodes=XX Regards, R. 
David From jascha.wang at gmail.com Tue Feb 7 18:51:18 2012 From: jascha.wang at gmail.com (Xiangqian Wang) Date: Wed, 8 Feb 2012 09:51:18 +0800 Subject: [torqueusers] queue to node mapping is wrong when use '-l procs' option In-Reply-To: <434A0810-1C69-408F-9372-484586F7CF4D@nyu.edu> References: <434A0810-1C69-408F-9372-484586F7CF4D@nyu.edu> Message-ID: it seems that '-l feature' option has no effect for maui-3.2.6p21, the jobs runs on nodes without the feature requested. 2012/2/7 Sreedhar Manchu > Hi, > > Instead of using > > set queue fluent acl_host_enable = False > set queue fluent acl_hosts = cnode01 > > > I set a feature to the node I wanted my jobs to run or wanted it to be > under a special queue, I gave a certain feature to the nodes and put it in > the pbs script like this: > > #PBS -l feature= > > Moab can put the jobs on the nodes with those features. I'm not sure how > maui does it. I have a qsub wrapper that adds this feature line depending > on users' requests. > > To give features to nodes, I used > > qmgr -c 'set node properties += ' > > For example, our p48 nodes have features like chassis0, chassis1, etc to > indicate the chassis they belong to. Since we are asking for a specific > queue with specific features, jobs always go onto right nodes with right > feature. > > Sreedhar. > > On Feb 7, 2012, at 4:33 AM, Xiangqian Wang wrote: > > I failed to test queue to node mapping feature of torque/maui system, I > use torque 2.5.8 and maui 3.2.6p21. 
the simple job script contains a procs > option: > > #!/bin/sh > #PBS -N simple-job > #PBS -l procs=3 > #PBS -q fluent > #PBS -d /opt/share/job > cd $PBS_O_WORKDIR > date > sleep 30 > date > > The 'fluent' queue is mapped to a node 'cnode01' with 4 processors, the > setting is shown below: > > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 01:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Create and define queue fluent > # > create queue fluent > set queue fluent queue_type = Execution > set queue fluent acl_host_enable = False > set queue fluent acl_hosts = cnode01 > set queue fluent enabled = True > set queue fluent started = True > # > # Set server attributes. > # > set server scheduling = True > set server acl_hosts = snode01 > set server acl_roots = root@* > set server managers = root at snode01 > set server operators = root at snode01 > set server default_queue = batch > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server mom_job_sync = True > set server keep_completed = 300 > set server auto_node_np = True > set server next_job_number = 94 > set server display_job_server_suffix = False > > The job should use a single node 'cnode01' , while the allocated node > contains another node. see part of 'qstat -f' output: > > exec_host = snode01/1+snode01/0+cnode01/0 > ... > Resource_List.neednodes = cnode01 > Resource_List.procs = 3 > > Can anyone give me some suggestion, it'll be greatly appreciated. 
> > Xiangqian > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > --- > Sreedhar Manchu > HPC Support Specialist > New York University > 251 Mercer Street > New York, NY 10012-1110 > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120208/6fa0eac3/attachment.html From jascha.wang at gmail.com Tue Feb 7 19:02:50 2012 From: jascha.wang at gmail.com (Xiangqian Wang) Date: Wed, 8 Feb 2012 10:02:50 +0800 Subject: [torqueusers] queue to node mapping is wrong when use '-l procs' option In-Reply-To: <5692CC52-F8C0-4002-A129-07EF722F6B74@unistra.fr> References: <5692CC52-F8C0-4002-A129-07EF722F6B74@unistra.fr> Message-ID: Hello David, I also test the '-l nodes:ppn' option, the node allocation is right, as you said. For '-l procs' option and queue to node mapping scenario, I wonder if it is possible to configure it to work? Xiangqian 2012/2/8 R. David > > On 7 Feb 2012, at 10:33, Xiangqian Wang wrote: > > > I failed to test queue to node mapping feature of torque/maui system, I > use torque 2.5.8 and maui 3.2.6p21. the simple job script contains a procs > option: > > > > #!/bin/sh > > #PBS -N simple-job > > #PBS -l procs=3 > > #PBS -q fluent > > #PBS -d /opt/share/job > > cd $PBS_O_WORKDIR > > date > > sleep 30 > > date > > > > The 'fluent' queue is mapped to a node 'cnode01' with 4 processors, the > setting is shown below: > > > > [...] > > > The job should use a single node 'cnode01' , while the allocated node > contains another node. see part of 'qstat -f' output: > > > > exec_host = snode01/1+snode01/0+cnode01/0 > > ... 
> > Resource_List.neednodes = cnode01 > > Resource_List.procs = 3 > > > > Can anyone give me some suggestion, it'll be greatly appreciated. > > You should probably use the -l nodes=XX rather than procs=XX > > Depending on how you configured maui, you will have to write > nodes=XX:ppn=YY or just nodes=XX > > > Regards, > R. David > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120208/b73dc603/attachment-0001.html From Gareth.Williams at csiro.au Tue Feb 7 19:24:30 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Wed, 8 Feb 2012 13:24:30 +1100 Subject: [torqueusers] queue to node mapping is wrong when use '-l procs' option In-Reply-To: References: <5692CC52-F8C0-4002-A129-07EF722F6B74@unistra.fr> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74DF5@exvic-mbx04.nexus.csiro.au> Sounds like there is a bug. You might be able to work around it by using the queue to node mapping described here: http://www.supercluster.org/pipermail/mauiusers/2012-February/004839.html Gareth > I failed to test queue to node mapping feature of torque/maui system, I use torque 2.5.8 and maui 3.2.6p21. 
the simple job script contains a procs option: > > #!/bin/sh > #PBS -N simple-job > #PBS -l procs=3 > #PBS -q fluent > #PBS -d /opt/share/job > cd $PBS_O_WORKDIR > date > sleep 30 > date -snip- ______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From U0850037 at hud.ac.uk Wed Feb 8 06:57:20 2012 From: U0850037 at hud.ac.uk (Ibad Kureshi U0850037) Date: Wed, 8 Feb 2012 13:57:20 +0000 Subject: [torqueusers] Step Change in Job Arrays Message-ID: Hello, I was wondering is someone could tell me how to adjust the step size in a job array. We are running Torque 2.5.7 with the PBS_SCHEDD on a small cluster and our users want to submit arrays. One the SGE and the Moab/Torque based systems $ -t 1-20:2 or #PBS -t 1-20:2 respectively, gives them 10 jobs with even ID numbers. How can this be done with Torque? It throws out "qsub: Bad Job Array Request" error Have not been able to find much literature on this. Thanks -Ibad Kureshi --- This transmission is confidential and may be legally privileged. If you receive it in error, please notify us immediately by e-mail and remove it from your system. If the content of this e-mail does not relate to the business of the University of Huddersfield, then we do not endorse it and will accept no liability. From glen.beane at gmail.com Wed Feb 8 07:02:52 2012 From: glen.beane at gmail.com (Glen Beane) Date: Wed, 8 Feb 2012 09:02:52 -0500 Subject: [torqueusers] Step Change in Job Arrays In-Reply-To: References: Message-ID: On Wed, Feb 8, 2012 at 8:57 AM, Ibad Kureshi U0850037 wrote: > Hello, > > I was wondering is someone could tell me how to adjust the step size in a job array. We are running Torque 2.5.7 with the PBS_SCHEDD on a small cluster and our users want to submit arrays. > > One the SGE and the Moab/Torque based systems > > $ -t 1-20:2 > > or > > #PBS -t 1-20:2 > > respectively, gives them 10 jobs with even ID numbers. 
> > How can this be done with Torque? It throws out "qsub: Bad Job Array Request" error > > Have not been able to find much literature on this. > > Thanks this is not currently supported, but it is a great feature request. unfortunately the only option would be to explicitly specify each array ID: #PBS -t 2,4,6,8,10 ...20 From acaird at umich.edu Wed Feb 8 07:28:16 2012 From: acaird at umich.edu (Andrew Caird) Date: Wed, 8 Feb 2012 09:28:16 -0500 Subject: [torqueusers] Step Change in Job Arrays In-Reply-To: References: Message-ID: On Wed, Feb 8, 2012 at 9:02 AM, Glen Beane wrote: > On Wed, Feb 8, 2012 at 8:57 AM, Ibad Kureshi U0850037 > wrote: > > Hello, > > > > I was wondering is someone could tell me how to adjust the step size in > a job array. We are running Torque 2.5.7 with the PBS_SCHEDD on a small > cluster and our users want to submit arrays. > > > > One the SGE and the Moab/Torque based systems > > > > $ -t 1-20:2 > > > > or > > > > #PBS -t 1-20:2 > > > > respectively, gives them 10 jobs with even ID numbers. > > > > How can this be done with Torque? It throws out "qsub: Bad Job Array > Request" error > > > > Have not been able to find much literature on this. > > > > Thanks > > > this is not currently supported, but it is a great feature request. > > unfortunately the only option would be to explicitly specify each array ID: > > #PBS -t 2,4,6,8,10 ...20 Or: qsub -t `seq -s, 2 2 20` pbsfile.txt in case you don't want to type all the numbers. --andy -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120208/d6dad0a5/attachment.html From U0850037 at hud.ac.uk Wed Feb 8 07:47:36 2012 From: U0850037 at hud.ac.uk (Ibad Kureshi U0850037) Date: Wed, 8 Feb 2012 14:47:36 +0000 Subject: [torqueusers] Step Change in Job Arrays In-Reply-To: References: , Message-ID: Thanks Glen, Andy Andy: Nice! 
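A minimal sketch of the `seq` workaround quoted above, assuming GNU coreutils `seq` (other implementations may print a trailing separator). `stepped_ids` is a hypothetical helper name, not a Torque command; `pbsfile.txt` is the script name from Andy's example.

```shell
#!/bin/sh
# Torque 2.5.x rejects "-t 1-20:2", so build the explicit ID list instead.
# stepped_ids is a hypothetical helper, not part of Torque.
stepped_ids() {
    # $1 = first ID, $2 = step, $3 = last ID
    # -s, joins the generated numbers with commas (GNU seq behaviour assumed).
    seq -s, "$1" "$2" "$3"
}

ids=$(stepped_ids 2 2 20)
echo "array IDs: $ids"

# Submission would then look like (not run here):
#   qsub -t "$ids" pbsfile.txt
```

Keeping the list in a variable makes it easy to inspect before passing it to `qsub -t`.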
-Ibad ________________________________________ From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] On Behalf Of Andrew Caird [acaird at umich.edu] Sent: Wednesday, February 08, 2012 2:28 PM To: Torque Users Mailing List Subject: Re: [torqueusers] Step Change in Job Arrays On Wed, Feb 8, 2012 at 9:02 AM, Glen Beane > wrote: On Wed, Feb 8, 2012 at 8:57 AM, Ibad Kureshi U0850037 > wrote: > Hello, > > I was wondering is someone could tell me how to adjust the step size in a job array. We are running Torque 2.5.7 with the PBS_SCHEDD on a small cluster and our users want to submit arrays. > > One the SGE and the Moab/Torque based systems > > $ -t 1-20:2 > > or > > #PBS -t 1-20:2 > > respectively, gives them 10 jobs with even ID numbers. > > How can this be done with Torque? It throws out "qsub: Bad Job Array Request" error > > Have not been able to find much literature on this. > > Thanks this is not currently supported, but it is a great feature request. unfortunately the only option would be to explicitly specify each array ID: #PBS -t 2,4,6,8,10 ...20 Or: qsub -t `seq -s, 2 2 20` pbsfile.txt in case you don't want to type all the numbers. --andy --- This transmission is confidential and may be legally privileged. If you receive it in error, please notify us immediately by e-mail and remove it from your system. If the content of this e-mail does not relate to the business of the University of Huddersfield, then we do not endorse it and will accept no liability. From R.M.Krug at gmail.com Thu Feb 9 02:16:07 2012 From: R.M.Krug at gmail.com (Rainer M Krug) Date: Thu, 09 Feb 2012 10:16:07 +0100 Subject: [torqueusers] Specifying nodes which can be used in array job Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi assuming I have cluster of 10 nodes (node01, ... node10), of which I am not the administrator. Some nodes are setup slightly different, so that a certain job only runs on nodes node01 to node05. 
So I would like to submit an array job and specify "only use the node01, node02, node03, node04 or node05 to run the each individual job". How can I do that? I know that I can use -l to specify resource requirements, but if I specify nodes=..., *each* job will allocate *all* nodes for the job, which is not what I want - each individual job should run on one of the nodes. so: qsub the_script.sub -t 1-10 and how do I specify the nodes? Thanks, Rainer -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk8zjtcACgkQoYgNqgF2egpfBwCfdntKs0vrjLQzJP5soVA0s4+5 Ui4An2IoPirSo0oBKd/CRWRmo1paHGD+ =1dMi -----END PGP SIGNATURE----- From jascha.wang at gmail.com Thu Feb 9 03:17:53 2012 From: jascha.wang at gmail.com (Xiangqian Wang) Date: Thu, 9 Feb 2012 18:17:53 +0800 Subject: [torqueusers] queue to node mapping is wrong when use '-l procs' option In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74DF5@exvic-mbx04.nexus.csiro.au> References: <5692CC52-F8C0-4002-A129-07EF722F6B74@unistra.fr> <007DECE986B47F4EABF823C1FBB19C620102CDD74DF5@exvic-mbx04.nexus.csiro.au> Message-ID: Thank you, Gareth It seems that adding standing reservation in maui config has the same effect to my torque config before. What I did is: add following configuration to maui.cfg: SRCFG[queue01] CLASSLIST=queue01 SRCFG[queue01] NODEFEATURES=queue01 SRCFG[queue01] PERIOD=INFINITY SRCFG[queue01] RESOURCES=PROCS:1 and create a queue named 'queue01', set feature 'queue01' to a node 'cnode01' When I submit job through queue01, the cnode01 is not allocated. Seems that it's a bug of maui. Xiangqian 2012/2/8 > Sounds like there is a bug. > > You might be able to work around it by using the queue to node mapping > described here: > http://www.supercluster.org/pipermail/mauiusers/2012-February/004839.html > > Gareth > > > I failed to test queue to node mapping feature of torque/maui system, I > use torque 2.5.8 and maui 3.2.6p21. 
the simple job script contains a procs > option: > > > > #!/bin/sh > > #PBS -N simple-job > > #PBS -l procs=3 > > #PBS -q fluent > > #PBS -d /opt/share/job > > cd $PBS_O_WORKDIR > > date > > sleep 30 > > date > > -snip- > ______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120209/8110218d/attachment-0001.html From jascha.wang at gmail.com Thu Feb 9 03:33:21 2012 From: jascha.wang at gmail.com (Xiangqian Wang) Date: Thu, 9 Feb 2012 18:33:21 +0800 Subject: [torqueusers] How to set the calling interval of prologue script when job queued by it Message-ID: I need a prologue script to ensure some preparation is done before my job starts, here is my simple script file: #!/bin/sh if [ -f /opt/share/prepared ] then echo `date` ": ready" exit 0 fi echo `date` ": not ready" exit 2 Using the following job script, i can prevent the job from running before the preparation file comes up. 
#!/bin/sh #PBS -N prologure-job #PBS -l nodes=snode01 #PBS -l prologue=/opt/share/shell/prologue.scs #PBS -q batch #PBS -d /opt/share/job #PBS -p 10 #PBS -o $PBS_JOBID.o #PBS -e $PBS_JOBID.e # cd $PBS_O_WORKDIR date ping localhost -c 20 date But what i'm not satisfied is that the prologue script is called frequently when the job is queued, approximately 1 second after the other, see my job output file: Thu Feb 9 18:07:32 CST 2012 : not ready Thu Feb 9 18:07:33 CST 2012 : not ready Thu Feb 9 18:07:34 CST 2012 : not ready Thu Feb 9 18:07:35 CST 2012 : not ready Thu Feb 9 18:07:36 CST 2012 : not ready Thu Feb 9 18:07:37 CST 2012 : not ready Thu Feb 9 18:07:38 CST 2012 : not ready Thu Feb 9 18:07:39 CST 2012 : not ready Thu Feb 9 18:07:40 CST 2012 : not ready Thu Feb 9 18:07:41 CST 2012 : not ready Thu Feb 9 18:07:42 CST 2012 : not ready ... and the job state switch between 'Q' and 'R', irregularly. Now what i want to know is: 1. how to set a longer interval of calling the prologure script, maybe 5+ minutes is OK? 2. is it normal that the job state switch between 'Q' and 'R', shouldn't it always be 'Q'? Thanks for your concern. Xiangqian -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120209/702fdce3/attachment.html From Gareth.Williams at csiro.au Thu Feb 9 03:36:16 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Thu, 9 Feb 2012 21:36:16 +1100 Subject: [torqueusers] queue to node mapping is wrong when use '-l procs' option In-Reply-To: References: <5692CC52-F8C0-4002-A129-07EF722F6B74@unistra.fr> <007DECE986B47F4EABF823C1FBB19C620102CDD74DF5@exvic-mbx04.nexus.csiro.au> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74E05@exvic-mbx04.nexus.csiro.au> Xiangqian, You also need: CLASSCFG[queue01] DEFAULT.FEATURES=queue01 To make the queue go to the nodes with the feature. The SRCFG part stops other jobs going to the nodes with the feature. 
I guess the job went somewhere else. If it was stuck you could run checkjob -v to see why (I think that is OK in maui). This is turning into a maui discussion so you might want to move to the mauiusers list if you need to continue. Also, plain text is best for mailing lists. Html is hard to quote sensibly - sorry for the top post :) Gareth From: Xiangqian Wang [mailto:jascha.wang at gmail.com] Sent: Thursday, 9 February 2012 9:18 PM To: Torque Users Mailing List Subject: Re: [torqueusers] queue to node mapping is wrong when use '-l procs' option Thank you, Gareth It seems that adding standing reservation in maui config has the same effect to my torque config before. What I did is: add following configuration to maui.cfg: SRCFG[queue01] CLASSLIST=queue01 SRCFG[queue01] NODEFEATURES=queue01 SRCFG[queue01] PERIOD=INFINITY SRCFG[queue01] RESOURCES=PROCS:1 and create a queue named 'queue01', set feature 'queue01' to a node 'cnode01' When I submit job through queue01, the cnode01 is not allocated. Seems that it's a bug of maui. Xiangqian 2012/2/8 Sounds like there is a bug. You might be able to work around it by using the queue to node mapping described here: http://www.supercluster.org/pipermail/mauiusers/2012-February/004839.html Gareth > I failed to test queue to node mapping feature of torque/maui system, I use torque 2.5.8 and maui 3.2.6p21. the simple job script contains a procs option: > > #!/bin/sh > #PBS -N simple-job > #PBS -l procs=3 > #PBS -q fluent > #PBS -d /opt/share/job > cd $PBS_O_WORKDIR > date > sleep 30 > date -snip- ______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120209/cdc4b62b/attachment.html From Michael.Zulauf at iberdrolaren.com Thu Feb 9 11:30:09 2012 From: Michael.Zulauf at iberdrolaren.com (Zulauf, Michael) Date: Thu, 9 Feb 2012 10:30:09 -0800 Subject: [torqueusers] problem with jobs sharing cores Message-ID: Hi all. . . I apologize if this message appears more than once - there was an issue with my email address and list registration (which I hope is now fixed), and so I'm having to resend this. . . Anyway, where I work, we've had a problem for a while that we haven't been able to resolve. I'm not certain of the cause - if it's related to Torque, or Maui, or something else. But here goes. . . We've got a small cluster of 16 nodes, each with dual hex-core processors. 12 cores per node, 192 cores total. The problem is that if I launch small jobs, where multiple jobs should be able to share a node without sharing cores, I instead get cores that are running more than one process, while other cores are idle. The primary executable is WRF (weather prediction model), but the problem occurs for other parallel codes. The codes have been built to utilize MPI (not OpenMP, or MPI/OpenMP). As an example, if I launch a series of jobs which request 4 cores each, I get 3 jobs assigned to each node. That should be fine, as each node has 12 cores, and there should be no need to share cores. Instead, I get 4 "overloaded" cores (each running 3 processes) and 8 idle cores. Obviously not an ideal situation. If I submit only a single small job, in which case it's alone on a node, then it runs great. Similarly, if I launch a large job which spans more than one node, it also works well - as long as it's not sharing nodes with other jobs. The problem only occurs (and always occurs) when parallel jobs share a node. BTW, the qsub command does not explicitly request specific cores, or anything like that. I'm not the administrator - just the primary user. 
The administrator (who was not previously familiar with Torque/Maui) has been struggling with this for a bit, and is rather busy with other duties, so I thought I'd check in here to see if anybody had suggestions I could pass along. Here are some specifics, as far as I know them: HP blade hardware dual Intel Xeon X5670 processors Infiniband interconnect (not an issue in this case?) the CentOS equivalent of Red Hat 4.1.2-48 (not sure of what that is exactly) Torque 3.0.2 mvapich2-1.7rc1 PGI7.2-5 compilers WRF 3.3.1 Any thoughts? I've probably left out relevant information. If so, please ask for clarification. Thanks, Mike -- Mike Zulauf Meteorologist, Lead Senior Asset Optimization Iberdrola Renewables 1125 NW Couch, Suite 700 Portland, OR 97209 Office: 503-478-6304 Cell: 503-913-0403 This message is intended for the exclusive attention of the recipient(s) indicated. Any information contained herein is strictly confidential and privileged. If you are not the intended recipient, please notify us by return e-mail and delete this message from your computer system. Any unauthorized use, reproduction, alteration, filing or sending of this message and/or any attached files may lead to legal action being taken against the party(ies) responsible for said unauthorized use. Any opinion expressed herein is solely that of the author(s) and does not necessarily represent the opinion of the Company. The sender does not guarantee the integrity, speed or safety of this message, and does not accept responsibility for any possible damage arising from the interception, incorporation of viruses, or any other damage as a result of manipulation. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120209/ec5549a6/attachment-0001.html From christina.salls at noaa.gov Thu Feb 9 12:47:05 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Thu, 9 Feb 2012 14:47:05 -0500 Subject: [torqueusers] simple startup troubleshooting Message-ID: Hi all, I am new to Torque. In fact, I have just installed torque-2.5.9 (server) on the head node of a 20 node cluster and torque client and mom packages on the compute nodes. I used the Torque Administrator's Guide and the installation process seemed to proceed smoothly (on my second attempt). My first attempt was complicated by the fact that PBS was pre-installed on both the head node and server and seemed to be getting in my way because of processes that were already running and ports that were already in use. I removed everything I could find of the PBS installation and started from scratch. I am stuck at the point where I should be seeing my nodes as free, but they are showing up as down. I am looking for any clues in troubleshooting this problem. I don't know where to start. I am including some information to illustrate my setup. Thanks in advance, Christina Here is the output of the pbsnodes command [root at wings torque-packages]# pbsnodes -a n001 state = down np = 1 ntype = cluster gpus = 0 n002 state = down np = 1 ntype = cluster gpus = 0 n003 state = down np = 1 ntype = cluster gpus = 0 ..... It is the same for all 20 nodes. I truncated it for the sake of brevity. On the headnode: [root at wings server_priv]# ping n001 PING n001.default.domain (10.0.1.1) 56(84) bytes of data. 
64 bytes from n001.default.domain (10.0.1.1): icmp_seq=1 ttl=64 time=0.193 ms 64 bytes from n001.default.domain (10.0.1.1): icmp_seq=2 ttl=64 time=0.189 ms [root at wings server_priv]# qmgr Max open servers: 10239 Qmgr: list server Server wings.glerl.noaa.gov server_state = Active scheduling = True total_jobs = 0 state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 acl_hosts = wings.glerl.noaa.gov default_queue = batch log_events = 511 mail_from = adm scheduler_iteration = 600 node_check_rate = 150 tcp_timeout = 6 mom_job_sync = True pbs_version = 2.5.9 keep_completed = 300 next_job_number = 0 net_counter = 4 4 4 Qmgr: list node n001 Node n001 state = down np = 1 ntype = cluster gpus = 0 Qmgr: print node n001 # # Create nodes and set their properties. # # # Create and define node n001 # create node n001 set node n001 state = down set node n001 np = 1 set node n001 ntype = cluster set node n001 gpus = 0 [root at wings server_priv]# ps -ef | grep pbs root 3925 1 0 Feb03 ? 00:03:00 /usr/local/sbin/pbs_mom -q -d /var/spool/torque root 7056 1 0 11:47 ? 00:00:02 pbs_server root 29031 7993 0 12:59 pts/29 00:00:00 grep pbs [root at wings torque-2.5.9]# qmgr -c 'p s' # # Create queues and set their attributes. # # # Create and define queue batch # create queue batch set queue batch queue_type = Execution set queue batch resources_default.nodes = 1 set queue batch resources_default.walltime = 01:00:00 set queue batch enabled = True set queue batch started = True # # Set server attributes. # set server scheduling = True set server acl_hosts = wings.glerl.noaa.gov set server managers = salls at wings.glerl.noaa.gov set server operators = salls at wings.glerl.noaa.gov set server default_queue = batch set server log_events = 511 set server mail_from = adm set server scheduler_iteration = 600 set server node_check_rate = 150 set server tcp_timeout = 6 set server mom_job_sync = True set server keep_completed = 300 From the compute nodes: root 15891 1 0 11:45 ? 
00:00:00 pbs_mom root 16742 16709 0 13:11 pts/0 00:00:00 grep pbs [root at n001 ~]# ping wings PING wings.glerl.noaa.gov (192.94.173.9) 56(84) bytes of data. 64 bytes from wings.glerl.noaa.gov (192.94.173.9): icmp_seq=1 ttl=64 time=0.093 ms 64 bytes from wings.glerl.noaa.gov (192.94.173.9): icmp_seq=2 ttl=64 time=0.165 ms [root at n001 ~]# qmgr Max open servers: 10239 Qmgr: list server Server wings.glerl.noaa.gov server_state = Active scheduling = True total_jobs = 0 state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 acl_hosts = wings.glerl.noaa.gov default_queue = batch log_events = 511 mail_from = adm scheduler_iteration = 600 node_check_rate = 150 tcp_timeout = 6 mom_job_sync = True pbs_version = 2.5.9 keep_completed = 300 next_job_number = 0 net_counter = 6 5 4 Qmgr: list node n001 Node n001 state = down np = 1 ntype = cluster gpus = 0 [root at wings server_priv]# qmgr Max open servers: 10239 Qmgr: print node n001 # # Create nodes and set their properties. # # # Create and define node n001 # create node n001 set node n001 state = down set node n001 np = 1 set node n001 ntype = cluster set node n001 gpus = 0 I am not sure how to proceed at this point. Any help would be appreciated. I wasn't sure what other files or output to include. Let me know if any other information would be useful. -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120209/6e38106e/attachment.html From knielson at adaptivecomputing.com Thu Feb 9 13:19:18 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 09 Feb 2012 13:19:18 -0700 (MST) Subject: [torqueusers] simple startup troubleshooting In-Reply-To: Message-ID: ----- Original Message ----- > From: "Christina Salls" > To: torqueusers at supercluster.org > Cc: "help >> GLERL IT Help" > Sent: Thursday, February 9, 2012 12:47:05 PM > Subject: [torqueusers] simple startup troubleshooting > > > Hi all, > > > I am new to Torque. In fact, I have just installed torque-2.5.9 > (server) on the head node of a 20 node cluster and torque client and > mom packages on the compute nodes. I used the Torque Administrator's > Guide and the installation process seemed to proceed smoothly (on my > second attempt). My first attempt was complicated by the fact that > PBS was pre-installed on both the head node and server and seemed to > be getting in my way because of processes that were already running > and ports that were already in use. I removed everything I could > find of the PBS installation and started from scratch. I am stuck at > the point where I should be seeing my nodes as free, but they are > showing up as down. I am looking for any clues in troubleshooting > this problem. I don't know where to start. I am including some > information to illustrate my setup. > > > Thanks in advance, > > > Christina > > Christina, Something to check is the server_name file on each of the MOM nodes. This should have the host name of where pbs_server is running. That is the $TORQUE_HOME/server_name file. 
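A minimal sketch of the `server_name` check described above, to be run on each MOM node. `check_server_name` is a hypothetical helper, not a Torque command. The path `/var/spool/torque` and the host name `wings.glerl.noaa.gov` are taken from the output earlier in this thread and are assumptions about this particular installation.

```shell
#!/bin/sh
# check_server_name is a hypothetical helper, not part of Torque.
# $1 = TORQUE_HOME directory, $2 = host name where pbs_server runs
check_server_name() {
    file="$1/server_name"
    if [ ! -r "$file" ]; then
        echo "missing: $file"
        return 2
    fi
    # Read the first line and strip any whitespace or trailing newline.
    actual=$(head -n 1 "$file" | tr -d '[:space:]')
    if [ "$actual" = "$2" ]; then
        echo "ok: $actual"
        return 0
    fi
    echo "mismatch: found '$actual', expected '$2'"
    return 1
}

# Example call (values assumed from this thread, adjust for your site):
#   check_server_name /var/spool/torque wings.glerl.noaa.gov
```

If the name differs, correcting the file and restarting `pbs_mom` on that node would be the next thing to try.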
Ken

From knielson at adaptivecomputing.com Thu Feb 9 13:27:59 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 09 Feb 2012 13:27:59 -0700 (MST)
Subject: [torqueusers] problem with jobs sharing cores
In-Reply-To:
Message-ID: <0a281c3b-fb63-4720-965b-4a7a5e28eba0@mail>

----- Original Message -----
> From: "Michael Zulauf"
> To: torqueusers at supercluster.org
> Sent: Thursday, February 9, 2012 11:30:09 AM
> Subject: [torqueusers] problem with jobs sharing cores
>
> Hi all. . .
>
> I apologize if this message appears more than once - there was an
> issue with my email address and list registration (which I hope is
> now fixed), and so I'm having to resend this. . .
>
> Anyway, where I work, we've had a problem for a while that we haven't
> been able to resolve. I'm not certain of the cause - if it's related
> to Torque, or Maui, or something else. But here goes. . .
>
> We've got a small cluster of 16 nodes, each with dual hex-core
> processors. 12 cores per node, 192 cores total. The problem is that
> if I launch small jobs, where multiple jobs should be able to share
> a node without sharing cores, I instead get cores that are running
> more than one process, while other cores are idle. The primary
> executable is WRF (weather prediction model), but the problem occurs
> for other parallel codes. The codes have been built to utilize MPI
> (not OpenMP, or MPI/OpenMP).
>

You do not really schedule cores with Maui/TORQUE or any other scheduler/resource manager. However, there are ways to make sure you get unique cores for your job. In TORQUE, use CPUSETs: http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/3.5linuxcpusets.php

When you set the np count in the nodes file it is not physically tied to the number of processors on the node. It is really a count that says "I have this many execution slots available on this node." By far most nodes are set to the number of cores available. Even then, however, when jobs are scheduled they are managed by the OS, which will run the jobs anywhere it sees fit.

CPUSETs allow the user to reserve 1 or more cores exclusively for their job. Their job will not run outside of the CPUSET and no other processes can use their CPUSET either.

Ken

From sm4082 at nyu.edu Thu Feb 9 13:34:26 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Thu, 9 Feb 2012 15:34:26 -0500
Subject: [torqueusers] simple startup troubleshooting
In-Reply-To:
References:
Message-ID: <4917601D-FC9A-45AE-99CE-BE3FA3795B31@nyu.edu>

Hi Christina,

Recently, I upgraded torque to its latest version, 2.5.10, on our clusters. Our configuration was set up such that our compute nodes couldn't talk to the server node under the server name example.sit.nyu.edu; they have to talk to example.local. So I had to put it in /opt/torque/mom_priv/config as

$pbsserver example.local

Please check your settings against the way your network is set up.

The other thing I did was to restart the pbs_moms on all nodes, and that took care of it. Because of the way it was set up, immediately after a node came alive with the installation it was trying to talk to the server using the server name variable in /opt/torque (it couldn't read the config file because it was copied after the reboot). Once I restarted pbs_mom it picked it up from config and everything was fine.

Sreedhar.

On Feb 9, 2012, at 2:47 PM, Christina Salls wrote:

> Hi all,
>
> I am new to Torque. In fact, I have just installed torque-2.5.9 (server) on the head node of a 20 node cluster and torque client and mom packages on the compute nodes. I used the Torque Administrator's Guide and the installation process seemed to proceed smoothly (on my second attempt). My first attempt was complicated by the fact that PBS was pre-installed on both the head node and server and seemed to be getting in my way because of processes that were already running and ports that were already in use.
I removed everything I could find of the PBS installation and started from scratch. I am stuck at the point where I should be seeing my nodes as free, but they are showing up as down. I am looking for any clues in troubleshooting this problem. I don't know where to start. I am including some information to illustrate my setup.
>
> Thanks in advance,
>
> Christina
>
> Here is the output of the pbsnodes command
>
> [root at wings torque-packages]# pbsnodes -a
> n001
>     state = down
>     np = 1
>     ntype = cluster
>     gpus = 0
>
> n002
>     state = down
>     np = 1
>     ntype = cluster
>     gpus = 0
>
> n003
>     state = down
>     np = 1
>     ntype = cluster
>     gpus = 0
>
> .....
>
> It is the same for all 20 nodes. I truncated it for the sake of brevity.
>
> On the headnode:
>
> [root at wings server_priv]# ping n001
> PING n001.default.domain (10.0.1.1) 56(84) bytes of data.
> 64 bytes from n001.default.domain (10.0.1.1): icmp_seq=1 ttl=64 time=0.193 ms
> 64 bytes from n001.default.domain (10.0.1.1): icmp_seq=2 ttl=64 time=0.189 ms
>
> [root at wings server_priv]# qmgr
> Max open servers: 10239
> Qmgr: list server
> Server wings.glerl.noaa.gov
>     server_state = Active
>     scheduling = True
>     total_jobs = 0
>     state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
>     acl_hosts = wings.glerl.noaa.gov
>     default_queue = batch
>     log_events = 511
>     mail_from = adm
>     scheduler_iteration = 600
>     node_check_rate = 150
>     tcp_timeout = 6
>     mom_job_sync = True
>     pbs_version = 2.5.9
>     keep_completed = 300
>     next_job_number = 0
>     net_counter = 4 4 4
> Qmgr: list node n001
> Node n001
>     state = down
>     np = 1
>     ntype = cluster
>     gpus = 0
> Qmgr: print node n001
> #
> # Create nodes and set their properties.
> #
> #
> # Create and define node n001
> #
> create node n001
> set node n001 state = down
> set node n001 np = 1
> set node n001 ntype = cluster
> set node n001 gpus = 0
>
> [root at wings server_priv]# ps -ef | grep pbs
> root 3925 1 0 Feb03 ? 00:03:00 /usr/local/sbin/pbs_mom -q -d /var/spool/torque
> root 7056 1 0 11:47 ? 00:00:02 pbs_server
> root 29031 7993 0 12:59 pts/29 00:00:00 grep pbs
>
> [root at wings torque-2.5.9]# qmgr -c 'p s'
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 01:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = wings.glerl.noaa.gov
> set server managers = salls at wings.glerl.noaa.gov
> set server operators = salls at wings.glerl.noaa.gov
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server keep_completed = 300
>
> From the compute nodes:
>
> root 15891 1 0 11:45 ? 00:00:00 pbs_mom
> root 16742 16709 0 13:11 pts/0 00:00:00 grep pbs
>
> [root at n001 ~]# ping wings
> PING wings.glerl.noaa.gov (192.94.173.9) 56(84) bytes of data.
> 64 bytes from wings.glerl.noaa.gov (192.94.173.9): icmp_seq=1 ttl=64 time=0.093 ms
> 64 bytes from wings.glerl.noaa.gov (192.94.173.9): icmp_seq=2 ttl=64 time=0.165 ms
>
> [root at n001 ~]# qmgr
> Max open servers: 10239
> Qmgr: list server
> Server wings.glerl.noaa.gov
>     server_state = Active
>     scheduling = True
>     total_jobs = 0
>     state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
>     acl_hosts = wings.glerl.noaa.gov
>     default_queue = batch
>     log_events = 511
>     mail_from = adm
>     scheduler_iteration = 600
>     node_check_rate = 150
>     tcp_timeout = 6
>     mom_job_sync = True
>     pbs_version = 2.5.9
>     keep_completed = 300
>     next_job_number = 0
>     net_counter = 6 5 4
>
> Qmgr: list node n001
> Node n001
>     state = down
>     np = 1
>     ntype = cluster
>     gpus = 0
> [root at wings server_priv]# qmgr
> Max open servers: 10239
> Qmgr: print node n001
> #
> # Create nodes and set their properties.
> #
> #
> # Create and define node n001
> #
> create node n001
> set node n001 state = down
> set node n001 np = 1
> set node n001 ntype = cluster
> set node n001 gpus = 0
>
> I am not sure how to proceed at this point. Any help would be appreciated. I wasn't sure what other files or output to include. Let me know if any other information would be useful.
>
> --
> Christina A. Salls
> GLERL Computer Group
> help.glerl at noaa.gov
> Help Desk x2127
> Christina.Salls at noaa.gov
> Voice Mail 734-741-2446
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120209/b6002ab7/attachment-0001.html

From christina.salls at noaa.gov Thu Feb 9 14:23:10 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Thu, 9 Feb 2012 16:23:10 -0500
Subject: [torqueusers] simple startup troubleshooting
In-Reply-To:
References:
Message-ID:

> Something to check is the server_name file on each of the MOM nodes. This
> should have the host name of where pbs_server is running. That is the
> $TORQUE_HOME/server_name file.
>
> Ken
>

Thanks Ken,

I think the server_name is good. Here is what I have:

[root at n001 ~]# cd /var/spool
[root at n001 spool]# ls
abrt abrt-upload anacron at audit cron cups lpd mail plymouth postfix torque up2date
[root at n001 spool]# cd torque
[root at n001 torque]# ls
aux checkpoint mom_logs mom_priv pbs_environment server_name server_name.new spool undelivered
[root at n001 torque]# more server_name
wings.glerl.noaa.gov
[root at n001 torque]# ping wings.glerl.noaa.gov
PING wings.glerl.noaa.gov (192.94.173.9) 56(84) bytes of data.
64 bytes from wings.glerl.noaa.gov (192.94.173.9): icmp_seq=1 ttl=64 time=0.075 ms
64 bytes from wings.glerl.noaa.gov (192.94.173.9): icmp_seq=2 ttl=64 time=0.190 ms

-- 
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120209/1cd2e5bf/attachment.html

From halmabrazi at idtdna.com Thu Feb 9 14:33:02 2012
From: halmabrazi at idtdna.com (Hakeem Almabrazi)
Date: Thu, 9 Feb 2012 21:33:02 +0000
Subject: [torqueusers] submitting a job (interactively) issue
Message-ID:

Hi,

I have tried to submit a job using the option -I and I got the message

Qsub: waiting for job # to start
Qsub: job # ready

And that is it.

If I qstat I got a message saying the job # is still "R" running ...

It looks like I have lack of understanding on how to use this option but here is my submit job request:

>qsub -l nodes=1 -N jobName -I -v "some parameters" shellScript

If I run the above request without the -I option, it runs fine without any issue.

Someone might ask the question, why I am running it "interactively"?

Well, I want to force the program which issued the request to wait for the result and do something with it after that.

Thank you for your help.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120209/4a70693f/attachment.html

From christina.salls at noaa.gov Thu Feb 9 14:39:27 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Thu, 9 Feb 2012 16:39:27 -0500
Subject: [torqueusers] simple startup troubleshooting
In-Reply-To:
References:
Message-ID:

Ok, I got a suggestion and it worked!!

This is how I got my nodes to a free state:

Qmgr: s n n001,n002,n003,n004,n005,n006,n007,n008,n009,n010,n011,n012,n013,n014,n015,n016,n017,n018,n019,n020 state=free
Qmgr: p n n001,n002,n003
#
# Create nodes and set their properties.
#
#
# Create and define node n001
#
create node n001
set node n001 state = free
set node n001 np = 1
set node n001 ntype = cluster
set node n001 gpus = 0
#
# Create nodes and set their properties.
#
#
# Create and define node n002
#
create node n002
set node n002 state = free
set node n002 np = 1
set node n002 ntype = cluster
set node n002 gpus = 0
#
# Create nodes and set their properties.
#
#
# Create and define node n003
#
create node n003
set node n003 state = free
set node n003 np = 1
set node n003 ntype = cluster
set node n003 gpus = 0
Qmgr: quit

[root at wings ~]# pbsnodes -a
n001
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n002
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n003
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n004
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n005
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n006
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n007
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n008
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n009
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n010
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n011
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n012
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n013
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n014
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n015
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n016
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n017
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n018
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n019
    state = free
    np = 1
    ntype = cluster
    gpus = 0

n020
    state = free
    np = 1
    ntype = cluster
    gpus = 0

Now I have a lot more work to do to figure out submitting jobs, etc....

Thanks to everyone who responded!!

Christina
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120209/7ccfc127/attachment.html

From leggett at mcs.anl.gov Thu Feb 9 14:41:12 2012
From: leggett at mcs.anl.gov (Ti Leggett)
Date: Thu, 9 Feb 2012 15:41:12 -0600
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To:
References: <22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov> <93149F70-EFCC-44F6-9CE3-022C475BEE70@mcs.anl.gov>
Message-ID: <88388003-8D26-4027-931B-19777974EFCC@mcs.anl.gov>

Is there more configuration I need to do to make this effective with a routing queue?

On Feb 4, 2012, at 11:44 AM, Ti Leggett wrote:

> All jobs go through a routing queue.
>
> On Feb 3, 2012, at 11:44 AM, David Beer wrote:
>
>> I'm also curious - is this done through a routing queue or routing queues? Is it class remapping in Moab? It looks like it isn't qsub -q
>>
>> David
>>
>> On Fri, Feb 3, 2012 at 10:15 AM, Ti Leggett wrote:
>> submit_args = -A CI-MCB000083 -l walltime=48:00:00,
>> mppwidth=48 /lustre/beagle/linpyl/project.qsub
>>
>> On Feb 3, 2012, at 11:03 AM, David Beer wrote:
>>
>>> If you qstat -f a few of the jobs you can see the submit arguments. At higher log levels the entire job submission is there, but I don't know if your log levels would be that high.
>>>
>>> David
>>>
>>> On Fri, Feb 3, 2012 at 9:21 AM, Ti Leggett wrote:
>>> I'm assuming using qsub, but it's other users doing this so I'm not 100% sure. Is there a way to find out from logs or other tools?
>>>
>>> On Feb 3, 2012, at 10:06 AM, David Beer wrote:
>>>
>>>> Ti,
>>>>
>>>> How are you submitting the jobs? I assume this is TORQUE 2.5.9?
>>>>
>>>> David
>>>>
>>>> On Fri, Feb 3, 2012 at 8:27 AM, Ti Leggett wrote:
>>>> We've set queue limits that don't seem to be honored:
>>>>
>>>> sdb:~ # qstat | grep linpyl | grep batch | wc
>>>> 945 5670 82215
>>>>
>>>> sdb:~ # qmgr -c "print queue batch"
>>>> #
>>>> # Create queues and set their attributes.
>>>> #
>>>> #
>>>> # Create and define queue batch
>>>> #
>>>> create queue batch
>>>> set queue batch queue_type = Execution
>>>> set queue batch max_user_queuable = 500
>>>> set queue batch resources_min.mppwidth = 1
>>>> set queue batch resources_default.mppwidth = 24
>>>> set queue batch resources_default.walltime = 00:10:00
>>>> set queue batch acl_group_enable = False
>>>> set queue batch resources_available.nodes = 726
>>>> set queue batch enabled = True
>>>> set queue batch started = True
>>>>
>>>> How would it be possible for a user to have 945 jobs in the queue when the limit should be 500?
>>>>
>>>> --
>>>> David Beer | Software Engineer
>>>> Adaptive Computing
>>>
>>> --
>>> David Beer | Software Engineer
>>> Adaptive Computing
>>
>> --
>> David Beer | Software Engineer
>> Adaptive Computing
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 203 bytes
Desc: Message signed with OpenPGP using GPGMail
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120209/466c7202/attachment-0001.bin

From dbeer at adaptivecomputing.com Thu Feb 9 14:59:58 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Thu, 9 Feb 2012 14:59:58 -0700
Subject: [torqueusers] Torque not honoring max_user_queuable
In-Reply-To: <88388003-8D26-4027-931B-19777974EFCC@mcs.anl.gov>
References: <22D6587F-A6E3-4E8F-B277-7273849BC581@mcs.anl.gov> <93149F70-EFCC-44F6-9CE3-022C475BEE70@mcs.anl.gov> <88388003-8D26-4027-931B-19777974EFCC@mcs.anl.gov>
Message-ID:

Ti,

Sorry, I sort of got pulled off onto other things and haven't looked into this further. There shouldn't be further configuration that you need. I will see if I can reproduce this.

David

On Thu, Feb 9, 2012 at 2:41 PM, Ti Leggett wrote:

> Is there more configuration I need to do to make this effective with a
> routing queue?
>
> On Feb 4, 2012, at 11:44 AM, Ti Leggett wrote:
>
> > All jobs go through a routing queue.
> >
> > On Feb 3, 2012, at 11:44 AM, David Beer wrote:
> >
> >> I'm also curious - is this done through a routing queue or routing queues? Is it class remapping in Moab? It looks like it isn't qsub -q
> >>
> >> David
> >>
> >> On Fri, Feb 3, 2012 at 10:15 AM, Ti Leggett wrote:
> >> submit_args = -A CI-MCB000083 -l walltime=48:00:00,
> >> mppwidth=48 /lustre/beagle/linpyl/project.qsub
> >>
> >> On Feb 3, 2012, at 11:03 AM, David Beer wrote:
> >>
> >>> If you qstat -f a few of the jobs you can see the submit arguments. At higher log levels the entire job submission is there, but I don't know if your log levels would be that high.
> >>>
> >>> David
> >>>
> >>> On Fri, Feb 3, 2012 at 9:21 AM, Ti Leggett wrote:
> >>> I'm assuming using qsub, but it's other users doing this so I'm not 100% sure. Is there a way to find out from logs or other tools?
> >>>
> >>> On Feb 3, 2012, at 10:06 AM, David Beer wrote:
> >>>
> >>>> Ti,
> >>>>
> >>>> How are you submitting the jobs? I assume this is TORQUE 2.5.9?
> >>>>
> >>>> David
> >>>>
> >>>> On Fri, Feb 3, 2012 at 8:27 AM, Ti Leggett wrote:
> >>>> We've set queue limits that don't seem to be honored:
> >>>>
> >>>> sdb:~ # qstat | grep linpyl | grep batch | wc
> >>>> 945 5670 82215
> >>>>
> >>>> sdb:~ # qmgr -c "print queue batch"
> >>>> #
> >>>> # Create queues and set their attributes.
> >>>> #
> >>>> #
> >>>> # Create and define queue batch
> >>>> #
> >>>> create queue batch
> >>>> set queue batch queue_type = Execution
> >>>> set queue batch max_user_queuable = 500
> >>>> set queue batch resources_min.mppwidth = 1
> >>>> set queue batch resources_default.mppwidth = 24
> >>>> set queue batch resources_default.walltime = 00:10:00
> >>>> set queue batch acl_group_enable = False
> >>>> set queue batch resources_available.nodes = 726
> >>>> set queue batch enabled = True
> >>>> set queue batch started = True
> >>>>
> >>>> How would it be possible for a user to have 945 jobs in the queue when the limit should be 500?
> >>>>
> >>>> --
> >>>> David Beer | Software Engineer
> >>>> Adaptive Computing
> >>>
> >>> --
> >>> David Beer | Software Engineer
> >>> Adaptive Computing
> >>
> >> --
> >> David Beer | Software Engineer
> >> Adaptive Computing
> >
>

-- 
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120209/4a0f13c9/attachment.html

From knielson at adaptivecomputing.com Thu Feb 9 15:39:52 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 09 Feb 2012 15:39:52 -0700 (MST)
Subject: [torqueusers] Specifying nodes which can be used in array job
In-Reply-To:
Message-ID:

----- Original Message -----
> From: "Rainer M Krug"
> To: torqueusers at supercluster.org
> Sent: Thursday, February 9, 2012 2:16:07 AM
> Subject: [torqueusers] Specifying nodes which can be used in array job
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi
>
> assuming I have cluster of 10 nodes (node01, ... node10), of which I
> am not the administrator.
>
> Some nodes are setup slightly different, so that a certain job only
> runs on nodes node01 to node05.
>
> So I would like to submit an array job and specify "only use the
> node01, node02, node03, node04 or node05 to run the each individual
> job".
>
> How can I do that? I know that I can use -l to specify resource
> requirements, but if I specify nodes=..., *each* job will allocate
> *all* nodes for the job, which is not what I want - each individual
> job should run on one of the nodes.
>
> so:
>
> qsub the_script.sub -t 1-10
>
> and how do I specify the nodes?
>
> Thanks,
>
> Rainer

Rainer,

Are there feature (properties) in the nodes files of those hosts which would allow you to specify a feature on the qsub line?
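
For readers unfamiliar with node features, the idea behind the question can be sketched as below. Everything specific here is an assumption for illustration: "jobok" is a made-up property name, the np=1 values are invented, and nodes_with_property is a hypothetical helper rather than a TORQUE tool. Adding a property to the nodes file needs the administrator.

```shell
#!/bin/sh
# Sketch only: "jobok" is a made-up property, and the nodes-file lines in the
# comments below are an illustrative stand-in for the server's nodes file.

# Print the names of nodes in a TORQUE-style nodes file that carry a property.
# Each non-comment line has the form: <name> [np=N] [property ...]
nodes_with_property() {
    # $1 = nodes file, $2 = property to look for
    awk -v prop="$2" '
        /^[ \t]*(#|$)/ { next }     # skip comment and blank lines
        { for (i = 2; i <= NF; i++) if ($i == prop) { print $1; break } }
    ' "$1"
}

# What the administrator would add, one property word per eligible node:
#   node01 np=1 jobok
#   ...
#   node05 np=1 jobok
#   node06 np=1
#
# With that in place, each array element asks for one node with the property:
#   qsub -t 1-10 -l nodes=1:jobok the_script.sub
```

The point of the property is that nodes=1:jobok requests a single node from the tagged set for each element of the array, instead of naming hosts explicitly.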

Ken

From Gareth.Williams at csiro.au Thu Feb 9 16:04:45 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Fri, 10 Feb 2012 10:04:45 +1100
Subject: [torqueusers] submitting a job (interactively) issue
In-Reply-To:
References:
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74E08@exvic-mbx04.nexus.csiro.au>

> From: Hakeem Almabrazi [mailto:halmabrazi at idtdna.com]
> Sent: Friday, 10 February 2012 8:33 AM
> To: Torque Users Mailing List
> Subject: [torqueusers] submitting a job (interactively) issue

> Hi,
> I have tried to submit a job using the option -I and I got the message
> Qsub: waiting for job # to start
> Qsub: job # ready
> And that is it.
> If I qstat I got a message saying the job # is still "R" running .
> It looks like I have lack of understanding on how to use this option but here is my submit job request:
> >qsub -l nodes=1 -N jobName -I -v "some parameters" shellScript
> If I run the above request without the -I option, it runs fine without any issue.
> Someone might ask the question, why I am running it "interactively"?
> Well, I want to force the program which issued the request to wait for the result and do something with it after that.
> Thank you for your help.

Hi Hakeem,

For an interactive job you have two options:

1) don't supply a script - then you should get an interactive shell session on a compute node when the job starts like:

> qsub -I
qsub: waiting for job 421117.server_host to start
qsub: job 421117.server_host ready

Begin PBS Prologue Fri Feb 10 10:00:31 EST 2012 1328828431
Job ID: 421117.server_host
Username: wil240
Group: asc
Name: STDIN
Resources: neednodes=1,nodes=1,vmem=500mb,walltime=00:10:00
Queue: normal
Nodes: n001
First Node: n001
Fri Feb 10 10:00:32 EST 2012
Directory: /home/asc/wil240
Fri Feb 10 10:00:32 EST 2012
wil240 at n001:~> uname -a
Linux n001 2.6.32.12-0.7-default #1 SMP 2010-05-20 11:14:20 +0200 x86_64 x86_64 x86_64 GNU/Linux
wil240 at n001:~> logout

qsub: job 421117.server_host completed

2) use the -x option and supply a single command (not a script) - like:

wil240 at burnet-login:~> qsub -Ix 'uname -a'
qsub: waiting for job 421120.burnet-srv.idpx.hpsc.csiro.au to start
qsub: job 421120.burnet-srv.idpx.hpsc.csiro.au ready

Begin PBS Prologue Fri Feb 10 10:02:39 EST 2012 1328828559
Job ID: 421120.server_host
Username: wil240
Group: asc
Name: none
Resources: neednodes=1,nodes=1,vmem=500mb,walltime=00:10:00
Queue: normal
Nodes: n001
First Node: n001
Fri Feb 10 10:02:39 EST 2012
Linux n001 2.6.32.12-0.7-default #1 SMP 2010-05-20 11:14:20 +0200 x86_64 x86_64 x86_64 GNU/Linux

qsub: job 421120.server_host completed

I think you want the second option.

Gareth

ps. It would be better to send plain text email to a mailing list (not html).
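
For the stated goal of making the calling program wait, a different route from the two interactive options above is to submit an ordinary batch job and poll its state. The sketch below is an assumption-laden illustration, not something proposed in this thread: job_state and wait_for_job are made-up helper names, and it relies only on qstat -f printing a "job_state = X" line for a known job.

```shell
#!/bin/sh
# Sketch: block a calling script until a batch job has finished by polling qstat.
# job_state and wait_for_job are hypothetical helpers, not TORQUE commands.

# Read `qstat -f` output on stdin and print the job_state letter (Q, R, C, ...).
job_state() {
    sed -n 's/^[[:space:]]*job_state = \([A-Z]\).*/\1/p' | head -n 1
}

# Return once the job is complete (state C) or no longer known to the server.
# Caveat: a failing qstat (for example a server outage) also ends the wait.
wait_for_job() {
    # $1 = job id as printed by qsub, $2 = poll interval in seconds (default 10)
    jobid=$1
    interval=${2:-10}
    while :; do
        state=$(qstat -f "$jobid" 2>/dev/null | job_state)
        case $state in
            ''|C) return 0 ;;
        esac
        sleep "$interval"
    done
}

# Usage on a submit host:
#   jobid=$(qsub -l nodes=1 -N jobName shellScript)
#   wait_for_job "$jobid"
#   # the job's output files can be read from here on
```

Polling keeps the job a normal batch job, so its output still arrives in the usual output and error files rather than on the terminal.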

From jjc at iastate.edu Thu Feb 9 16:20:44 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Thu, 9 Feb 2012 23:20:44 +0000
Subject: [torqueusers] problem with jobs sharing cores
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E2210196746B@ITSDAG1D.its.iastate.edu>

Mike,

We had this issue with OpenMPI and the mca parameter mpi_paffinity_alone. Setting mpi_paffinity_alone gives somewhat better performance than not setting it, due to better cache hits when there is only one job running on a node. However, this places the N mpi processes on cores 0 to N-1, so for 3 four-process MPI programs running on a 12 core node, you would have 3 processes each running on cores 0 through 3.

Doing what you are doing, launching 3 jobs using 4 processes each with openmpi and having mpi_paffinity_alone set on (perhaps by default), would cause exactly the behavior you are seeing: you would have 3 mpi processes rank 0 running on core 0, 3 rank 1 processes running on core 1, etc., and no MPI processes running on cores 4-11.

Perhaps mvapich has a similar mechanism to mpi_paffinity_alone that you are encountering. man mpirun should help you figure this out, or you could ask the cluster admin, or whoever is an expert in using mvapich in your environment.

Below, I have included part of the General run-time tuning portion of the FAQ for OpenMPI from http://www.open-mpi.org/faq/

I hope this helps

- Jim

James Coyle, PhD
High Performance Computing Group
Iowa State Univ.
web: http://jjc.public.iastate.edu/

Open MPI 1.2 offers only crude control, with the MCA parameter "mpi_paffinity_alone". For example:

$ mpirun --mca mpi_paffinity_alone 1 -np 4 a.out

(Just like any other MCA parameter, mpi_paffinity_alone can be set via any of the normal MCA parameter mechanisms.)
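
As a rough illustration of those "normal MCA parameter mechanisms", the sketch below shows three common ways of setting the parameter, here switching it off so that several jobs can share a node. The value 0, the -np 4 count and a.out are placeholders carried over from the FAQ example, not settings taken from the cluster in this thread.

```shell
#!/bin/sh
# Sketch: three usual ways to set an Open MPI MCA parameter, shown here
# disabling mpi_paffinity_alone. a.out and -np 4 are placeholders.

# 1) Environment variable, named OMPI_MCA_<parameter>
export OMPI_MCA_mpi_paffinity_alone=0

# 2) Per-user parameter file (one "name = value" pair per line)
#    echo "mpi_paffinity_alone = 0" >> "$HOME/.openmpi/mca-params.conf"

# 3) Command line, in the same form as the FAQ example
#    mpirun --mca mpi_paffinity_alone 0 -np 4 a.out
```

MVAPICH2 documents a comparable switch, MV2_ENABLE_AFFINITY; whether it explains the behaviour reported in this thread is not established here.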

On each node where your job is running, your job's MPI processes will be bound, one-to-one, in the order of their global MPI ranks, to the lowest-numbered processing units (for example, cores or hardware threads) on the node as identified by the OS. Further, memory affinity will also be enabled if it is supported on the node, as described in a different FAQ entry.

If multiple jobs are launched on the same node in this manner, they will compete for the same processing units and severe performance degradation will likely result. Therefore, this MCA parameter is best used when you know your job will be "alone" on the nodes where it will run.

Since each process is bound to a single processing unit, performance will likely suffer catastrophically if processes are multi-threaded.

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Zulauf, Michael
Sent: Thursday, February 09, 2012 12:30 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] problem with jobs sharing cores

Hi all. . .

I apologize if this message appears more than once - there was an issue with my email address and list registration (which I hope is now fixed), and so I'm having to resend this. . .

Anyway, where I work, we've had a problem for a while that we haven't been able to resolve. I'm not certain of the cause - if it's related to Torque, or Maui, or something else. But here goes. . .

We've got a small cluster of 16 nodes, each with dual hex-core processors. 12 cores per node, 192 cores total. The problem is that if I launch small jobs, where multiple jobs should be able to share a node without sharing cores, I instead get cores that are running more than one process, while other cores are idle. The primary executable is WRF (weather prediction model), but the problem occurs for other parallel codes. The codes have been built to utilize MPI (not OpenMP, or MPI/OpenMP).

As an example, if I launch a series of jobs which request 4 cores each, I get 3 jobs assigned to each node. That should be fine, as each node has 12 cores, and there should be no need to share cores. Instead, I get 4 "overloaded" cores (each running 3 processes) and 8 idle cores. Obviously not an ideal situation.

If I submit only a single small job, in which case it's alone on a node, then it runs great. Similarly, if I launch a large job which spans more than one node, it also works well - as long as it's not sharing nodes with other jobs. The problem only occurs (and always occurs) when parallel jobs share a node. BTW, the qsub command does not explicitly request specific cores, or anything like that.

I'm not the administrator - just the primary user. The administrator (who was not previously familiar with Torque/Maui) has been struggling with this for a bit, and is rather busy with other duties, so I thought I'd check in here to see if anybody had suggestions I could pass along.

Here are some specifics, as far as I know them:

HP blade hardware
dual Intel Xeon X5670 processors
Infiniband interconnect (not an issue in this case?)
the CentOS equivalent of Red Hat 4.1.2-48 (not sure of what that is exactly)
Torque 3.0.2
mvapich2-1.7rc1
PGI7.2-5 compilers
WRF 3.3.1

Any thoughts? I've probably left out relevant information. If so, please ask for clarification.

Thanks,
Mike

-- 
Mike Zulauf
Meteorologist, Lead Senior
Asset Optimization
Iberdrola Renewables
1125 NW Couch, Suite 700
Portland, OR 97209
Office: 503-478-6304 Cell: 503-913-0403

This message is intended for the exclusive attention of the recipient(s) indicated. Any information contained herein is strictly confidential and privileged. If you are not the intended recipient, please notify us by return e-mail and delete this message from your computer system.
Any unauthorized use, reproduction, alteration, filing or sending of this message and/or any attached files may lead to legal action being taken against the party(ies) responsible for said unauthorized use. Any opinion expressed herein is solely that of the author(s) and does not necessarily represent the opinion of the Company. The sender does not guarantee the integrity, speed or safety of this message, and does not accept responsibility for any possible damage arising from the interception, incorporation of viruses, or any other damage as a result of manipulation. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120209/e99e2188/attachment-0001.html From r.m.krug at gmail.com Fri Feb 10 00:49:45 2012 From: r.m.krug at gmail.com (Rainer M Krug) Date: Fri, 10 Feb 2012 08:49:45 +0100 Subject: [torqueusers] Specifying nodes which can be used in array job In-Reply-To: References: Message-ID: <4F34CC19.6070805@gmail.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 09/02/12 23:39, Ken Nielson wrote: > > > ----- Original Message ----- >> From: "Rainer M Krug" To: >> torqueusers at supercluster.org Sent: Thursday, February 9, 2012 >> 2:16:07 AM Subject: [torqueusers] Specifying nodes which can be >> used in array job >> >> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >> >> Hi >> >> assuming I have cluster of 10 nodes (node01, ... node10), of >> which I am not the administrator. >> >> Some nodes are setup slightly different, so that a certain job >> only runs on nodes node01 to node05. >> >> So I would like to submit an array job and specify "only use the >> node01, node02, node03, node04 or node05 to run the each >> individual job". >> >> How can I do that? 
I know that I can use -l to specify resource >> requirements, but if I specify nodes=..., *each* job will >> allocate *all* nodes for the job, which is not what I want - each >> individual job should run on one of the nodes. >> >> so: >> >> qsub the_script.sub -t 1-10 >> >> and how do I specify the nodes? >> >> Thanks, >> >> Rainer > > Rainer, > > Are there feature (properties) in the nodes files of those hosts > which would allow you to specify a feature on the qsub line? No - unfortunately not. > > Ken -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk80zBgACgkQoYgNqgF2egri1wCfUUqDmOigKB8hCyCvt30pu5jZ kewAnjfVc6o7rIjFua0ukEBhkaNe5McS =nBnt -----END PGP SIGNATURE-----
From christina.salls at noaa.gov Fri Feb 10 06:57:02 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Fri, 10 Feb 2012 08:57:02 -0500 Subject: [torqueusers] cluster network configuration Message-ID: Hi all, I am experiencing a problem with my torque server and client connection. My server has an ethernet interface on the public network that is the named server in my torque config. There is a second network interface that is a private network to the cluster on a 10.0.10 network, with a hostname of admin. The compute nodes are only connected to the private network, however the named server in both the head node and the compute nodes is the public interface hostname. A pbsnodes command shows all nodes as down. The log file on the server shows this error: 02/09/2012 18:10:56;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::is_request, bad attempt to connect from 192.94.173.9:1022 (address not trusted - check entry in server_priv/nodes) That address is the public address of the server. I am wondering if I should name the server admin? The only compute nodes that the server will need to access are on the private network. What is the standard way of setting up a single cluster environment?
I am able to ping the compute nodes from the head node and ssh with no authentication. From the compute nodes I am able to ping the head node as well, using the public or private network hostname, and I am able to ssh either to the "wings" interface or the "admin" interface without authentication. It seems like the communication lines are open. Any suggestions would be welcome. Thanks in advance, Christina -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120210/56eeeeb6/attachment.html From fotis at cern.ch Sat Feb 11 12:53:24 2012 From: fotis at cern.ch (Fotis Georgatos) Date: Sat, 11 Feb 2012 21:53:24 +0200 Subject: [torqueusers] problem with jobs sharing cores In-Reply-To: References: Message-ID: <4F36C734.9060602@cern.ch> Hi Mike, I had to debug a problem during last week which appears somewhat related; in short, the mpi stack (openmpi) was intervening in cpu affinity. I was able to solve it in my case with the following line: "mpiexec --report-bindings --cpus-per-rank 4 -np ..." In your case I recommend a check on the equivalent FAQ of your mpi stack like: http://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4 From time to time you would like to check that your scheduler is actually placing jobs on nodes as you imagine it would; this tool would help in this: http://fotis.web.cern.ch/fotis/QTOP/ (tarball works fine in userspace, rpm & repo are available for sysadmins). enjoy, Fotis On 10/02/2012 01:20, torqueusers-request at supercluster.org wrote: > From:torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Zulauf, Michael > Sent: Thursday, February 09, 2012 12:30 PM > To:torqueusers at supercluster.org > Subject: [torqueusers] problem with jobs sharing cores > > Hi all. . . 
> > I apologize if this message appears more than once - there was an issue with my email address and list registration (which I hope is now fixed), and so I'm having to resend this. . . > > Anyway, where I work, we've had a problem for a while that we haven't been able to resolve. I'm not certain of the cause - if it's related to Torque, or Maui, or something else. But here goes. . . > > We've got a small cluster of 16 nodes, each with dual hex-core processors. 12 cores per node, 192 cores total. The problem is that if I launch small jobs, where multiple jobs should be able to share a node without sharing cores, I instead get cores that are running more than one process, while other cores are idle. The primary executable is WRF (weather prediction model), but the problem occurs for other parallel codes. The codes have been built to utilize MPI (not OpenMP, or MPI/OpenMP). > > As an example, if I launch a series of jobs which request 4 cores each, I get 3 jobs assigned to each node. That should be fine, as each node has 12 cores, and there should be no need to share cores. Instead, I get 4 "overloaded" cores (each running 3 processes) and 8 idle cores. Obviously not an ideal situation. If I submit only a single small job, in which case it's alone on a node, then it runs great. Similarly, if I launch a large job which spans more than one node, it also works well - as long as it's not sharing nodes with other jobs. The problem only occurs (and always occurs) when parallel jobs share a node. BTW, the qsub command does not explicitly request specific cores, or anything like that. > > I'm not the administrator - just the primary user. The administrator (who was not previously familiar with Torque/Maui) has been struggling with this for a bit, and is rather busy with other duties, so I thought I'd check in here to see if anybody had suggestions I could pass along. 
> > Here are some specifics, as far as I know them: > HP blade hardware > dual Intel Xeon X5670 processors > Infiniband interconnect (not an issue in this case?) > the CentOS equivalent of Red Hat 4.1.2-48 (not sure of what that is exactly) > Torque 3.0.2 > mvapich2-1.7rc1 > PGI7.2-5 compilers > WRF 3.3.1 > > Any thoughts? I've probably left out relevant information. If so, please ask for clarification. > > Thanks, > Mike > > -- > Mike Zulauf > Meteorologist, Lead Senior > Asset Optimization > Iberdrola Renewables > 1125 NW Couch, Suite 700 > Portland, OR 97209 > Office: 503-478-6304 Cell: 503-913-0403 -- echo "sysadmin know better bash than english" | sed s/min/mins/ \ | sed 's/better bash/bash better/' # Yelling in a CERN forum From jwbacon at tds.net Sat Feb 11 14:32:21 2012 From: jwbacon at tds.net (Jason bacon) Date: Sat, 11 Feb 2012 15:32:21 -0600 Subject: [torqueusers] submitting a job (interactively) issue In-Reply-To: References: Message-ID: <4F36DE65.4010304@tds.net> I've (apparently) had the same issue, but have not found a solution yet. Can you provide the following: 1. Operating system and version 2. Torque version 3. Relevant entries from server_logs on the submit node and mom_logs on the allocated compute node 4. Anything unusual in the system log When I ran into this issue, I found some errors in the logs regarding failed socket connections. I think it might be a permissions issue, but have not had time to investigate yet. -J On 2/9/12 3:33 PM, Hakeem Almabrazi wrote: > > Hi, > > I have tried to submit a job using the option --I and I got the message > > Qsub: waiting for job # to start > > Qsub: job # ready > > And that is it. > > If I qstat I got a message saying the job # is still "R" running ... 
> > It looks like I have lack of understanding on how to use this option > but here is my submit job request: > > >qsub --l nodes=1 --N jobName --I --v "some parameters" shellScript > > If I run the above request without the --I option, it runs fine > without any issue. > > Someone might ask the question, why I am running it "interactively"? > > Well, I want to force the program which issued the request to wait for > the result and do something with it after that. > > Thank you for your help. > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120211/8b9b4ba9/attachment-0001.html From gus at ldeo.columbia.edu Sat Feb 11 16:01:25 2012 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Sat, 11 Feb 2012 18:01:25 -0500 Subject: [torqueusers] submitting a job (interactively) issue In-Reply-To: <4F36DE65.4010304@tds.net> References: <4F36DE65.4010304@tds.net> Message-ID: <3237A3BF-58F8-4ECA-927A-A0A291B28045@ldeo.columbia.edu> Hi Hakeem Not sure if I understood right the issue. Anyway, did the job return a shell prompt to you? That is what it is expected to do. From 'man qsub': " If the -I option is specified on the command line or in a script directive, or if the "interactive" job attribute declared true via the -W option, -W interactive=true, either on the command line or in a script directive, the job is an interactive job. The script will be processed for directives, but will not be included with the job. When the job begins execution, all input to the job is from the terminal session in which qsub is run- ning." Only the #PBS directives in your shell script [if any] would be processed, not the commands. Have you tried to submit the job without the shell script? 
>> qsub -l nodes=1 -N jobName -I -v "some parameters"

This should give you a shell prompt in one of the nodes.

From there you can 'cd' to your work directory [cd $PBS_O_WORKDIR],
and run the shell script [./shellscript].
All I/O is on the terminal, no stderr/stdout files are generated.
At the end just do CTRL-D at the shell prompt to end the job.

Check the details in 'man qsub'.

I hope this helps,
Gus Correa

On Feb 11, 2012, at 4:32 PM, Jason bacon wrote:

>
> I've (apparently) had the same issue, but have not found a solution yet.
>
> Can you provide the following:
>
> 1. Operating system and version
> 2. Torque version
> 3. Relevant entries from server_logs on the submit node and mom_logs on the allocated compute node
> 4. Anything unusual in the system log
>
> When I ran into this issue, I found some errors in the logs regarding failed socket connections. I think it might be a permissions issue, but have not had time to investigate yet.
>
> -J
>
> On 2/9/12 3:33 PM, Hakeem Almabrazi wrote:
>> Hi,
>>
>> I have tried to submit a job using the option -I and I got the message
>>
>> Qsub: waiting for job # to start
>> Qsub: job # ready
>>
>> And that is it.
>>
>> If I qstat I got a message saying the job # is still "R" running ...
>>
>> It looks like I have lack of understanding on how to use this option but here is my submit job request:
>>
>> >qsub -l nodes=1 -N jobName -I -v "some parameters" shellScript
>>
>>
>> If I run the above request without the -I option, it runs fine without any issue.
>>
>> Someone might ask the question, why I am running it "interactively"?
>>
>> Well, I want to force the program which issued the request to wait for the result and do something with it after that.
>>
>> Thank you for your help.
>> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> torqueusers mailing list >> >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From jwbacon at tds.net Sat Feb 11 16:31:10 2012 From: jwbacon at tds.net (Jason bacon) Date: Sat, 11 Feb 2012 17:31:10 -0600 Subject: [torqueusers] submitting a job (interactively) issue In-Reply-To: <3237A3BF-58F8-4ECA-927A-A0A291B28045@ldeo.columbia.edu> References: <4F36DE65.4010304@tds.net> <3237A3BF-58F8-4ECA-927A-A0A291B28045@ldeo.columbia.edu> Message-ID: <4F36FA3E.6000701@tds.net> If I use -I on the command line, my system drops me into a shell on a scheduled node, which seems to be what the man page is describing. However, if I use #PBS -I in a script, the job fails, which seems to contradict the man page. Example: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #!/bin/sh #PBS -I ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In this case, I get the following email: From adm at peregrine.hpc.uwm.edu Sat Feb 11 17:18:09 2012 Date: Sat, 11 Feb 2012 17:18:09 -0600 (CST) From: adm at peregrine.hpc.uwm.edu To: bacon at peregrine.hpc.uwm.edu Subject: PBS JOB 742.peregrine.hpc.uwm.edu PBS Job Id: 742.peregrine.hpc.uwm.edu Job Name: hostname Exec host: compute-02/0 Aborted by PBS Server Job cannot be executed See Administrator for help and error messages in the logs. I haven't considered this a major issue. Instead, I've worked around the issue using the following script, which waits until the job ends and then shows the output. It doesn't allow interaction, but if all you need is to see the output when the job is done, it works fine. 
#!/bin/sh
# Script: qsubw
# Submit a job, wait for it to complete, and show the output

if [ $# -ne 1 ]; then
    printf 'Usage: %s script\n' "$0"
    exit 1
fi
script=$1

# Strip out job name and output file options, so that the default
# <jobname>.o<id> and <jobname>.e<id> file names are used
# FIXME: This will strip out other options on the same line
egrep -v '#PBS -[Noe]' "$script" > "$script.tmp"

job_id=`qsub "$script.tmp" | cut -d '.' -f 1`
printf ' Job ID = %s\n' "$job_id"
qstat -f "$job_id" | fgrep exec_host

# Poll the job state (5th column of qstat) until it is 'C', or until
# the job no longer shows up in qstat at all
state=Q
while [ -n "$state" ] && [ "$state" != 'C' ]; do
    sleep 1
    state=`qstat | grep "^$job_id" | awk '{ print $5 }'`
done

# Without -N, the job name defaults to the base name of the submitted
# file (see -N in 'man qsub'), so this also works for scripts that are
# not named *.pbs
name=`basename "$script.tmp"`
for file in "$name.o$job_id" "$name.e$job_id"; do
    if [ -s "$file" ]; then
        printf '\n%s:\n' "$file"
        cat "$file"
    fi
    rm -f "$file"
done

-J

On 2/11/12 5:01 PM, Gustavo Correa wrote:
> Hi Hakeem
>
> Not sure if I understood right the issue.
> Anyway, did the job return a shell prompt to you?
> That is what it is expected to do.
>
>
> From 'man qsub':
>
> " If the -I option is specified on the command line or in a script
> directive, or if the "interactive" job attribute declared true
> via the -W option, -W interactive=true, either on the command
> line or in a script directive, the job is an interactive job.
> The script will be processed for directives, but will not be
> included with the job. When the job begins execution, all input
> to the job is from the terminal session in which qsub is running."
>
> Only the #PBS directives in your shell script [if any] would be processed, not the commands.
>
> Have you tried to submit the job without the shell script?
>
>>> qsub -l nodes=1 -N jobName -I -v "some parameters"
>
> This should give you a shell prompt in one of the nodes.
>
> From there you can 'cd' to your work directory [cd $PBS_O_WORKDIR],
> and run the shell script [./shellscript].
> All I/O is on the terminal, no stderr/stdout files are generated.
> At the end just do CTRL-D at the shell prompt to end the job.
>
> Check the details in 'man qsub'.
>
> I hope this helps,
> Gus Correa
>
> On Feb 11, 2012, at 4:32 PM, Jason bacon wrote:
>
>> I've (apparently) had the same issue, but have not found a solution yet.
>>
>> Can you provide the following:
>>
>> 1. Operating system and version
>> 2. Torque version
>> 3. Relevant entries from server_logs on the submit node and mom_logs on the allocated compute node
>> 4. Anything unusual in the system log
>>
>> When I ran into this issue, I found some errors in the logs regarding failed socket connections. I think it might be a permissions issue, but have not had time to investigate yet.
>>
>> -J
>>
>> On 2/9/12 3:33 PM, Hakeem Almabrazi wrote:
>>> Hi,
>>>
>>> I have tried to submit a job using the option -I and I got the message
>>>
>>> Qsub: waiting for job # to start
>>> Qsub: job # ready
>>>
>>> And that is it.
>>>
>>> If I qstat I got a message saying the job # is still "R" running ...
>>>
>>> It looks like I have lack of understanding on how to use this option but here is my submit job request:
>>>
>>>> qsub -l nodes=1 -N jobName -I -v "some parameters" shellScript
>>>
>>>
>>> If I run the above request without the -I option, it runs fine without any issue.
>>>
>>> Someone might ask the question, why I am running it "interactively"?
>>>
>>> Well, I want to force the program which issued the request to wait for the result and do something with it after that.
>>>
>>> Thank you for your help.
>>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> torqueusers mailing list >>> >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From sm4082 at nyu.edu Sun Feb 12 08:53:14 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Sun, 12 Feb 2012 10:53:14 -0500 Subject: [torqueusers] Specifying nodes which can be used in array job In-Reply-To: <4F34CC19.6070805@gmail.com> References: <4F34CC19.6070805@gmail.com> Message-ID: <35ADCB45-7995-488E-A336-515F89ED63BD@nyu.edu> Hi Rainer, Like Ken wrote it is possible with feature property. I use this feature heavily to place jobs on specific nodes. To add feature to nodes for i in {0..5}; do qmgr -c "set node node0$i properties += arrays"; done Here feature is arrays. You can replace that with whatever you like. Once you've done this you can get array jobs placed on these nodes by requesting this feature in qsub such as >>> qsub the_script.sub -t 1-10 -l feature='arrays' This would put your jobs on the nodes that have property arrays. In this case the nodes are 0 to 5. In my case I wrote a qsub wrapper which goes through the pbs scripts and command line and adds this feature line such as #PBS -l feature= to the script so that they are placed on right nodes. This comes very handy especially when you have nodes with diiferent amounts of memory under the same queue. If your scheduler is moab you can do really cool stuff using this feature property. Hope this helps. Sreedhar. 
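For reference, the two steps described above (tag the nodes, then request the tag) could be sketched roughly as follows. This is only a dry-run illustration: it prints the commands rather than running them, so it can be tried on a machine without Torque. The helper names tag_nodes and array_qsub are made up for this sketch. The nodes are listed explicitly as node01 to node05, since a {0..5} range would also generate node00 and is bash-only syntax. The qsub line uses the "nodes=1:property" form, which plain Torque accepts for requesting a node property; the "-l feature=" form from the message may depend on the scheduler in use.

```shell
#!/bin/sh
# Sketch only: prints commands instead of executing them.
# "arrays" is the property name used in the message above; tag_nodes and
# array_qsub are hypothetical helper names, not Torque commands.

# Print one qmgr command per node that should carry the property.
# (qmgr needs manager rights, so an administrator would run these.)
tag_nodes() {
    property=$1
    shift
    for node in "$@"; do
        printf 'qmgr -c "set node %s properties += %s"\n' "$node" "$property"
    done
}

# Print a qsub line for an array job in which every sub-job asks for
# one node that has the given property.
array_qsub() {
    script=$1
    range=$2
    property=$3
    printf 'qsub -t %s -l nodes=1:%s %s\n' "$range" "$property" "$script"
}

tag_nodes arrays node01 node02 node03 node04 node05
array_qsub the_script.sub 1-10 arrays
```

Each sub-job of the array then asks for a single node carrying the property, so the scheduler is free to pick any of the tagged nodes for it.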
On 10-Feb-2012, at 2:49 AM, Rainer M Krug wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 09/02/12 23:39, Ken Nielson wrote: >> >> >> ----- Original Message ----- >>> From: "Rainer M Krug" To: >>> torqueusers at supercluster.org Sent: Thursday, February 9, 2012 >>> 2:16:07 AM Subject: [torqueusers] Specifying nodes which can be >>> used in array job >>> >>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >>> >>> Hi >>> >>> assuming I have cluster of 10 nodes (node01, ... node10), of >>> which I am not the administrator. >>> >>> Some nodes are setup slightly different, so that a certain job >>> only runs on nodes node01 to node05. >>> >>> So I would like to submit an array job and specify "only use the >>> node01, node02, node03, node04 or node05 to run the each >>> individual job". >>> >>> How can I do that? I know that I can use -l to specify resource >>> requirements, but if I specify nodes=..., *each* job will >>> allocate *all* nodes for the job, which is not what I want - each >>> individual job should run on one of the nodes. >>> >>> so: >>> >>> qsub the_script.sub -t 1-10 >>> >>> and how do I specify the nodes? >>> >>> Thanks, >>> >>> Rainer >> >> Rainer, >> >> Are there feature (properties) in the nodes files of those hosts >> which would allow you to specify a feature on the qsub line? > > No - unfortunately not. > >> >> Ken > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk80zBgACgkQoYgNqgF2egri1wCfUUqDmOigKB8hCyCvt30pu5jZ > kewAnjfVc6o7rIjFua0ukEBhkaNe5McS > =nBnt > -----END PGP SIGNATURE----- > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120212/d6dc9fed/attachment.html From r.m.krug at gmail.com Mon Feb 13 01:37:11 2012 From: r.m.krug at gmail.com (Rainer M Krug) Date: Mon, 13 Feb 2012 09:37:11 +0100 Subject: [torqueusers] Specifying nodes which can be used in array job In-Reply-To: <35ADCB45-7995-488E-A336-515F89ED63BD@nyu.edu> References: <4F34CC19.6070805@gmail.com> <35ADCB45-7995-488E-A336-515F89ED63BD@nyu.edu> Message-ID: <4F38CBB7.8000107@gmail.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Thanks a lot - this definitely helps. I will get in contact with our admin to add the features to the nodes. Cheers, Rainer On 12/02/12 16:53, Sreedhar Manchu wrote: > Hi Rainer, > > Like Ken wrote it is possible with feature property. I use this > feature heavily to place jobs on specific nodes. > > To add feature to nodes > > for i in {0..5}; do qmgr -c "set node node0$i properties += > arrays"; done > > Here feature is arrays. You can replace that with whatever you > like. > > Once you've done this you can get array jobs placed on these nodes > by requesting this feature in qsub such as > >>>> qsub the_script.sub -t 1-10 -l feature='arrays' > > This would put your jobs on the nodes that have property arrays. In > this case the nodes are 0 to 5. > > In my case I wrote a qsub wrapper which goes through the pbs > scripts and command line and adds this feature line such as #PBS -l > feature= to the script so that they are placed on > right nodes. This comes very handy especially when you have nodes > with diiferent amounts of memory under the same queue. > > If your scheduler is moab you can do really cool stuff using this > feature property. > > Hope this helps. > > Sreedhar. 
> > > > On 10-Feb-2012, at 2:49 AM, Rainer M Krug > wrote: > > On 09/02/12 23:39, Ken Nielson wrote: >>>> >>>> >>>> ----- Original Message ----- >>>>> From: "Rainer M Krug" >>>> > To: >>>>> torqueusers at supercluster.org >>>>> Sent: Thursday, >>>>> February 9, 2012 2:16:07 AM Subject: [torqueusers] >>>>> Specifying nodes which can be used in array job >>>>> >>>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >>>>> >>>>> Hi >>>>> >>>>> assuming I have cluster of 10 nodes (node01, ... node10), >>>>> of which I am not the administrator. >>>>> >>>>> Some nodes are setup slightly different, so that a certain >>>>> job only runs on nodes node01 to node05. >>>>> >>>>> So I would like to submit an array job and specify "only >>>>> use the node01, node02, node03, node04 or node05 to run the >>>>> each individual job". >>>>> >>>>> How can I do that? I know that I can use -l to specify >>>>> resource requirements, but if I specify nodes=..., *each* >>>>> job will allocate *all* nodes for the job, which is not >>>>> what I want - each individual job should run on one of the >>>>> nodes. >>>>> >>>>> so: >>>>> >>>>> qsub the_script.sub -t 1-10 >>>>> >>>>> and how do I specify the nodes? >>>>> >>>>> Thanks, >>>>> >>>>> Rainer >>>> >>>> Rainer, >>>> >>>> Are there feature (properties) in the nodes files of those >>>> hosts which would allow you to specify a feature on the qsub >>>> line? > > No - unfortunately not. 
> >>>> >>>> Ken > >> >> _______________________________________________ torqueusers >> mailing list torqueusers at supercluster.org >> >> http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ torqueusers mailing > list torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk84y7cACgkQoYgNqgF2egodAgCfXBJiNsn+NtC8B3fO3R1fQTGd VG0AnjzI5iBr390vLggHRpm4EmRybxSC =x/dl -----END PGP SIGNATURE-----
From R.M.Krug at gmail.com Mon Feb 13 04:42:57 2012 From: R.M.Krug at gmail.com (Rainer M Krug) Date: Mon, 13 Feb 2012 12:42:57 +0100 Subject: [torqueusers] Setting up torque on PC to test scripts? Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi I am thinking of setting up a dummy cluster on my local PC to test my submit scripts. Is this easily possible? Or is there even a virtual machine to download which has torque installed? That would be the easiest. Thanks, Rainer -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk8490EACgkQoYgNqgF2egqJDgCfXYX1gv1i2B/ABAaZjoEtg44C awMAniAtzHKb8znKxGmbHJUDOcFs5ohT =RUyd -----END PGP SIGNATURE-----
From nt_mahmood at yahoo.com Mon Feb 13 06:15:24 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Mon, 13 Feb 2012 05:15:24 -0800 (PST) Subject: [torqueusers] submitting a job, then modify script and resubmit Message-ID: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com> Dear all, 1- Assume I have a script (named scr.sh) like this: #PBS -N run1 #PBS -V #PBS -l nodes=1 #PBS -q long #PBS -o /home/mahmood/run1.out #PBS -j oe cd $PBS_O_WORKDIR ./run1 config1 2- Then I run qsub scr.sh 3- the jobs state is 'Q' since all cores are busy. That is fine...
4- While run1 is in 'Q', I reopen scr.sh and change it to

#PBS -N run2
#PBS -V
#PBS -l nodes=1
#PBS -q long
#PBS -o /home/mahmood/run2.out
#PBS -j oe
cd $PBS_O_WORKDIR
./run1 config2

5- Then I run
qsub scr.sh

6- This job is also in 'Q' until some cores become free.
7- After some hours, two cores become free and both "run1" and "run2" change to 'R'.

The question is: when run1 changes to 'R', will it read its own scr.sh (which contains "./run1 config1")?
Since the latest modification to scr.sh belongs to run2, I wonder what happens to run1.
You may say that I can test that myself to see the result, but since the jobs stay in 'Q' for a long time, I want to be sure about what I am doing.
I hope I have stated the problem correctly.

// Naderan *Mahmood;

From akohlmey at cmm.chem.upenn.edu Mon Feb 13 07:03:20 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Mon, 13 Feb 2012 09:03:20 -0500
Subject: [torqueusers] submitting a job, then modify script and resubmit
In-Reply-To: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com>
References: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com>
Message-ID:

On Mon, Feb 13, 2012 at 8:15 AM, Mahmood Naderan wrote:
> Dear all,
> 1- Assume I have a script (named scr.sh) like this:
>
> #PBS -N run1
> #PBS -V
> #PBS -l nodes=1
> #PBS -q long
> #PBS -o /home/mahmood/run1.out
> #PBS -j oe
> cd $PBS_O_WORKDIR
> ./run1 config1
>
> 2- Then I run
> qsub scr.sh
>
> 3- the jobs state is 'Q' since all cores are busy. That is fine...
> 4- while run1 is in 'Q', I reopen scr.sh and change it to
>
> #PBS -N run2
> #PBS -V
> #PBS -l nodes=1
> #PBS -q long
> #PBS -o /home/mahmood/run2.out
> #PBS -j oe
> cd $PBS_O_WORKDIR
> ./run1 config2
>
>
> 5- Then I run
> qsub scr.sh
>
> 6- this job also is in 'Q' until some cores become free.
> 7- After some hours, two cores become free and both "run1" and "run2" change to 'R'.
>
> The question is, when run1 change to 'R', will it read its own scr.sh (which contain "./run1 config1")?
> Since the latest modification to scr.sh belongs to run2, I wonder what happen to run1?

qsub makes a copy of the submit script, and the batch system executes that copy. That is the only way it can consistently handle submissions that are fed in from standard input.

axel.

> You may say that I can test that myself to see the result, but since they stay in 'Q' for long time, I want to be sure about what I am doing.
> Hope that I state the problem correctly.
>
>
> // Naderan *Mahmood;
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From nt_mahmood at yahoo.com Mon Feb 13 07:48:50 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Mon, 13 Feb 2012 06:48:50 -0800 (PST)
Subject: [torqueusers] submitting a job, then modify script and resubmit
In-Reply-To:
References: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com>
Message-ID: <1329144530.34446.YahooMailNeo@web111714.mail.gq1.yahoo.com>

Thanks for your answer. I got that.

Another question is about modifying other config files. Does qsub copy all files, or does it copy only the script?
In this case, I have a scr.sh which is:
#PBS ...
./run config.ini

The config.ini file looks like:
exe = bin1
args = 8

The first qsub is:
qsub scr.sh

After submitting that, I modify config.ini to

exe = bin1
args = 64

Then I resubmit scr.sh:

qsub scr.sh

Please note that scr.sh doesn't change in this case; only config.ini is modified.

Does your answer apply to this case?
Regards // Naderan *Mahmood; ________________________________ From: Axel Kohlmeyer To: Mahmood Naderan ; Torque Users Mailing List Sent: Monday, February 13, 2012 5:33 PM Subject: Re: [torqueusers] submitting a job, then modify script and resubmit On Mon, Feb 13, 2012 at 8:15 AM, Mahmood Naderan wrote: > Dear all, > 1- Assume I have a script (named scr.sh) like this: > > #PBS -N run1 > #PBS -V > #PBS -l nodes=1 > #PBS -q long > #PBS -o /home/mahmood/run1.out > #PBS -j oe > cd $PBS_O_WORKDIR > ./run1 config1 > > 2- Then I run > qsub scr.sh > > 3- the jobs state is 'Q' since all cores are busy. That is fine... > 4- while run1 is in 'Q', I reopen scr.sh and change it to > > #PBS -N run2 > #PBS -V > #PBS -l nodes=1 > #PBS -q long > #PBS -o /home/mahmood/run2.out > #PBS -j oe > cd $PBS_O_WORKDIR > ./run1 config2 > > > 5- Then I run > qsub scr.sh > > 6- this job also is in 'Q' until some cores become free. > 7- After some hours, two cores become free and both "run1" and "run2" change to 'R'. > > The question is, when run1 change to 'R', will it read its own scr.sh (which contain "./run config1")? > Since the latest modification to scr.sh belongs to run2, I wonder what happen to run1? qsub makes a copy of the submit script and the batch system will execute that. that is the only way how it can consistently feeding submission from standard input. axel. > You may say that I can test that myself to see the result, but since they stay in 'Q' for long time, I want to be sure about what I am doing. > Hope that I state the problem correctly. > > > // Naderan *Mahmood; > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers -- Dr. Axel Kohlmeyer? ? akohlmey at gmail.com http://sites.google.com/site/akohlmey/ Institute for Computational Molecular Science Temple University, Philadelphia PA, USA. 
From akohlmey at cmm.chem.upenn.edu Mon Feb 13 07:54:23 2012 From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer) Date: Mon, 13 Feb 2012 09:54:23 -0500 Subject: [torqueusers] submitting a job, then modify script and resubmit In-Reply-To: <1329144530.34446.YahooMailNeo@web111714.mail.gq1.yahoo.com> References: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com> <1329144530.34446.YahooMailNeo@web111714.mail.gq1.yahoo.com> Message-ID: On Mon, Feb 13, 2012 at 9:48 AM, Mahmood Naderan wrote: > Thanks for your answer. I got that. > > Another question is about modification of other config files. Does qsub copy all files or it does only copy the script? it only copies the script. that is all it knows about. axel. > In this case, I have a scr.sh which is: > #PBS ... > ./run config.ini > > > the config.ini file looks like: > exe = bin1 > args = 8 > > The first qsub is: > qsub scr.sh > > After submitting that, I modify the config.ini to > > exe = bin1 > args = 64 > > Then I resubmit scr.sh again > > qsub scr.sh > > Please note that the scr.sh doesn't change in this case. However the config.ini is modified. > > > Does your answer apply to this case? > > Regards > // Naderan *Mahmood; > > > ________________________________ > From: Axel Kohlmeyer > To: Mahmood Naderan ; Torque Users Mailing List > Sent: Monday, February 13, 2012 5:33 PM > Subject: Re: [torqueusers] submitting a job, then modify script and resubmit > > On Mon, Feb 13, 2012 at 8:15 AM, Mahmood Naderan wrote: >> Dear all, >> 1- Assume I have a script (named scr.sh) like this: >> >> #PBS -N run1 >> #PBS -V >> #PBS -l nodes=1 >> #PBS -q long >> #PBS -o /home/mahmood/run1.out >> #PBS -j oe >> cd $PBS_O_WORKDIR >> ./run1 config1 >> >> 2- Then I run >> qsub scr.sh >> >> 3- the jobs state is 'Q' since all cores are busy. That is fine... 
>> 4- while run1 is in 'Q', I reopen scr.sh and change it to >> >> #PBS -N run2 >> #PBS -V >> #PBS -l nodes=1 >> #PBS -q long >> #PBS -o /home/mahmood/run2.out >> #PBS -j oe >> cd $PBS_O_WORKDIR >> ./run1 config2 >> >> >> 5- Then I run >> qsub scr.sh >> >> 6- this job also is in 'Q' until some cores become free. >> 7- After some hours, two cores become free and both "run1" and "run2" change to 'R'. >> >> The question is, when run1 change to 'R', will it read its own scr.sh (which contain "./run config1")? >> Since the latest modification to scr.sh belongs to run2, I wonder what happen to run1? > > qsub makes a copy of the submit script and the batch system will > execute that. that is the only way how it can consistently feeding > submission from standard input. > > axel. > > >> You may say that I can test that myself to see the result, but since they stay in 'Q' for long time, I want to be sure about what I am doing. >> Hope that I state the problem correctly. >> >> >> // Naderan *Mahmood; >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > > > -- > Dr. Axel Kohlmeyer? ? akohlmey at gmail.com > http://sites.google.com/site/akohlmey/ > > Institute for Computational Molecular Science > Temple University, Philadelphia PA, USA. -- Dr. Axel Kohlmeyer? ? akohlmey at gmail.com http://sites.google.com/site/akohlmey/ Institute for Computational Molecular Science Temple University, Philadelphia PA, USA. From nt_mahmood at yahoo.com Mon Feb 13 07:56:46 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Mon, 13 Feb 2012 06:56:46 -0800 (PST) Subject: [torqueusers] submitting a job, then modify script and resubmit In-Reply-To: References: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com> <1329144530.34446.YahooMailNeo@web111714.mail.gq1.yahoo.com> Message-ID: <1329145006.7678.YahooMailNeo@web111712.mail.gq1.yahoo.com> That sounds bad... 
Why it doesn't copy all necessary files for a job? ? // Naderan *Mahmood; ----- Original Message ----- From: Axel Kohlmeyer To: Mahmood Naderan Cc: torque cluster Sent: Monday, February 13, 2012 6:24 PM Subject: Re: [torqueusers] submitting a job, then modify script and resubmit On Mon, Feb 13, 2012 at 9:48 AM, Mahmood Naderan wrote: > Thanks for your answer. I got that. > > Another question is about modification of other config files. Does qsub copy all files or it does only copy the script? it only copies the script. that is all it knows about. axel. > In this case, I have a scr.sh which is: > #PBS ... > ./run config.ini > > > the config.ini file looks like: > exe = bin1 > args = 8 > > The first qsub is: > qsub scr.sh > > After submitting that, I modify the config.ini to > > exe = bin1 > args = 64 > > Then I resubmit scr.sh again > > qsub scr.sh > > Please note that the scr.sh doesn't change in this case. However the config.ini is modified. > > > Does your answer apply to this case? > > Regards > // Naderan *Mahmood; > > > ________________________________ > From: Axel Kohlmeyer > To: Mahmood Naderan ; Torque Users Mailing List > Sent: Monday, February 13, 2012 5:33 PM > Subject: Re: [torqueusers] submitting a job, then modify script and resubmit > > On Mon, Feb 13, 2012 at 8:15 AM, Mahmood Naderan wrote: >> Dear all, >> 1- Assume I have a script (named scr.sh) like this: >> >> #PBS -N run1 >> #PBS -V >> #PBS -l nodes=1 >> #PBS -q long >> #PBS -o /home/mahmood/run1.out >> #PBS -j oe >> cd $PBS_O_WORKDIR >> ./run1 config1 >> >> 2- Then I run >> qsub scr.sh >> >> 3- the jobs state is 'Q' since all cores are busy. That is fine... >> 4- while run1 is in 'Q', I reopen scr.sh and change it to >> >> #PBS -N run2 >> #PBS -V >> #PBS -l nodes=1 >> #PBS -q long >> #PBS -o /home/mahmood/run2.out >> #PBS -j oe >> cd $PBS_O_WORKDIR >> ./run1 config2 >> >> >> 5- Then I run >> qsub scr.sh >> >> 6- this job also is in 'Q' until some cores become free. 
>> 7- After some hours, two cores become free and both "run1" and "run2" change to 'R'. >> >> The question is, when run1 change to 'R', will it read its own scr.sh (which contain "./run config1")? >> Since the latest modification to scr.sh belongs to run2, I wonder what happen to run1? > > qsub makes a copy of the submit script and the batch system will > execute that. that is the only way how it can consistently feeding > submission from standard input. > > axel. > > >> You may say that I can test that myself to see the result, but since they stay in 'Q' for long time, I want to be sure about what I am doing. >> Hope that I state the problem correctly. >> >> >> // Naderan *Mahmood; >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > > > -- > Dr. Axel Kohlmeyer? ? akohlmey at gmail.com > http://sites.google.com/site/akohlmey/ > > Institute for Computational Molecular Science > Temple University, Philadelphia PA, USA. -- Dr. Axel Kohlmeyer? ? akohlmey at gmail.com http://sites.google.com/site/akohlmey/ Institute for Computational Molecular Science Temple University, Philadelphia PA, USA. From akohlmey at cmm.chem.upenn.edu Mon Feb 13 08:01:46 2012 From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer) Date: Mon, 13 Feb 2012 10:01:46 -0500 Subject: [torqueusers] submitting a job, then modify script and resubmit In-Reply-To: <1329145006.7678.YahooMailNeo@web111712.mail.gq1.yahoo.com> References: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com> <1329144530.34446.YahooMailNeo@web111714.mail.gq1.yahoo.com> <1329145006.7678.YahooMailNeo@web111712.mail.gq1.yahoo.com> Message-ID: On Mon, Feb 13, 2012 at 9:56 AM, Mahmood Naderan wrote: > That sounds bad... > Why it doesn't copy all necessary files for a job? how should it know? remember the script interpreter for the batch script can be *anything* that reads a text file. 
it doesn't have to be a shell program. and the arguments and commands in the script can be computed on the fly. how to know in advance. if you want consistent behavior, you have to make the copies yourself, e.g. create a subdirectory for each submission and copy stuff in there from a wrapper script to qsub. you are the only person to set this up, since you know which files will be needed by a job. only the submit script is required. axel. > > > // Naderan *Mahmood; > > > ----- Original Message ----- > From: Axel Kohlmeyer > To: Mahmood Naderan > Cc: torque cluster > Sent: Monday, February 13, 2012 6:24 PM > Subject: Re: [torqueusers] submitting a job, then modify script and resubmit > > On Mon, Feb 13, 2012 at 9:48 AM, Mahmood Naderan wrote: >> Thanks for your answer. I got that. >> >> Another question is about modification of other config files. Does qsub copy all files or it does only copy the script? > > it only copies the script. that is all it knows about. > > axel. > >> In this case, I have a scr.sh which is: >> #PBS ... >> ./run config.ini >> >> >> the config.ini file looks like: >> exe = bin1 >> args = 8 >> >> The first qsub is: >> qsub scr.sh >> >> After submitting that, I modify the config.ini to >> >> exe = bin1 >> args = 64 >> >> Then I resubmit scr.sh again >> >> qsub scr.sh >> >> Please note that the scr.sh doesn't change in this case. However the config.ini is modified. >> >> >> Does your answer apply to this case? 
>> >> Regards >> // Naderan *Mahmood; >> >> >> ________________________________ >> From: Axel Kohlmeyer >> To: Mahmood Naderan ; Torque Users Mailing List >> Sent: Monday, February 13, 2012 5:33 PM >> Subject: Re: [torqueusers] submitting a job, then modify script and resubmit >> >> On Mon, Feb 13, 2012 at 8:15 AM, Mahmood Naderan wrote: >>> Dear all, >>> 1- Assume I have a script (named scr.sh) like this: >>> >>> #PBS -N run1 >>> #PBS -V >>> #PBS -l nodes=1 >>> #PBS -q long >>> #PBS -o /home/mahmood/run1.out >>> #PBS -j oe >>> cd $PBS_O_WORKDIR >>> ./run1 config1 >>> >>> 2- Then I run >>> qsub scr.sh >>> >>> 3- the jobs state is 'Q' since all cores are busy. That is fine... >>> 4- while run1 is in 'Q', I reopen scr.sh and change it to >>> >>> #PBS -N run2 >>> #PBS -V >>> #PBS -l nodes=1 >>> #PBS -q long >>> #PBS -o /home/mahmood/run2.out >>> #PBS -j oe >>> cd $PBS_O_WORKDIR >>> ./run1 config2 >>> >>> >>> 5- Then I run >>> qsub scr.sh >>> >>> 6- this job also is in 'Q' until some cores become free. >>> 7- After some hours, two cores become free and both "run1" and "run2" change to 'R'. >>> >>> The question is, when run1 change to 'R', will it read its own scr.sh (which contain "./run config1")? >>> Since the latest modification to scr.sh belongs to run2, I wonder what happen to run1? >> >> qsub makes a copy of the submit script and the batch system will >> execute that. that is the only way how it can consistently feeding >> submission from standard input. >> >> axel. >> >> >>> You may say that I can test that myself to see the result, but since they stay in 'Q' for long time, I want to be sure about what I am doing. >>> Hope that I state the problem correctly. >>> >>> >>> // Naderan *Mahmood; >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> >> -- >> Dr. Axel Kohlmeyer? ? 
akohlmey at gmail.com >> http://sites.google.com/site/akohlmey/ >> >> Institute for Computational Molecular Science >> Temple University, Philadelphia PA, USA. > > > > -- > Dr. Axel Kohlmeyer? ? akohlmey at gmail.com > http://sites.google.com/site/akohlmey/ > > Institute for Computational Molecular Science > Temple University, Philadelphia PA, USA. > -- Dr. Axel Kohlmeyer? ? akohlmey at gmail.com http://sites.google.com/site/akohlmey/ Institute for Computational Molecular Science Temple University, Philadelphia PA, USA. From nt_mahmood at yahoo.com Mon Feb 13 08:06:50 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Mon, 13 Feb 2012 07:06:50 -0800 (PST) Subject: [torqueusers] submitting a job, then modify script and resubmit In-Reply-To: References: <1329138924.737.YahooMailNeo@web111712.mail.gq1.yahoo.com> <1329144530.34446.YahooMailNeo@web111714.mail.gq1.yahoo.com> <1329145006.7678.YahooMailNeo@web111712.mail.gq1.yahoo.com> Message-ID: <1329145610.26261.YahooMailNeo@web111717.mail.gq1.yahoo.com> ok thanks. Its my job then ? // Naderan *Mahmood; ----- Original Message ----- From: Axel Kohlmeyer To: Mahmood Naderan Cc: torque cluster Sent: Monday, February 13, 2012 6:31 PM Subject: Re: [torqueusers] submitting a job, then modify script and resubmit On Mon, Feb 13, 2012 at 9:56 AM, Mahmood Naderan wrote: > That sounds bad... > Why it doesn't copy all necessary files for a job? how should it know? remember the script interpreter for the batch script can be *anything* that reads a text file. it doesn't have to be a shell program. and the arguments and commands in the script can be computed on the fly. how to know in advance. if you want consistent behavior, you have to make the copies yourself, e.g. create a subdirectory for each submission and copy stuff in there from a wrapper script to qsub. you are the only person to set this up, since you know which files will be needed by a job. only the submit script is required. axel. 
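A minimal sketch of the kind of wrapper described above, assuming a POSIX shell and the common `mktemp` utility. The function name `stage_and_submit`, the `QSUB` override and the `STAGED_DIR` variable are made up for illustration and are not part of Torque. The only Torque behaviour it relies on is what the thread already states: qsub snapshots the script and nothing else, so the other input files have to be copied by hand.

```shell
#!/bin/sh
# Hypothetical helper, not part of Torque.
# QSUB can be overridden (for example QSUB=true) to dry-run without a batch system.
QSUB=${QSUB:-qsub}

# stage_and_submit SCRIPT [FILE...]
# Copies SCRIPT and every FILE into a fresh directory, then runs qsub from there.
# On success the directory name is left in STAGED_DIR.
stage_and_submit() {
    script=$1
    shift
    # One private directory per submission, so later edits cannot leak into a queued job.
    STAGED_DIR=$(mktemp -d "./$(basename "$script").XXXXXX") || return 1
    # Snapshot the script together with the files the job will read.
    cp "$script" "$@" "$STAGED_DIR"/ || return 1
    # Submit from inside the snapshot; the subshell keeps the caller's directory unchanged.
    (cd "$STAGED_DIR" && "$QSUB" "$(basename "$script")")
}
```

With this in place, `stage_and_submit scr.sh config.ini` would be run once per submission. Because qsub is invoked from the staged directory, a `cd $PBS_O_WORKDIR` in the script should land the job next to its own copy of config.ini, so editing the original file afterwards does not affect jobs that are still in 'Q'.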
> > > // Naderan *Mahmood; > > > ----- Original Message ----- > From: Axel Kohlmeyer > To: Mahmood Naderan > Cc: torque cluster > Sent: Monday, February 13, 2012 6:24 PM > Subject: Re: [torqueusers] submitting a job, then modify script and resubmit > > On Mon, Feb 13, 2012 at 9:48 AM, Mahmood Naderan wrote: >> Thanks for your answer. I got that. >> >> Another question is about modification of other config files. Does qsub copy all files or it does only copy the script? > > it only copies the script. that is all it knows about. > > axel. > >> In this case, I have a scr.sh which is: >> #PBS ... >> ./run config.ini >> >> >> the config.ini file looks like: >> exe = bin1 >> args = 8 >> >> The first qsub is: >> qsub scr.sh >> >> After submitting that, I modify the config.ini to >> >> exe = bin1 >> args = 64 >> >> Then I resubmit scr.sh again >> >> qsub scr.sh >> >> Please note that the scr.sh doesn't change in this case. However the config.ini is modified. >> >> >> Does your answer apply to this case? >> >> Regards >> // Naderan *Mahmood; >> >> >> ________________________________ >> From: Axel Kohlmeyer >> To: Mahmood Naderan ; Torque Users Mailing List >> Sent: Monday, February 13, 2012 5:33 PM >> Subject: Re: [torqueusers] submitting a job, then modify script and resubmit >> >> On Mon, Feb 13, 2012 at 8:15 AM, Mahmood Naderan wrote: >>> Dear all, >>> 1- Assume I have a script (named scr.sh) like this: >>> >>> #PBS -N run1 >>> #PBS -V >>> #PBS -l nodes=1 >>> #PBS -q long >>> #PBS -o /home/mahmood/run1.out >>> #PBS -j oe >>> cd $PBS_O_WORKDIR >>> ./run1 config1 >>> >>> 2- Then I run >>> qsub scr.sh >>> >>> 3- the jobs state is 'Q' since all cores are busy. That is fine... 
>>> 4- while run1 is in 'Q', I reopen scr.sh and change it to >>> >>> #PBS -N run2 >>> #PBS -V >>> #PBS -l nodes=1 >>> #PBS -q long >>> #PBS -o /home/mahmood/run2.out >>> #PBS -j oe >>> cd $PBS_O_WORKDIR >>> ./run1 config2 >>> >>> >>> 5- Then I run >>> qsub scr.sh >>> >>> 6- this job also is in 'Q' until some cores become free. >>> 7- After some hours, two cores become free and both "run1" and "run2" change to 'R'. >>> >>> The question is, when run1 change to 'R', will it read its own scr.sh (which contain "./run config1")? >>> Since the latest modification to scr.sh belongs to run2, I wonder what happen to run1? >> >> qsub makes a copy of the submit script and the batch system will >> execute that. that is the only way how it can consistently feeding >> submission from standard input. >> >> axel. >> >> >>> You may say that I can test that myself to see the result, but since they stay in 'Q' for long time, I want to be sure about what I am doing. >>> Hope that I state the problem correctly. >>> >>> >>> // Naderan *Mahmood; >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >> >> -- >> Dr. Axel Kohlmeyer? ? akohlmey at gmail.com >> http://sites.google.com/site/akohlmey/ >> >> Institute for Computational Molecular Science >> Temple University, Philadelphia PA, USA. > > > > -- > Dr. Axel Kohlmeyer? ? akohlmey at gmail.com > http://sites.google.com/site/akohlmey/ > > Institute for Computational Molecular Science > Temple University, Philadelphia PA, USA. > -- Dr. Axel Kohlmeyer? ? akohlmey at gmail.com http://sites.google.com/site/akohlmey/ Institute for Computational Molecular Science Temple University, Philadelphia PA, USA. From gus at ldeo.columbia.edu Mon Feb 13 08:53:33 2012 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Mon, 13 Feb 2012 10:53:33 -0500 Subject: [torqueusers] Setting up torque on PC to test scripts? 
In-Reply-To:
References:
Message-ID:

On Feb 13, 2012, at 6:42 AM, Rainer M Krug wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi
>
> I am thinking of setting up a dummy cluster on my local PC to test my
> submit scripts. Is this easily possible? Or is there even a virtual
> machine to download which has torque installed? That would be the easiest.
>
> Thanks,
>
> Rainer
>

Hi Rainer

We did this on a few machines here, and a lot of people also use Torque on a single machine, as it is very convenient for scheduling and controlling jobs.

Many Linux distributions have Torque packages. That is the fast way to install it. Check with your 'yum', 'apt-get' or similar.
However, I don't think Maui is packaged as well, but I haven't checked this lately.

If you want full control (choosing the release, etc.), you can download and install both from source on your standalone machine.
Then run the three daemons, pbs_server, maui [or if you prefer it, pbs_sched], and pbs_mom, on that machine.
The setup for queues, etc., is about the same as on a cluster.
It is simpler if you allow the server name to just default to 'localhost'.

I hope this helps,
Gus Correa

> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk8490EACgkQoYgNqgF2egqJDgCfXYX1gv1i2B/ABAaZjoEtg44C
> awMAniAtzHKb8znKxGmbHJUDOcFs5ohT
> =RUyd
> -----END PGP SIGNATURE-----
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From Michael.Zulauf at iberdrolaren.com Mon Feb 13 16:22:36 2012
From: Michael.Zulauf at iberdrolaren.com (Zulauf, Michael)
Date: Mon, 13 Feb 2012 15:22:36 -0800
Subject: [torqueusers] problem with jobs sharing cores
Message-ID:

A big thanks to Ken Nielson, Jim Coyle, and Fotis Georgatos - based on their help I made some significant progress today.
I haven't completely worked out all the details yet, but I've found that by switching to OpenMPI, I do not get the same problematic behavior. So it seems most likely that the source of the problem has something to do with a configuration detail of our mvapich2 installation. According to some earlier benchmarking I'd done, mvapich2 seems to offer better performance across our infiniband interconnect, so I'd like to see if I can get to the bottom of the issue with that. Alternatively, I could use mvapich2 for the "large" jobs (which span multiple nodes), and use OpenMPI for the "small" jobs (which will share a node with other jobs). I'd prefer to avoid the "dual MPI" alternative, as then we'd have to build all executables twice, and some of them are a bit tricky. Still, I suppose it's an option. In any case, thanks again. Now maybe I can go haunt the mvapich2 lists, or at least start trying to dig up the solution in that documentation. Happy computing, Mike -- Mike Zulauf Meteorologist, Lead Senior Asset Optimization Iberdrola Renewables 1125 NW Couch, Suite 700 Portland, OR 97209 Office: 503-478-6304 Cell: 503-913-0403
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120214/ca861235/attachment.html From r.m.krug at gmail.com Tue Feb 14 02:03:13 2012 From: r.m.krug at gmail.com (Rainer M Krug) Date: Tue, 14 Feb 2012 10:03:13 +0100 Subject: [torqueusers] Setting up torque on PC to test scripts? In-Reply-To: References: Message-ID: <4F3A2351.6090307@gmail.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 13/02/12 16:53, Gustavo Correa wrote: > > On Feb 13, 2012, at 6:42 AM, Rainer M Krug wrote: > > Hi > > I am thinking of setting up a dummy cluster on my local PC to test > my submit scripts. Is this easily possible? Or is there even a > virtual machine to download which has torque installed? That would > be the easiest. > > Thanks, > > Rainer > >> Hi Rainer > >> We did this on a few machines here, and a lot of people also use >> Torque in a single machine, as it is very convenient to schedule >> and control jobs. Interesting - I'll look into that. > >> Many Linux distributions have Torque packages. That is the fast >> way to install it. Check with your 'yum', 'apt-get' or similar. I am on Ubuntu and torque is there - great. I am just in the process of installing it. >> However, I don't think there is Maui also, but I haven't checked >> this lately. > >> If you want full control, choose the release, etc, you can >> download and install both from source in your standalone >> machine. Then run the three daemons, pbs_server, maui [or if you >> prefer it, pbs_server], and pbs_mom on that machine. The setup >> for queues, etc, is about the same. It is simpler if you allow >> the server name just default to 'localhost'. Sounds easy - do I have to configure afterwards manually, or is the default fine? It would be a single machine (dual core) install? 
Rainer > >> I hope this helps, Gus Correa > >> >> _______________________________________________ torqueusers >> mailing list torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk86I1AACgkQoYgNqgF2egpWMwCfYGlrry48mRG2cwbmkk6zy+ev yFkAn1sDD4+55ApGs5aqIjZrbYaNyNzs =5qXW -----END PGP SIGNATURE-----
From nt_mahmood at yahoo.com Tue Feb 14 03:04:02 2012 From: nt_mahmood at yahoo.com (Mahmood Naderan) Date: Tue, 14 Feb 2012 02:04:02 -0800 (PST) Subject: [torqueusers] finding node name that run a job Message-ID: <1329213842.50370.YahooMailNeo@web111702.mail.gq1.yahoo.com> Hi, How can I find which node is running a specific job? If I use "nodes=1", then "qstat -a" also shows "1" in TSK column. // Naderan *Mahmood; From basv at sara.nl Tue Feb 14 03:25:13 2012 From: basv at sara.nl (Bas van der Vlies) Date: Tue, 14 Feb 2012 11:25:13 +0100 Subject: [torqueusers] finding node name that run a job In-Reply-To: <1329213842.50370.YahooMailNeo@web111702.mail.gq1.yahoo.com> References: <1329213842.50370.YahooMailNeo@web111702.mail.gq1.yahoo.com> Message-ID: <4F3A3689.7070200@sara.nl> On 02/14/2012 11:04 AM, Mahmood Naderan wrote: > Hi, > How can I find which node is running a specific job? > If I use "nodes=1", then "qstat -a" also shows "1" in TSK column.
>
> // Naderan *Mahmood;
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

all jobs:
 * qstat -rn

specific job:
 * qstat -n

For more options:
 * man qstat

--
********************************************************************
* Bas van der Vlies                     e-mail: basv at sara.nl     *
* SARA - Academic Computing Services    Amsterdam, The Netherlands *
********************************************************************
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4553 bytes
Desc: S/MIME Cryptographic Signature
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120214/f3ecd2d3/attachment.bin

From ken at fq.edu.uy Tue Feb 14 03:58:09 2012
From: ken at fq.edu.uy (Kenneth Irving)
Date: Tue, 14 Feb 2012 08:58:09 -0200 (UYST)
Subject: [torqueusers] finding node name that run a job
In-Reply-To: <1329213842.50370.YahooMailNeo@web111702.mail.gq1.yahoo.com>
References: <1329213842.50370.YahooMailNeo@web111702.mail.gq1.yahoo.com>
Message-ID:

Try

qstat -f

In the listing you should see "exec_host" with the name or names of
the nodes executing your job.

best regards

Kenneth

On Tue, 14 Feb 2012, Mahmood Naderan wrote:

> Hi,
> How can I find which node is running a specific job?
> If I use "nodes=1", then "qstat -a" also shows "1" in TSK column.
>
> // Naderan *Mahmood;
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From nt_mahmood at yahoo.com Tue Feb 14 04:31:50 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Tue, 14 Feb 2012 03:31:50 -0800 (PST)
Subject: [torqueusers] finding node name that run a job
In-Reply-To:
References: <1329213842.50370.YahooMailNeo@web111702.mail.gq1.yahoo.com>
Message-ID: <1329219110.86018.YahooMailNeo@web111720.mail.gq1.yahoo.com>

ok thanks. qstat -rn is fine ?

// Naderan *Mahmood;

----- Original Message -----
From: Kenneth Irving
To: Mahmood Naderan ; Torque Users Mailing List
Cc:
Sent: Tuesday, February 14, 2012 2:28 PM
Subject: Re: [torqueusers] finding node name that run a job

Try

qstat -f

In the listing you should see "exec_host" with the name or names of
the nodes executing your job.

best regards

Kenneth

On Tue, 14 Feb 2012, Mahmood Naderan wrote:

> Hi,
> How can I find which node is running a specific job?
> If I use "nodes=1", then "qstat -a" also shows "1" in TSK column.
>
> // Naderan *Mahmood;
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From christina.salls at noaa.gov Tue Feb 14 07:36:35 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Tue, 14 Feb 2012 09:36:35 -0500
Subject: [torqueusers] Basic torque config
Message-ID:

Hi all,

I finally made some progress but am not all the way there yet. I changed the hostname of the server to admin, which is the hostname assigned to the interface that the compute nodes are physically connected to. Now my pbsnodes command shows the nodes as free!!

[root at wings torque]# pbsnodes -a
n001.default.domain
     state = free
     np = 1
     ntype = cluster
     status = rectime=1328910309,varattr=,jobs=,state=free,netload=700143,gres=,loadave=0.02,ncpus=24,physmem=20463136kb,availmem=27835692kb,totmem=28655128kb,idletime=1502,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
     gpus = 0

n002.default.domain
     state = free
     np = 1
     ntype = cluster
     status = rectime=1328910310,varattr=,jobs=,state=free,netload=712138,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31894548kb,totmem=32792076kb,idletime=1510,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
     gpus = 0

....For all 20 nodes.

And now when I submit a job, I get a job id back, however, the jobs stays in the queue state.

-bash-4.1$ ./example_submit_script_1
Fri Feb 10 15:46:35 CST 2012
Fri Feb 10 15:46:45 CST 2012
-bash-4.1$ ./example_submit_script_1 | qsub
6.admin.default.domain
-bash-4.1$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4.wings                   STDIN            salls                  0 Q batch
5.wings                   STDIN            salls                  0 Q batch
6.admin                   STDIN            salls                  0 Q batch

I deleted the two jobs that were created when wings was the server in case they were getting in the way

[root at wings torque]# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
6.admin                   STDIN            salls                  0 Q batch
[root at wings torque]# qstat -a

admin.default.domain:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
6.admin.default.     salls    batch    STDIN               --    --   --    --    --  Q   --
[root at wings torque]#

I don't see anything that seems significant in the logs:

Lots of entries like this in the server log:

02/14/2012 08:05:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
02/14/2012 08:10:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
02/14/2012 08:15:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0

This is the entirety of the sched_log:

02/10/2012 07:06:52;0002; pbs_sched;Svr;Log;Log opened
02/10/2012 07:06:52;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120210 opened
02/10/2012 07:06:52;0002; pbs_sched;Svr;main;pbs_sched startup pid 12576
02/10/2012 07:09:14;0080; pbs_sched;Svr;main;brk point 6848512
02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log opened
02/10/2012 15:45:04;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::Address already in use (98) in main, bind
02/10/2012 15:45:04;0002; pbs_sched;Svr;die;abnormal termination
02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log closed

mom logs on the compute nodes have the same multiple entries:

02/14/2012 08:03:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:08:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:13:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:18:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/14/2012 08:23:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0

ps looks like this:

-bash-4.1$ ps -ef | grep pbs
root     12576     1  0 Feb10 ?        00:00:00 pbs_sched
salls    12727 26862  0 08:19 pts/0    00:00:00 grep pbs
root     25810     1  0 Feb10 ?        00:00:25 pbs_server -H admin.default.domain

The server and queue settings are as follows:

Qmgr: list server
Server admin.default.domain
    server_state = Active
    scheduling = True
    total_jobs = 1
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
    acl_hosts = admin.default.domain,wings.glerl.noaa.gov
    default_queue = batch
    log_events = 511
    mail_from = adm
    scheduler_iteration = 600
    node_check_rate = 150
    tcp_timeout = 6
    mom_job_sync = True
    pbs_version = 2.5.9
    keep_completed = 300
    next_job_number = 7
    net_counter = 1 0 0

Qmgr: list queue batch
Queue batch
    queue_type = Execution
    Priority = 100
    total_jobs = 1
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
    max_running = 300
    mtime = Thu Feb 9 18:22:33 2012
    enabled = True
    started = True

Do I need to create a routing queue? It seems like I am missing a basic element here.

Thanks in advance,

Christina

--
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov Voice Mail 734-741-2446
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120214/8eec4186/attachment.html

From gas5x at yahoo.com Tue Feb 14 08:36:59 2012
From: gas5x at yahoo.com (Grigory Shamov)
Date: Tue, 14 Feb 2012 07:36:59 -0800 (PST)
Subject: [torqueusers] Basic torque config
In-Reply-To:
Message-ID: <1329233819.78038.YahooMailClassic@web111303.mail.gq1.yahoo.com>

Do you have a scheduler installed? Like, Maui, Moab?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120214/5d2230b7/attachment-0001.html

From christina.salls at noaa.gov Tue Feb 14 08:54:45 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Tue, 14 Feb 2012 10:54:45 -0500
Subject: [torqueusers] Basic torque config
In-Reply-To: <1329233819.78038.YahooMailClassic@web111303.mail.gq1.yahoo.com>
References: <1329233819.78038.YahooMailClassic@web111303.mail.gq1.yahoo.com>
Message-ID:

On Tue, Feb 14, 2012 at 10:36 AM, Grigory Shamov wrote:

> Do you have a scheduler installed? Like, Maui, Moab?
>

No I don't. My plan is to run Torque on a single cluster with one head node and 20 compute nodes. The user base is currently around 5 and may increase to 10. We are simply trying to manage the resource (in probably a FIFO manner) I was hoping to get away with the Torque scheduler because of the simplicity of the config. Do you think that is possible?

--
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov Voice Mail 734-741-2446
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120214/e12907b7/attachment.html

From gus at ldeo.columbia.edu Tue Feb 14 11:11:22 2012
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Tue, 14 Feb 2012 13:11:22 -0500
Subject: [torqueusers] Setting up torque on PC to test scripts?
In-Reply-To: <4F3A2351.6090307@gmail.com>
References: <4F3A2351.6090307@gmail.com>
Message-ID: <9C6E62CD-9AA9-4906-8ADD-474F3860128D@ldeo.columbia.edu>

On Feb 14, 2012, at 4:03 AM, Rainer M Krug wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 13/02/12 16:53, Gustavo Correa wrote:
>>
>> On Feb 13, 2012, at 6:42 AM, Rainer M Krug wrote:
>>
>> Hi
>>
>> I am thinking of setting up a dummy cluster on my local PC to test
>> my submit scripts. Is this easily possible? Or is there even a
>> virtual machine to download which has torque installed? That would
>> be the easiest.
>>
>> Thanks,
>>
>> Rainer
>>
>>> Hi Rainer
>>
>>> We did this on a few machines here, and a lot of people also use
>>> Torque in a single machine, as it is very convenient to schedule
>>> and control jobs.
>
> Interesting - I'll look into that.
>
>>
>>> Many Linux distributions have Torque packages. That is the fast
>>> way to install it. Check with your 'yum', 'apt-get' or similar.
>
> I am on Ubuntu and torque is there - great.
>
> I am just in the process of installing it.
>
>>> However, I don't think there is Maui also, but I haven't checked
>>> this lately.
>>
>>> If you want full control, choose the release, etc, you can
>>> download and install both from source in your standalone
>>> machine. Then run the three daemons, pbs_server, maui [or if you
>>> prefer it, pbs_sched], and pbs_mom on that machine. The setup
>>> for queues, etc, is about the same. It is simpler if you allow
>>> the server name just default to 'localhost'.
>
> Sounds easy - do I have to configure afterwards manually, or is the
> default fine? It would be a single machine (dual core) install?
>
> Rainer

Not sure about Ubuntu.
On Fedora it was plug and play, the defaults were fine.
All I needed to do was to setup queues, etc, and use chkconfig [now systemctl, argh!]
to start pbs_server, pbs_sched, and pbs_mom during init/boot.

I hope this helps,
Gus Correa

>>
>>> I hope this helps, Gus Correa
>>
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk86I1AACgkQoYgNqgF2egpWMwCfYGlrry48mRG2cwbmkk6zy+ev
> yFkAn1sDD4+55ApGs5aqIjZrbYaNyNzs
> =5qXW
> -----END PGP SIGNATURE-----

From lance at quantumbioinc.com Tue Feb 14 11:11:48 2012
From: lance at quantumbioinc.com (Lance Westerhoff)
Date: Tue, 14 Feb 2012 13:11:48 -0500
Subject: [torqueusers] procs= not working as documented (or understood?)
In-Reply-To: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE@quantumbioinc.com>
References: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE@quantumbioinc.com>
Message-ID: <48B32496-0D68-4355-8FA8-65A71CFEB7BB@quantumbioinc.com>

Hello All-

We're still having trouble with this feature, and we are starting to shop around for a torque/maui replacement in order to be able to use it. Before we do that however, I wanted to see if anyone has any thoughts on how to address the problem within torque/maui. Perhaps I simply don't understand the feature.

The versions of torque and maui we are using are:

torque-3.0.2
maui-3.2.6p21

Yes, we have tried newer versions of maui, but then the option doesn't work at all.

Here is the scenario (I also included the conversation from November below for more information). Conceptually, our software is almost infinitely scalable in the sense that there is very little overhead associated with interprocess communication.
Therefore, we do not require that all of the processes reside on a small number of nodes. In fact, we can stretch the processors to any and all nodes in the cluster with ~zero loss in performance. So we can literally have one node that has a single process running and another node that has 8 processes running.

Since we have that level of scalability, we don't want to have to lock ourselves into having to request resources using the "nodes=X:ppn=Y" style since this style requires that nodes open up or drain in order to use them. Since our users have a big mixture of single and multi-processor jobs, waiting for node drain can really waste a lot of resources.

I saw the "procs=#" option in the Requesting Resources table (see http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml#resources for more). It *appears* that this option should allow the user to request simply X*Y processors and the scheduler should be able to schedule them any way it can fit. So using the following #PBS note, we should be able to request 40 processors:

#PBS -l procs=40

Instead, we see that the scheduler seems to take this information, read it, and basically disregard it. The reason I know it reads it is because if I ask for say 40 processors and 40 processors are available in the cluster, it works as expected and all is right with the world. Where it gets a bit more choppy is when I ask for 40 processors and only 1 processor is available. The job doesn't wait in the queue for the remaining 39 processors to open up, and instead PBS simply starts the job on that processor. I can't see how that is anything but a bug. If the user is asking for 40 processors, why isn't the scheduler waiting for all 40 processors to open up?

I'll also post this to the maui list so I apologize if you receive it twice. I'm just not sure if this is a problem with torque, maui, or both. If answering this question will require additional information, please ask. We are at our wits' end here.

Thanks!
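
A possible stopgap on the job side, sketched here under assumptions rather than offered as a fix for the scheduler: Torque writes one line per allocated processor slot to the file named by $PBS_NODEFILE, so a job script can count those lines and bail out early when it was started with fewer slots than it asked for. The helper names count_slots and have_enough_slots and the hard-coded 40 are illustrative only; they are not part of Torque or Maui.

```shell
#!/bin/bash
#PBS -l procs=40

# Hypothetical helper: print the number of processor slots listed in a
# nodefile (one line per allocated slot).
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Hypothetical helper: succeed only if the nodefile ($2) lists at least
# as many slots as were requested ($1).
have_enough_slots() {
    local wanted=$1 nodefile=$2
    [ "$(count_slots "$nodefile")" -ge "$wanted" ]
}

# Only act when running inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    if ! have_enough_slots 40 "$PBS_NODEFILE"; then
        echo "Got $(count_slots "$PBS_NODEFILE") of 40 requested slots; exiting." >&2
        exit 1
    fi
fi
```

This only keeps an undersized job from using up its walltime; it does not make the scheduler hold the job until all 40 slots are free.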

-Lance

On Nov 18, 2011, at 11:12 AM, Lance Westerhoff wrote:

>
> Hi Steve-
>
> Here you go. Here is the top few lines of the job script. I have then provided the output you requested long with the maui.cfg. If you need anything further, certainly please let me know.
>
> Thanks for your help!
>
> ===============
>
> + head job.pbs
>
> #!/bin/bash
> #PBS -S /bin/bash
> #PBS -l procs=100
> #PBS -l pmem=700mb
> #PBS -l walltime=744:00:00
> #PBS -j oe
> #PBS -q batch
>
> Report run on Fri Nov 18 10:49:38 EST 2011
> + pbsnodes --version
> version: 3.0.2
> + diagnose --version
> maui client version 3.2.6p21
> + checkjob 371010
>
>
> checking job 371010
>
> State: Running
> Creds: user:josh group:games class:batch qos:DEFAULT
> WallTime: 00:02:35 of 31:00:00:00
> SubmitTime: Fri Nov 18 10:46:33
> (Time Queued Total: 00:00:01 Eligible: 00:00:01)
>
> StartTime: Fri Nov 18 10:46:34
> Total Tasks: 1
>
> Req[0] TaskCount: 26 Partition: DEFAULT
> Network: [NONE] Memory >= 700M Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> Dedicated Resources Per Task: PROCS: 1 MEM: 700M
> NodeCount: 10
> Allocated Nodes:
> [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3]
> [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2]
> [compute-0-13:2][compute-0-14:2]
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 1
> PartitionMask: [ALL]
> Flags: RESTARTABLE
>
> Reservation '371010' (-00:02:09 -> 30:23:57:51 Duration: 31:00:00:00)
> PE: 26.00 StartPriority: 4716
>
> + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]"
> SERVERHOST gondor
> ADMIN1 maui root
> ADMIN3 ALL
> RMCFG[base] TYPE=PBS
> AMCFG[bank] TYPE=NONE
> RMPOLLINTERVAL 00:01:00
> SERVERPORT 42559
> SERVERMODE NORMAL
> LOGFILE maui.log
> LOGFILEMAXSIZE 10000000
> LOGLEVEL 3
> QUEUETIMEWEIGHT 1
> FSPOLICY DEDICATEDPS
> FSDEPTH 7
> FSINTERVAL 86400
> FSDECAY 0.50
> FSWEIGHT 200
> FSUSERWEIGHT 1
> FSGROUPWEIGHT 1000
> FSQOSWEIGHT 1000
> FSACCOUNTWEIGHT 1
> FSCLASSWEIGHT 1000
> USERWEIGHT 4
> BACKFILLPOLICY FIRSTFIT
> RESERVATIONPOLICY CURRENTHIGHEST
> NODEALLOCATIONPOLICY MINRESOURCE
> RESERVATIONDEPTH 8
> MAXJOBPERUSERPOLICY OFF
> MAXJOBPERUSERCOUNT 8
> MAXPROCPERUSERPOLICY OFF
> MAXPROCPERUSERCOUNT 256
> MAXPROCSECONDPERUSERPOLICY OFF
> MAXPROCSECONDPERUSERCOUNT 36864000
> MAXJOBQUEUEDPERUSERPOLICY OFF
> MAXJOBQUEUEDPERUSERCOUNT 2
> JOBNODEMATCHPOLICY EXACTNODE
> NODEACCESSPOLICY SHARED
> JOBMAXOVERRUN 99:00:00:00
> DEFERCOUNT 8192
> DEFERTIME 0
> CLASSCFG[developer] FSTARGET=40.00+
> CLASSCFG[lowprio] PRIORITY=-1000
> SRCFG[developer] CLASSLIST=developer
> SRCFG[developer] ACCESS=dedicated
> SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri
> SRCFG[developer] STARTTIME=08:00:00
> SRCFG[developer] ENDTIME=18:00:00
> SRCFG[developer] TIMELIMIT=2:00:00
> SRCFG[developer] RESOURCES=PROCS(8)
> USERCFG[DEFAULT] FSTARGET=100.0
>
> ===============
>
> -Lance
>
>
> On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote:
>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>>
>> On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote:
>>
>>> The request that is placed is for procs=60. Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway).
>>
>> Hi Lance,
>>
>> Can you post the output of checkjob of an incorrectly running job. Let's take a look at what Maui thinks the job is asking for.
>>
>> Might as well add your maui.cfg file also.
>>
>> I've found in the past that procs= is troublesome...
>>
>>>
>>> I'm actually beginning to think the problem may be related to maui. Perhaps I'll post this same question to the maui list and see what comes back.
>>>
>>> This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system).
>>
>> Agreed. HPC cluster job management is normally be set it and forget it. Anything else other than maintenance/break fixes/new features would be ridiculously time consuming.
>>
>>>
>>> -Lance
>>>
>>>
>>>>
>>>> Message: 3
>>>> Date: Thu, 17 Nov 2011 17:29:17 -0800
>>>> From: "Brock Palen"
>>>> Subject: Re: [torqueusers] procs= not working as documented
>>>> To: "Torque Users Mailing List"
>>>> Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com>
>>>> Content-Type: text/plain; charset="utf-8"
>>>>
>>>> Does maui only see one cpu or does mpiexec only see one cpu?
>>>>
>>>> Brock Palen
>>>> (734)936-1985
>>>> brockp at umich.edu
>>>> - Sent from my Palm Pre, please excuse typos
>>>> On Nov 17, 2011 3:19 PM, Lance Westerhoff <lance at quantumbioinc.com> wrote:
>>>>
>>>> Hello All-
>>>>
>>>> It appears that when running with the following specs, the procs= option does not actually work as expected.
>>>>
>>>> ==========================================
>>>>
>>>> #PBS -S /bin/bash
>>>> #PBS -l procs=60
>>>> #PBS -l pmem=700mb
>>>> #PBS -l walltime=744:00:00
>>>> #PBS -j oe
>>>> #PBS -q batch
>>>>
>>>> torque version: tried 3.0.2. in v2.5.4, I think the procs option worked as documented
>>>>
>>>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU)
>>>>
>>>> ==========================================
>>>>
>>>> If there are fewer then 60 processors available in the cluster (in this case there were 53 available) the job will go in an take whatever is left instead of waiting for all 60 processors to free up. Any thoughts as to why this might be happening? Sometimes it doesn't really matter and 53 would be almost as good as 60, however if only 2 processors are available and the user asks for 60, I would hate for him to go in.
>>>>
>>>> Thank you for your time!
>>>>
>>>> -Lance
>>>>
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>> ----------------------
>> Steve Crusan
>> System Administrator
>> Center for Research Computing
>> University of Rochester
>> https://www.crc.rochester.edu/
>>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
>> Comment: GPGTools - http://gpgtools.org
>>
>> iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/
>> bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28
>> cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ
>> tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8
>> JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv
>> Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c=
>> =AGW7
>> -----END PGP SIGNATURE-----
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From gus at ldeo.columbia.edu Tue Feb 14 11:24:10 2012
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Tue, 14 Feb 2012 13:24:10 -0500
Subject: [torqueusers] Basic torque config
In-Reply-To: <1329233819.78038.YahooMailClassic@web111303.mail.gq1.yahoo.com>
References: <1329233819.78038.YahooMailClassic@web111303.mail.gq1.yahoo.com>
Message-ID: <789FA99A-FFE5-488D-A8F7-0E7957477D91@ldeo.columbia.edu>

Make sure pbs_sched [Xor alternatively maui, if you installed it] is running.

Also, as root, on the pbs_server computer, enable scheduling:

qmgr -c 'set server scheduling=True'

I hope this helps,
Gus Correa

On Feb 14, 2012, at 10:36 AM, Grigory Shamov wrote:

> Do you have a scheduler installed? Like, Maui, Moab?
>
> --- On Tue, 2/14/12, Christina Salls wrote:
>
> From: Christina Salls
> Subject: [torqueusers] Basic torque config
> To: "Torque Users Mailing List", "Brian Beagan", "John Cardenas", "Jeff Hanson", "Michael Saxon", "help >> GLERL IT Help", keenandr at msu.edu
> Date: Tuesday, February 14, 2012, 6:36 AM
>
> Hi all,
>
> I finally made some progress but am not all the way there yet. I changed the hostname of the server to admin, which is the hostname assigned to the interface that the compute nodes are physically connected to. Now my pbsnodes command shows the nodes as free!!
>
> [root at wings torque]# pbsnodes -a
> n001.default.domain
>      state = free
>      np = 1
>      ntype = cluster
>      status = rectime=1328910309,varattr=,jobs=,state=free,netload=700143,gres=,loadave=0.02,ncpus=24,physmem=20463136kb,availmem=27835692kb,totmem=28655128kb,idletime=1502,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
>      gpus = 0
>
> n002.default.domain
>      state = free
>      np = 1
>      ntype = cluster
>      status = rectime=1328910310,varattr=,jobs=,state=free,netload=712138,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31894548kb,totmem=32792076kb,idletime=1510,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux
>      gpus = 0
>
> ....For all 20 nodes.
>
> And now when I submit a job, I get a job id back, however, the jobs stays in the queue state.
>
> -bash-4.1$ ./example_submit_script_1
> Fri Feb 10 15:46:35 CST 2012
> Fri Feb 10 15:46:45 CST 2012
> -bash-4.1$ ./example_submit_script_1 | qsub
> 6.admin.default.domain
> -bash-4.1$ qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 4.wings                   STDIN            salls                  0 Q batch
> 5.wings                   STDIN            salls                  0 Q batch
> 6.admin                   STDIN            salls                  0 Q batch
>
> I deleted the two jobs that were created when wings was the server in case they were getting in the way
>
> [root at wings torque]# qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 6.admin                   STDIN            salls                  0 Q batch
> [root at wings torque]# qstat -a
>
> admin.default.domain:
>                                                                          Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
> 6.admin.default.     salls    batch    STDIN               --    --   --    --    --  Q   --
> [root at wings torque]#
>
>
> I don't see anything that seems significant in the logs:
>
> Lots of entries like this in the server log:
> 02/14/2012 08:05:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
> 02/14/2012 08:10:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
> 02/14/2012 08:15:10;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0
>
> This is the entirety of the sched_log:
>
> 02/10/2012 07:06:52;0002; pbs_sched;Svr;Log;Log opened
> 02/10/2012 07:06:52;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120210 opened
> 02/10/2012 07:06:52;0002; pbs_sched;Svr;main;pbs_sched startup pid 12576
> 02/10/2012 07:09:14;0080; pbs_sched;Svr;main;brk point 6848512
> 02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log opened
> 02/10/2012 15:45:04;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::Address already in use (98) in main, bind
> 02/10/2012 15:45:04;0002; pbs_sched;Svr;die;abnormal termination
> 02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log closed
>
> mom logs on the compute nodes have the same multiple entries:
>
> 02/14/2012 08:03:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/14/2012 08:08:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/14/2012 08:13:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/14/2012 08:18:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
> 02/14/2012 08:23:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
>
> ps looks like this:
>
> -bash-4.1$ ps -ef | grep pbs
> root     12576     1  0 Feb10 ?        00:00:00 pbs_sched
> salls    12727 26862  0 08:19 pts/0    00:00:00 grep pbs
> root     25810     1  0 Feb10 ?        00:00:25 pbs_server -H admin.default.domain
>
> The server and queue settings are as follows:
>
> Qmgr: list server
> Server admin.default.domain
>      server_state = Active
>      scheduling = True
>      total_jobs = 1
>      state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
>      acl_hosts = admin.default.domain,wings.glerl.noaa.gov
>      default_queue = batch
>      log_events = 511
>      mail_from = adm
>      scheduler_iteration = 600
>      node_check_rate = 150
>      tcp_timeout = 6
>      mom_job_sync = True
>      pbs_version = 2.5.9
>      keep_completed = 300
>      next_job_number = 7
>      net_counter = 1 0 0
>
> Qmgr: list queue batch
> Queue batch
>      queue_type = Execution
>      Priority = 100
>      total_jobs = 1
>      state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
>      max_running = 300
>      mtime = Thu Feb 9 18:22:33 2012
>      enabled = True
>      started = True
>
> Do I need to create a routing queue? It seems like I am missing a basic element here.
>
> Thanks in advance,
>
> Christina
>
> --
> Christina A. Salls
> GLERL Computer Group
> help.glerl at noaa.gov
> Help Desk x2127
> Christina.Salls at noaa.gov
> Voice Mail 734-741-2446
>
> -----Inline Attachment Follows-----
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From christina.salls at noaa.gov Tue Feb 14 11:28:32 2012
From: christina.salls at noaa.gov (Christina Salls)
Date: Tue, 14 Feb 2012 13:28:32 -0500
Subject: [torqueusers] Basic torque config
In-Reply-To: <789FA99A-FFE5-488D-A8F7-0E7957477D91@ldeo.columbia.edu>
References: <1329233819.78038.YahooMailClassic@web111303.mail.gq1.yahoo.com> <789FA99A-FFE5-488D-A8F7-0E7957477D91@ldeo.columbia.edu>
Message-ID:

On Tue, Feb 14, 2012 at 1:24 PM, Gustavo Correa wrote:

> Make sure pbs_sched [Xor
> alternatively maui, if you installed it] is running.
>

Thanks for the response.

It appears to be running.

[root at wings etc]# ps -ef | grep pbs
root      6896  6509  0 12:25 pts/24   00:00:00 grep pbs
root     12576     1  0 Feb10 ?        00:00:00 pbs_sched
root     25810     1  0 Feb10 ?        00:00:26 pbs_server -H admin.default.domain

>
> Also, as root, on the pbs_server computer, enable scheduling:
>
> qmgr -c 'set server scheduling=True'
>

And it appears that server scheduling is already set for True

[root at wings etc]# qmgr
Max open servers: 10239
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch Priority = 100
set queue batch max_running = 300
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = admin.default.domain
set server acl_hosts += wings.glerl.noaa.gov
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 8

By the way, what is the best way to get both the server and scheduler to start at run time?

>
> I hope this helps,
> Gus Correa
>

--
Christina A.
Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120214/75ddcf80/attachment-0001.html From gus at ldeo.columbia.edu Tue Feb 14 11:43:30 2012 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Tue, 14 Feb 2012 13:43:30 -0500 Subject: [torqueusers] Basic torque config In-Reply-To: References: <1329233819.78038.YahooMailClassic@web111303.mail.gq1.yahoo.com> <789FA99A-FFE5-488D-A8F7-0E7957477D91@ldeo.columbia.edu> Message-ID: On Feb 14, 2012, at 1:28 PM, Christina Salls wrote: > > > On Tue, Feb 14, 2012 at 1:24 PM, Gustavo Correa wrote: > Make sure pbs_sched [Xor alternatively maui, if you installed it] is running. > > Thanks for the response. > > It appears to be running. > > [root at wings etc]# ps -ef | grep pbs > root 6896 6509 0 12:25 pts/24 00:00:00 grep pbs > root 12576 1 0 Feb10 ? 00:00:00 pbs_sched > root 25810 1 0 Feb10 ? 00:00:26 pbs_server -H admin.default.domain > > > Also, as root, on the pbs_server computer, enable scheduling: > qmgr -c 'set server scheduling=True' > > And it appears that server scheduling is already set for True > > [root at wings etc]# qmgr > Max open servers: 10239 > Qmgr: print server > # > # Create queues and set their attributes. > # > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch Priority = 100 > set queue batch max_running = 300 > set queue batch enabled = True > set queue batch started = True > # > # Set server attributes. 
> #
> set server scheduling = True
> set server acl_hosts = admin.default.domain
> set server acl_hosts += wings.glerl.noaa.gov
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server keep_completed = 300
> set server next_job_number = 8
>

If you made changes in the nodes file, etc, restart the server, etc, just in case:

service pbs_server restart
service pbs_sched restart
service pbs_mom restart   [this one on the compute nodes]

Then check the pbs_server logs [$TORQUE/server_logs] and the system logs in the computer where pbs_server runs [/var/log/messages].
There may be messages in either one with hints about the actual problem.

> By the way, what is the best way to get both the server and scheduler to start at run time?
>

It depends on your OS and Linux distribution.
Normally you put the pbs_sched and pbs_server scripts in /etc/init.d [they come in the Torque 'contrib' directory, I think, but if you installed from RPMs or other packages they may already be there].
On the compute nodes you put pbs_mom there.
If your pbs_server computer will also be used as a compute node, add pbs_mom there too.
Then schedule them to start at init/boot time with chkconfig [which the Fedora folks bundled now into something called systemctl, in case you use Fedora].

I hope it helps,
Gus Correa

From glen.beane at gmail.com Tue Feb 14 11:59:11 2012
From: glen.beane at gmail.com (Glen Beane)
Date: Tue, 14 Feb 2012 13:59:11 -0500
Subject: [torqueusers] procs= not working as documented (or understood?)
In-Reply-To: <48B32496-0D68-4355-8FA8-65A71CFEB7BB@quantumbioinc.com>
References: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE@quantumbioinc.com> <48B32496-0D68-4355-8FA8-65A71CFEB7BB@quantumbioinc.com>
Message-ID:

-l procs= is supported by Moab and pbs_sched, I do not believe that Maui handles it properly.

On Tue, Feb 14, 2012 at 1:11 PM, Lance Westerhoff wrote:
>
> Hello All-
>
> We're still having trouble with this feature, and we are starting to shop around for a torque/maui replacement in order to be able to use it. Before we do that however, I wanted to see if anyone has any thoughts on how to address the problem within torque/maui. Perhaps I simply don't understand the feature. The versions of torque and maui we are using are.
>
>        torque-3.0.2
>        maui-3.2.6p21
>
> Yes, we have tried newer versions of maui, but then the option doesn't work at all.
>
> Here is the scenario (I also included the conversation from November below for more information).
>
> Conceptually, our software is almost infinitely scalable in the sense that there is very little overhead associated with interprocess communication. Therefore, we do not require that all of the processes reside on a small number of nodes. In fact, we can stretch the processors to any and all nodes in the cluster with ~zero loss in performance. So we can literally have one node that has a single process running and another node that has 8 processes running.
> Since we have that level of scalability, we don't want to have to lock ourselves into having to request resources using the "nodes=X:ppn=Y" style since this style requires that nodes open up or drain in order to use them. Since our users have a big mixture of single and multi-processor jobs, waiting for node drain can really waste a lot of resources.
>
> I saw "procs=#" in the Requesting Resources table (see http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml#resources for more). It *appears* that this option should allow the user to request simply X*Y processors and the scheduler should be able to schedule them any way it can fit. So using the following #PBS note, we should be able to request 40 processors:
>
> #PBS -l procs=40
>
> Instead, we see that the scheduler seems to take this information, read it, and basically disregard it. The reason I know it reads it is because if I ask for say 40 processors and 40 processors are available in the cluster, it works as expected and all is right with the world. Where it gets a bit more choppy is when I ask for 40 processors and only 1 processor is available. The job doesn't wait in the queue for the remaining 39 processors to open up, and instead PBS simply just starts the job on that processor. I can't see how that is anything but a bug. If the user is asking for 40 processors, why isn't the scheduler waiting for all 40 processors to open up?
>
> I'll also post this to the maui list so I apologize if you receive it twice. I'm just not sure if this is a problem with torque, maui, or both. If answering this question will require additional information, please ask. We are at our wits' end here.
>
> Thanks!
>
> -Lance
>
>
> On Nov 18, 2011, at 11:12 AM, Lance Westerhoff wrote:
>
>>
>> Hi Steve-
>>
>> Here you go. Here are the top few lines of the job script. I have then provided the output you requested along with the maui.cfg. If you need anything further, certainly please let me know.
>>
>> Thanks for your help!
>>
>> ===============
>>
>> + head job.pbs
>>
>> #!/bin/bash
>> #PBS -S /bin/bash
>> #PBS -l procs=100
>> #PBS -l pmem=700mb
>> #PBS -l walltime=744:00:00
>> #PBS -j oe
>> #PBS -q batch
>>
>> Report run on Fri Nov 18 10:49:38 EST 2011
>> + pbsnodes --version
>> version: 3.0.2
>> + diagnose --version
>> maui client version 3.2.6p21
>> + checkjob 371010
>>
>>
>> checking job 371010
>>
>> State: Running
>> Creds:  user:josh  group:games  class:batch  qos:DEFAULT
>> WallTime: 00:02:35 of 31:00:00:00
>> SubmitTime: Fri Nov 18 10:46:33
>>   (Time Queued  Total: 00:00:01  Eligible: 00:00:01)
>>
>> StartTime: Fri Nov 18 10:46:34
>> Total Tasks: 1
>>
>> Req[0]  TaskCount: 26  Partition: DEFAULT
>> Network: [NONE]  Memory >= 700M  Disk >= 0  Swap >= 0
>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>> Dedicated Resources Per Task: PROCS: 1  MEM: 700M
>> NodeCount: 10
>> Allocated Nodes:
>> [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3]
>> [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2]
>> [compute-0-13:2][compute-0-14:2]
>>
>>
>> IWD: [NONE]  Executable:  [NONE]
>> Bypass: 0  StartCount: 1
>> PartitionMask: [ALL]
>> Flags:       RESTARTABLE
>>
>> Reservation '371010' (-00:02:09 -> 30:23:57:51  Duration: 31:00:00:00)
>> PE:  26.00  StartPriority:  4716
>>
>> + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]"
>> SERVERHOST                  gondor
>> ADMIN1                      maui root
>> ADMIN3                      ALL
>> RMCFG[base]  TYPE=PBS
>> AMCFG[bank]  TYPE=NONE
>> RMPOLLINTERVAL              00:01:00
>> SERVERPORT                  42559
>> SERVERMODE                  NORMAL
>> LOGFILE                     maui.log
>> LOGFILEMAXSIZE              10000000
>> LOGLEVEL                    3
>> QUEUETIMEWEIGHT             1
>> FSPOLICY                    DEDICATEDPS
>> FSDEPTH                     7
>> FSINTERVAL                  86400
>> FSDECAY                     0.50
>> FSWEIGHT                    200
>> FSUSERWEIGHT                1
>> FSGROUPWEIGHT               1000
>> FSQOSWEIGHT                 1000
>> FSACCOUNTWEIGHT             1
>> FSCLASSWEIGHT               1000
>> USERWEIGHT                  4
>> BACKFILLPOLICY              FIRSTFIT
>> RESERVATIONPOLICY           CURRENTHIGHEST
>> NODEALLOCATIONPOLICY        MINRESOURCE
>> RESERVATIONDEPTH            8
>> MAXJOBPERUSERPOLICY         OFF
>> MAXJOBPERUSERCOUNT          8
>> MAXPROCPERUSERPOLICY        OFF
>> MAXPROCPERUSERCOUNT         256
>> MAXPROCSECONDPERUSERPOLICY  OFF
>> MAXPROCSECONDPERUSERCOUNT   36864000
>> MAXJOBQUEUEDPERUSERPOLICY   OFF
>> MAXJOBQUEUEDPERUSERCOUNT    2
>> JOBNODEMATCHPOLICY          EXACTNODE
>> NODEACCESSPOLICY            SHARED
>> JOBMAXOVERRUN               99:00:00:00
>> DEFERCOUNT                  8192
>> DEFERTIME                   0
>> CLASSCFG[developer] FSTARGET=40.00+
>> CLASSCFG[lowprio] PRIORITY=-1000
>> SRCFG[developer] CLASSLIST=developer
>> SRCFG[developer] ACCESS=dedicated
>> SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri
>> SRCFG[developer] STARTTIME=08:00:00
>> SRCFG[developer] ENDTIME=18:00:00
>> SRCFG[developer] TIMELIMIT=2:00:00
>> SRCFG[developer] RESOURCES=PROCS(8)
>> USERCFG[DEFAULT] FSTARGET=100.0
>>
>> ===============
>>
>> -Lance
>>
>>
>> On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote:
>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>>
>>> On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote:
>>>
>>>> The request that is placed is for procs=60. Both torque and maui see that there are only 53 processors available and instead of letting the job sit in the queue and wait for all 60 processors to become available, it goes ahead and runs the job with what's available. Now if the user could ask for procs=[50-60] where 50 is the minimum number of processors to provide and 60 is the maximum, this would be a feature. But as it stands, if the user asks for 60 processors and ends up with 2 processors, the job just won't scale properly and he may as well kill it (when it shouldn't have run anyway).
>>>
>>> Hi Lance,
>>>
>>>      Can you post the output of checkjob of an incorrectly running job.
Let's take a look at what Maui thinks the job is asking for. >>> >>> ? ? ?Might as well add your maui.cfg file also. >>> >>> ? ? ?I've found in the past that procs= is troublesome... >>> >>>> >>>> I'm actually beginning to think the problem may be related to maui. Perhaps I'll post this same question to the maui list and see what comes back. >>>> >>>> This problem is infuriating though since without the functionality working as it should, using procs=X in torque/maui makes torque/maui work more like a submission and run system (not a queuing system). >>> >>> Agreed. HPC cluster job management is normally be set it and forget it. Anything else other than maintenance/break fixes/new features would be ridiculously time consuming. >>> >>>> >>>> -Lance >>>> >>>> >>>>> >>>>> Message: 3 >>>>> Date: Thu, 17 Nov 2011 17:29:17 -0800 >>>>> From: "Brock Palen" >>>>> Subject: Re: [torqueusers] procs= not working as documented >>>>> To: "Torque Users Mailing List" >>>>> Message-ID: <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> >>>>> Content-Type: text/plain; charset="utf-8" >>>>> >>>>> Does maui only see one cpu or does mpiexec only see one cpu? >>>>> >>>>> >>>>> >>>>> Brock Palen >>>>> (734)936-1985 >>>>> brockp at umich.edu >>>>> - Sent from my Palm Pre, please excuse typos >>>>> On Nov 17, 2011 3:19 PM, Lance Westerhoff <lance at quantumbioinc.com> wrote: >>>>> >>>>> >>>>> >>>>> Hello All- >>>>> >>>>> >>>>> >>>>> It appears that when running with the following specs, the procs= option does not actually work as expected. >>>>> >>>>> >>>>> >>>>> ========================================== >>>>> >>>>> >>>>> >>>>> #PBS -S /bin/bash >>>>> >>>>> #PBS -l procs=60 >>>>> >>>>> #PBS -l pmem=700mb >>>>> >>>>> #PBS -l walltime=744:00:00 >>>>> >>>>> #PBS -j oe >>>>> >>>>> #PBS -q batch >>>>> >>>>> >>>>> >>>>> torque version: tried 3.0.2. 
in v2.5.4, I think the procs option worked as documented >>>>> >>>>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete fail in terms of the procs option and it only asks for a single CPU) >>>>> >>>>> >>>>> >>>>> ========================================== >>>>> >>>>> >>>>> >>>>> If there are fewer then 60 processors available in the cluster (in this case there were 53 available) the job will go in an take whatever is left instead of waiting for all 60 processors to free up. Any thoughts as to why this might be happening? Sometimes it doesn't really matter and 53 would be almost as good as 60, however if only 2 processors are available and the user asks for 60, I would hate for him to go in. >>>>> >>>>> >>>>> >>>>> Thank you for your time! >>>>> >>>>> >>>>> >>>>> -Lance >>>>> >>>>> >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> torqueusers mailing list >>>> torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >>> ---------------------- >>> Steve Crusan >>> System Administrator >>> Center for Research Computing >>> University of Rochester >>> https://www.crc.rochester.edu/ >>> >>> >>> -----BEGIN PGP SIGNATURE----- >>> Version: GnuPG/MacGPG2 v2.0.17 (Darwin) >>> Comment: GPGTools - http://gpgtools.org >>> >>> iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ >>> bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 >>> cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ >>> tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 >>> JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv >>> Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= >>> =AGW7 >>> -----END PGP SIGNATURE----- >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> > > _______________________________________________ 
> torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From lance at quantumbioinc.com Tue Feb 14 12:16:25 2012 From: lance at quantumbioinc.com (Lance Westerhoff) Date: Tue, 14 Feb 2012 14:16:25 -0500 Subject: [torqueusers] procs= not working as documented (or understood?) In-Reply-To: References: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE@quantumbioinc.com> <48B32496-0D68-4355-8FA8-65A71CFEB7BB@quantumbioinc.com> Message-ID: <8333B59E-1721-405F-ACEC-5B3D2E9181D9@quantumbioinc.com> Hi Glen- Yeah, that matches what we have seen as well. I have tried several different maui versions, and the problem seems to be pretty consistent. I assume that pbs_sched does not yet have FAIRSHARE - is that correct? I have put in a request for a quote for Moab, so we'll see what they come back with. Related question: if I wish to change to a different scheduler, do I need to shut down all of currently running jobs, or will jobs scheduled using the old scheduler (maui) run unaffected when I switch to the new scheduler? Thanks! -Lance On Feb 14, 2012, at 1:59 PM, Glen Beane wrote: > -l procs= is supported by Moab and pbs_sched, I do not believe that > Maui handles it properly. > > > > > On Tue, Feb 14, 2012 at 1:11 PM, Lance Westerhoff > wrote: >> >> Hello All- >> >> We're still having trouble with this feature, and we are starting to shop around for a torque/maui replacement in order to be able to use it. Before we do that however, I wanted to see if anyone has any thoughts on how to address the problem within torque/maui. Perhaps I simply don't understand the feature. The versions of torque and maui we are using are. >> >> torque-3.0.2 >> maui-3.2.6p21 >> >> Yes, we have tried newer versions of maui, but then the option doesn't work at all. >> >> Here is the scenario (I also included the conversation from November below for more information). 
From glen.beane at gmail.com Tue Feb 14 12:21:53 2012
From: glen.beane at gmail.com (Glen Beane)
Date: Tue, 14 Feb 2012 14:21:53 -0500
Subject: [torqueusers] procs= not working as documented (or understood?)
In-Reply-To: <8333B59E-1721-405F-ACEC-5B3D2E9181D9@quantumbioinc.com>
References: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE@quantumbioinc.com> <48B32496-0D68-4355-8FA8-65A71CFEB7BB@quantumbioinc.com> <8333B59E-1721-405F-ACEC-5B3D2E9181D9@quantumbioinc.com>
Message-ID:

If you switch schedulers, you should be able to just stop your old scheduler, let the running jobs continue to run, and then start up your new scheduler. There is no need to kill any running jobs.

pbs_sched is a simple FIFO scheduler; it doesn't have fairshare.

On Tue, Feb 14, 2012 at 2:16 PM, Lance Westerhoff wrote:
>
> Hi Glen-
>
> Yeah, that matches what we have seen as well. I have tried several different maui versions, and the problem seems to be pretty consistent.
>
> I assume that pbs_sched does not yet have FAIRSHARE - is that correct?
>
> I have put in a request for a quote for Moab, so we'll see what they come back with.
>
> Related question: if I wish to change to a different scheduler, do I need to shut down all of currently running jobs, or will jobs scheduled using the old scheduler (maui) run unaffected when I switch to the new scheduler?
>
> Thanks!
>
> -Lance
>
> On Feb 14, 2012, at 1:59 PM, Glen Beane wrote:
>
>> -l procs= is supported by Moab and pbs_sched, I do not believe that
>> Maui handles it properly.
>>
>> On Tue, Feb 14, 2012 at 1:11 PM, Lance Westerhoff wrote:
>>>
>>> Hello All-
>>>
>>> We're still having trouble with this feature, and we are starting to shop around for a torque/maui replacement in order to be able to use it. Before we do that however, I wanted to see if anyone has any thoughts on how to address the problem within torque/maui. Perhaps I simply don't understand the feature. The versions of torque and maui we are using are.
>>>
>>>        torque-3.0.2
>>>        maui-3.2.6p21
>>>
>>> Yes, we have tried newer versions of maui, but then the option doesn't work at all.
From Gareth.Williams at csiro.au Tue Feb 14 16:27:25 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Wed, 15 Feb 2012 10:27:25
+1100
Subject: [torqueusers] procs= not working as documented (or understood?)
In-Reply-To: <48B32496-0D68-4355-8FA8-65A71CFEB7BB@quantumbioinc.com>
References: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE@quantumbioinc.com>
	<48B32496-0D68-4355-8FA8-65A71CFEB7BB@quantumbioinc.com>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74E33@exvic-mbx04.nexus.csiro.au>

> -----Original Message-----
> From: Lance Westerhoff [mailto:lance at quantumbioinc.com]
> Sent: Wednesday, 15 February 2012 5:12 AM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] procs= not working as documented (or
> understood?)
>
>
> Hello All-
>
> We're still having trouble with this feature, and we are starting to
> shop around for a torque/maui replacement in order to be able to use
> it. Before we do that however, I wanted to see if anyone has any
> thoughts on how to address the problem within torque/maui. Perhaps I
> simply don't understand the feature. The versions of torque and maui we
> are using are:
>
> torque-3.0.2
> maui-3.2.6p21
>
> Yes, we have tried newer versions of maui, but then the option doesn't
> work at all.
>
> Here is the scenario (I also included the conversation from November
> below for more information).
>
> Conceptually, our software is almost infinitely scalable in the sense
> that there is very little overhead associated with interprocess
> communication. Therefore, we do not require that all of the processes
> reside on a small number of nodes. In fact, we can stretch the
> processors to any and all nodes in the cluster with ~zero loss in
> performance. So we can literally have one node that has a single
> process running and another node that has 8 processes running. Since we
> have that level of scalability, we don't want to have to lock ourselves
> into having to request resources using the "nodes=X:ppn=Y" style since
> this style requires that nodes open up or drain in order to use them.
> Since our users have a big mixture of single and multi-processor jobs,
> waiting for node drain can really waste a lot of resources.
>
> I saw "procs=#" in the Requesting Resources table (see
> http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml#resources
> for more). It *appears* that this option should be able to allow
> the user to request simply X*Y processors and the scheduler should be
> able to schedule them any way it can fit. So using the following #PBS
> directive, we should be able to request 40 processors:
>
> #PBS -l procs=40
>
> Instead, we see that the scheduler seems to take this information, read
> it, and basically disregard it. The reason I know it reads it is
> because if I ask for say 40 processors and 40 processors are available
> in the cluster, it works as expected and all is right with the world.
> Where it gets a bit more choppy is when I ask for 40 processors and
> only 1 processor is available. The job doesn't wait in the queue for
> the remaining 39 processors to open up, and instead PBS simply just
> starts the job on that processor. I can't see how that is anything but
> a bug. If the user is asking for 40 processors, why isn't the scheduler
> waiting for all 40 processors to open up?
>
> I'll also post this to the maui list so I apologize if you receive it
> twice. I'm just not sure if this is a problem with torque, maui, or
> both. If answering this question will require additional information,
> please ask. We are at our wits' end here.
>
> Thanks!
>
> -Lance

Hi Lance,

Requesting -l nodes=40 is more or less equivalent to requesting
-l procs=40 _if_ you are using maui _and_ you don't set
JOBNODEMATCHPOLICY to EXACTNODE.

You may need to 'fake' having a large number of nodes to make this work.
There are old mailing list posts describing such a setup and how to
'fake' the number of nodes. We've never actually done this at our site,
so I'm unsure of the details. I don't like it :-) but it may suit/help you.
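
A minimal sketch of the check this condition suggests, assuming a maui.cfg in the same "KEY VALUE" layout as the one quoted further down this thread. The two-line sample file and the temp path are made up for illustration, and the wording of the verdict only restates the condition given above; it is not verified Maui behaviour.

```shell
#!/bin/sh
# Illustrative only: write a tiny sample config to a temp file.
# On a real system, point cfg at your own maui.cfg instead (assumption).
cfg=$(mktemp)
printf '%s\n' 'NODEACCESSPOLICY SHARED' 'JOBNODEMATCHPOLICY EXACTNODE' > "$cfg"

# Extract the value of JOBNODEMATCHPOLICY (empty if the line is absent).
policy=$(awk '$1 == "JOBNODEMATCHPOLICY" { print $2 }' "$cfg")

# Restate the condition from the reply: the nodes=/procs= equivalence
# is only expected when EXACTNODE is not set.
if [ "$policy" = "EXACTNODE" ]; then
    verdict="EXACTNODE is set: nodes=40 is unlikely to behave like procs=40"
else
    verdict="EXACTNODE is not set: nodes=40 may behave roughly like procs=40"
fi
echo "$verdict"

rm -f "$cfg"
```

The maui.cfg quoted from November in this thread does contain JOBNODEMATCHPOLICY EXACTNODE, so on that cluster the nodes=40 form would probably not be a drop-in substitute unless that setting were changed.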
Gareth > > > > > On Nov 18, 2011, at 11:12 AM, Lance Westerhoff wrote: > > > > > Hi Steve- > > > > Here you go. Here is the top few lines of the job script. I have then > provided the output you requested long with the maui.cfg. If you need > anything further, certainly please let me know. > > > > Thanks for your help! > > > > =============== > > > > + head job.pbs > > > > #!/bin/bash > > #PBS -S /bin/bash > > #PBS -l procs=100 > > #PBS -l pmem=700mb > > #PBS -l walltime=744:00:00 > > #PBS -j oe > > #PBS -q batch > > > > Report run on Fri Nov 18 10:49:38 EST 2011 > > + pbsnodes --version > > version: 3.0.2 > > + diagnose --version > > maui client version 3.2.6p21 > > + checkjob 371010 > > > > > > checking job 371010 > > > > State: Running > > Creds: user:josh group:games class:batch qos:DEFAULT > > WallTime: 00:02:35 of 31:00:00:00 > > SubmitTime: Fri Nov 18 10:46:33 > > (Time Queued Total: 00:00:01 Eligible: 00:00:01) > > > > StartTime: Fri Nov 18 10:46:34 > > Total Tasks: 1 > > > > Req[0] TaskCount: 26 Partition: DEFAULT > > Network: [NONE] Memory >= 700M Disk >= 0 Swap >= 0 > > Opsys: [NONE] Arch: [NONE] Features: [NONE] > > Dedicated Resources Per Task: PROCS: 1 MEM: 700M > > NodeCount: 10 > > Allocated Nodes: > > [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3] > > [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2] > > [compute-0-13:2][compute-0-14:2] > > > > > > IWD: [NONE] Executable: [NONE] > > Bypass: 0 StartCount: 1 > > PartitionMask: [ALL] > > Flags: RESTARTABLE > > > > Reservation '371010' (-00:02:09 -> 30:23:57:51 Duration: > 31:00:00:00) > > PE: 26.00 StartPriority: 4716 > > > > + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]" > > SERVERHOST gondor > > ADMIN1 maui root > > ADMIN3 ALL > > RMCFG[base] TYPE=PBS > > AMCFG[bank] TYPE=NONE > > RMPOLLINTERVAL 00:01:00 > > SERVERPORT 42559 > > SERVERMODE NORMAL > > LOGFILE maui.log > > LOGFILEMAXSIZE 10000000 > > LOGLEVEL 3 > > QUEUETIMEWEIGHT 1 > > FSPOLICY DEDICATEDPS > > 
FSDEPTH 7 > > FSINTERVAL 86400 > > FSDECAY 0.50 > > FSWEIGHT 200 > > FSUSERWEIGHT 1 > > FSGROUPWEIGHT 1000 > > FSQOSWEIGHT 1000 > > FSACCOUNTWEIGHT 1 > > FSCLASSWEIGHT 1000 > > USERWEIGHT 4 > > BACKFILLPOLICY FIRSTFIT > > RESERVATIONPOLICY CURRENTHIGHEST > > NODEALLOCATIONPOLICY MINRESOURCE > > RESERVATIONDEPTH 8 > > MAXJOBPERUSERPOLICY OFF > > MAXJOBPERUSERCOUNT 8 > > MAXPROCPERUSERPOLICY OFF > > MAXPROCPERUSERCOUNT 256 > > MAXPROCSECONDPERUSERPOLICY OFF > > MAXPROCSECONDPERUSERCOUNT 36864000 > > MAXJOBQUEUEDPERUSERPOLICY OFF > > MAXJOBQUEUEDPERUSERCOUNT 2 > > JOBNODEMATCHPOLICY EXACTNODE > > NODEACCESSPOLICY SHARED > > JOBMAXOVERRUN 99:00:00:00 > > DEFERCOUNT 8192 > > DEFERTIME 0 > > CLASSCFG[developer] FSTARGET=40.00+ > > CLASSCFG[lowprio] PRIORITY=-1000 > > SRCFG[developer] CLASSLIST=developer > > SRCFG[developer] ACCESS=dedicated > > SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri > > SRCFG[developer] STARTTIME=08:00:00 > > SRCFG[developer] ENDTIME=18:00:00 > > SRCFG[developer] TIMELIMIT=2:00:00 > > SRCFG[developer] RESOURCES=PROCS(8) > > USERCFG[DEFAULT] FSTARGET=100.0 > > > > =============== > > > > -Lance > > > > > > On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote: > > > >> -----BEGIN PGP SIGNED MESSAGE----- > >> Hash: SHA1 > >> > >> > >> On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: > >> > >>> The request that is placed is for procs=60. Both torque and maui > see that there are only 53 processors available and instead of letting > the job sit in the queue and wait for all 60 processors to become > available, it goes ahead and runs the job with what's available. Now if > the user could ask for procs=[50-60] where 50 is the minimum number of > processors to provide and 60 is the maximum, this would be a feature. > But as it stands, if the user asks for 60 processors and ends up with 2 > processors, the job just won't scale properly and he may as well kill > it (when it shouldn't have run anyway). 
> >> > >> Hi Lance, > >> > >> Can you post the output of checkjob of an incorrectly > running job. Let's take a look at what Maui thinks the job is asking > for. > >> > >> Might as well add your maui.cfg file also. > >> > >> I've found in the past that procs= is troublesome... > >> > >>> > >>> I'm actually beginning to think the problem may be related to maui. > Perhaps I'll post this same question to the maui list and see what > comes back. > >>> > >>> This problem is infuriating though since without the functionality > working as it should, using procs=X in torque/maui makes torque/maui > work more like a submission and run system (not a queuing system). > >> > >> Agreed. HPC cluster job management is normally be set it and forget > it. Anything else other than maintenance/break fixes/new features would > be ridiculously time consuming. > >> > >>> > >>> -Lance > >>> > >>> > >>>> > >>>> Message: 3 > >>>> Date: Thu, 17 Nov 2011 17:29:17 -0800 > >>>> From: "Brock Palen" > >>>> Subject: Re: [torqueusers] procs= not working as documented > >>>> To: "Torque Users Mailing List" > >>>> Message-ID: > <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> > >>>> Content-Type: text/plain; charset="utf-8" > >>>> > >>>> Does maui only see one cpu or does mpiexec only see one cpu? > >>>> > >>>> > >>>> > >>>> Brock Palen > >>>> (734)936-1985 > >>>> brockp at umich.edu > >>>> - Sent from my Palm Pre, please excuse typos > >>>> On Nov 17, 2011 3:19 PM, Lance Westerhoff > <lance at quantumbioinc.com> wrote: > >>>> > >>>> > >>>> > >>>> Hello All- > >>>> > >>>> > >>>> > >>>> It appears that when running with the following specs, the procs= > option does not actually work as expected. 
> >>>> > >>>> > >>>> > >>>> ========================================== > >>>> > >>>> > >>>> > >>>> #PBS -S /bin/bash > >>>> > >>>> #PBS -l procs=60 > >>>> > >>>> #PBS -l pmem=700mb > >>>> > >>>> #PBS -l walltime=744:00:00 > >>>> > >>>> #PBS -j oe > >>>> > >>>> #PBS -q batch > >>>> > >>>> > >>>> > >>>> torque version: tried 3.0.2. in v2.5.4, I think the procs option > worked as documented > >>>> > >>>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete > fail in terms of the procs option and it only asks for a single CPU) > >>>> > >>>> > >>>> > >>>> ========================================== > >>>> > >>>> > >>>> > >>>> If there are fewer then 60 processors available in the cluster (in > this case there were 53 available) the job will go in an take whatever > is left instead of waiting for all 60 processors to free up. Any > thoughts as to why this might be happening? Sometimes it doesn't really > matter and 53 would be almost as good as 60, however if only 2 > processors are available and the user asks for 60, I would hate for him > to go in. > >>>> > >>>> > >>>> > >>>> Thank you for your time! 
> >>>> -Lance
From R.M.Krug at gmail.com Wed Feb 15 06:30:41 2012
From: R.M.Krug at gmail.com (Rainer M Krug)
Date: Wed, 15 Feb 2012 14:30:41 +0100
Subject: [torqueusers] Showing feature properties of all nodes?
Message-ID:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi

possibly simple question: How can I see the feature properties of
all nodes? I am just a normal user.

Thanks,

Rainer
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk87s4EACgkQoYgNqgF2egoT0wCZAZlQ9P7x9+6w4CrnXHmJADTf
GGwAn0LoVoqhQKxLZ3mtATmUK4JGScyt
=8nVF
-----END PGP SIGNATURE-----
From heine at sun.ac.za Wed Feb 15 07:26:24 2012
From: heine at sun.ac.za (De Jager, Heine )
Date: Wed, 15 Feb 2012 16:26:24 +0200
Subject: [torqueusers] Showing feature properties of all nodes?
In-Reply-To:
References:
Message-ID: <194209F3BDF55446915DBA2900CDE2229C016326B9@STBEVS07.stb.sun.ac.za>

Rainer,

You can see most of the node properties using the qnodes command. Including the defined features.

Regards,
Heine
________________________________________
From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] On Behalf Of Rainer M Krug [R.M.Krug at gmail.com]
Sent: Wednesday, February 15, 2012 3:30 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] Showing feature properties of all nodes?

Hi

possibly simple question: How can I see the feature properties of
all nodes? I am just a normal user.

Thanks,

Rainer

E-mail disclaimer
This e-mail may contain confidential information and may be legally privileged and is intended only for the person to whom it is addressed. If you are not the intended recipient, you are notified that you may not use, distribute or copy this document in any manner whatsoever. Kindly also notify the sender immediately by telephone, and delete the e-mail. The University does not accept liability for any damage, loss or expense arising from this e-mail and/or accessing any files attached to this e-mail.

From r.m.krug at gmail.com Wed Feb 15 07:33:02 2012
From: r.m.krug at gmail.com (Rainer M Krug)
Date: Wed, 15 Feb 2012 15:33:02 +0100
Subject: [torqueusers] Showing feature properties of all nodes?
In-Reply-To: <194209F3BDF55446915DBA2900CDE2229C016326B9@STBEVS07.stb.sun.ac.za>
References: <194209F3BDF55446915DBA2900CDE2229C016326B9@STBEVS07.stb.sun.ac.za>
Message-ID: <4F3BC21E.6060200@gmail.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thanks Heine

tried it before but must have overlooked the feature property.

Cheers,

Rainer


On 15/02/12 15:26, De Jager, Heine wrote:
> Rainer,
>
> You can see most of the node properties using the qnodes command.
> Including the defined features.
>
> Regards, Heine ________________________________________ From:
> torqueusers-bounces at supercluster.org
> [torqueusers-bounces at supercluster.org] On Behalf Of Rainer M Krug
> [R.M.Krug at gmail.com] Sent: Wednesday, February 15, 2012 3:30 PM To:
> torqueusers at supercluster.org Subject: [torqueusers] Showing feature
> properties of all nodes?
>
> Hi
>
> possibly simple question: How can I see the feature properties of
> all nodes? I am just a normal user.
>
> Thanks,
>
> Rainer
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk87wh4ACgkQoYgNqgF2egoMtwCfUCfo9e1id0dQkJ/9/DOk48tU
qG0AoIfk/M31k4lej9QMD//sN4Xts26x
=q9Ml
-----END PGP SIGNATURE-----
From sm4082 at nyu.edu Wed Feb 15 07:40:05 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Wed, 15 Feb 2012 09:40:05 -0500
Subject: [torqueusers] Showing feature properties of all nodes?
In-Reply-To: <4F3BC21E.6060200@gmail.com>
References: <194209F3BDF55446915DBA2900CDE2229C016326B9@STBEVS07.stb.sun.ac.za>
	<4F3BC21E.6060200@gmail.com>
Message-ID: <7E2A8C4F-8C07-4F86-B69B-AFFC3884F5CE@nyu.edu>

Hi Rainer,

Apart from qnodes the file /server_priv/nodes shows all the features of all the nodes on the cluster. But only admin can see this.

Sreedhar.

--
Sent from my phone. Please excuse my brevity and any typos.

On Feb 15, 2012, at 9:33, Rainer M Krug wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Thanks Heine
>
> tried it before but must have overlooked the feature property.
>
> Cheers,
>
> Rainer
>
>
> On 15/02/12 15:26, De Jager, Heine wrote:
>> Rainer,
>>
>> You can see most of the node properties using the qnodes command.
>> Including the defined features.
>> >> Regards, Heine ________________________________________ From: >> torqueusers-bounces at supercluster.org >> [torqueusers-bounces at supercluster.org] On Behalf Of Rainer M Krug >> [R.M.Krug at gmail.com] Sent: Wednesday, February 15, 2012 3:30 PM To: >> torqueusers at supercluster.org Subject: [torqueusers] Showing feature >> properties of all nodes? >> >> Hi >> >> possibly simple question: How can I see the feature properties of >> all nodes? I an just a normal user. >> >> Thanks, >> >> Rainer >> >> _______________________________________________ torqueusers mailing >> list torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> E-pos vrywaringsklousule >> >> Hierdie e-pos mag vertroulike inligting bevat en mag regtens >> geprivilegeerd wees en is slegs bedoel vir die persoon aan wie dit >> geadresseer is. Indien u nie die bedoelde ontvanger is nie, word u >> hiermee in kennis gestel dat u hierdie dokument geensins mag >> gebruik, versprei of kopieer nie. Stel ook asseblief die sender >> onmiddellik per telefoon in kennis en vee die e-pos uit. Die >> Universiteit aanvaar nie aanspreeklikheid vir enige skade, verlies >> of uitgawe wat voortspruit uit hierdie e-pos en/of die oopmaak van >> enige l?ers aangeheg by hierdie e-pos nie. >> >> E-mail disclaimer >> >> This e-mail may contain confidential information and may be legally >> privileged and is intended only for the person to whom it is >> addressed. If you are not the intended recipient, you are notified >> that you may not use, distribute or copy this document in any >> manner whatsoever. Kindly also notify the sender immediately by >> telephone, and delete the e-mail. The University does not accept >> liability for any damage, loss or expense arising from this e-mail >> and/or accessing any files attached to this e-mail. 
From vner75 at gmail.com Wed Feb 15 07:50:01 2012
From: vner75 at gmail.com (Vahe nr)
Date: Wed, 15 Feb 2012 18:50:01 +0400
Subject: [torqueusers] Job execution problem
Message-ID:

Dear all,

I have a strange problem when submitting a job on a Linux cluster, so I will give some details to help identify the problem:

The Torque version on my controller node is: torque-server-2.5.7-7.el5.x86_64
The Torque version on my compute nodes is: torque-client-2.5.7-7.el5.x86_64

There is passwordless ssh access, and no iptables on either side.
pbsnodes -a shows all available nodes in the free state.

This is the relevant part of the error in server_log:

02/15/2012 11:34:19;0008;PBS_Server;Job;220.ce.seua-cluster.grid.am;send of job to wn1.seua-cluster.grid.am failed error = 15002
02/15/2012 11:34:19;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Undefined attribute (15002) in send_job, child failed in previous commit request for job 220.ce.seua-cluster.grid.am
02/15/2012 11:34:19;0008;PBS_Server;Job;220.ce.seua-cluster.grid.am;unable to run job, MOM rejected/rc=1
02/15/2012 11:34:19;0080;PBS_Server;Req;req_reject;Reject reply code=15043(Execution server rejected request MSG=cannot send job to mom, state=PRERUN), aux=0, type=RunJob, from root at ce.seua-cluster.grid.am
02/15/2012 11:34:19;0040;PBS_Server;Svr;ce.seua-cluster.grid.am;Scheduler was sent the command new

This is the output of the checkjob command:

checkjob 220

checking job 220

State: Idle
WallTime: 00:00:00 of 00:01:00
SubmitTime: Wed Feb 15 09:58:51
(Time Queued Total: 4:49:09 Eligible: 00:00:00)

Total Tasks: 1

Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 25
PartitionMask: [ALL]
Flags: RESTARTABLE
Holds: Batch
Messages: cannot start job - RM failure, rc: 15043, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN'
PE: 1.00 StartPriority: 194
cannot select job 220 for partition DEFAULT (job hold active)

I would appreciate any help.
Thanks in advance.

Regards
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120215/9f5a5c1b/attachment-0001.html From christina.salls at noaa.gov Wed Feb 15 08:02:19 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Wed, 15 Feb 2012 10:02:19 -0500 Subject: [torqueusers] Basic torque config In-Reply-To: References: <1329233819.78038.YahooMailClassic@web111303.mail.gq1.yahoo.com> <789FA99A-FFE5-488D-A8F7-0E7957477D91@ldeo.columbia.edu> Message-ID: On Tue, Feb 14, 2012 at 1:43 PM, Gustavo Correa wrote: > > On Feb 14, 2012, at 1:28 PM, Christina Salls wrote: > > > > > > > On Tue, Feb 14, 2012 at 1:24 PM, Gustavo Correa > wrote: > > Make sure pbs_sched [Xor alternatively maui, if you installed it] is > running. > > > > Thanks for the response. > > > > It appears to be running. > > > > [root at wings etc]# ps -ef | grep pbs > > root 6896 6509 0 12:25 pts/24 00:00:00 grep pbs > > root 12576 1 0 Feb10 ? 00:00:00 pbs_sched > > root 25810 1 0 Feb10 ? 00:00:26 pbs_server -H > admin.default.domain > > > > > > Also, as root, on the pbs_server computer, enable scheduling: > > qmgr -c 'set server scheduling=True' > > > > And it appears that server scheduling is already set for True > > > > [root at wings etc]# qmgr > > Max open servers: 10239 > > Qmgr: print server > > # > > # Create queues and set their attributes. > > # > > # > > # Create and define queue batch > > # > > create queue batch > > set queue batch queue_type = Execution > > set queue batch Priority = 100 > > set queue batch max_running = 300 > > set queue batch enabled = True > > set queue batch started = True > > # > > # Set server attributes. 
> > # > > set server scheduling = True > > set server acl_hosts = admin.default.domain > > set server acl_hosts += wings.glerl.noaa.gov > > set server default_queue = batch > > set server log_events = 511 > > set server mail_from = adm > > set server scheduler_iteration = 600 > > set server node_check_rate = 150 > > set server tcp_timeout = 6 > > set server mom_job_sync = True > > set server keep_completed = 300 > > set server next_job_number = 8 > > > > If you made changes in the nodes file, etc, restart the server, etc, just > in case: > service pbs_server restart > service pbs_sched restart > service pbs_mom restart [this one on the compute nodes] > > I restarted the whole cluster after I put the scripts in /etc/init.d, to make sure everything came back up. > Then check the pbs_server logs [$TORQUE/server_logs] > This is what the server log looks like when I submit a job: 02/14/2012 15:11:28;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0 02/14/2012 15:16:28;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0 02/14/2012 15:18:39;0100;PBS_Server;Job;8.admin.default.domain;enqueuing into batch, state 1 hop 1 02/14/2012 15:18:39;0008;PBS_Server;Job;8.admin.default.domain;Job Queued at request of salls at admin.default.domain, owner = salls at admi n.default.domain, job name = STDIN, queue = batch 02/14/2012 15:21:28;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0 02/14/2012 15:26:28;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0 02/14/2012 15:31:28;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.5.9, loglevel = 0 and the system logs in the computer where > pbs_server runs [/var/log/messages]. > Good idea. This is what was happening about the time I submitted the job: Feb 14 15:10:00 n004 smartd[4177]: smartd has fork()ed into background mode. New PID=4177. 
Feb 14 15:17:51 wings xinetd[3137]: EXIT: tftp status=0 pid=5898 duration=903(sec) Feb 14 15:26:48 wings avahi-daemon[2566]: Invalid query packet. Feb 14 15:26:48 wings avahi-daemon[2566]: Invalid query packet. Feb 14 15:26:48 wings avahi-daemon[2566]: Invalid query packet. > There may be messages in either one with hints about the actual problem. > > > By the way, what is the best way to get both the server and scheduler to > start at run time? > > > > It depends on your OS and Linux distribution. > Normally you put the pbs_sched and pbs_server scripts in /etc/init.d > [they come in the Torque 'contrib' directory, I think, but if you > installed from RPMs or > other packages they may already be there]. > On the compute nodes you put pbs_mom there. > If your pbs_server computer will also be used as a compute node, add > pbs_mom there too. > Then schedule them to start at init/boot time with chkconfig [which the > Fedora folks > bundled now into something called systemctl, in case you use Fedora]. > Thanks! I found the scripts and copied them to /etc/init.d and used chkconfig to turn them on. I am running RHEL 6.2. > > I hope it helps, > Gus Correa > > > > I hope this helps, > > Gus Correa > > > > On Feb 14, 2012, at 10:36 AM, Grigory Shamov wrote: > > > > > Do you have a scheduler installed? Like, Maui, Moab? > > > > > > > > > > > > > > > --- On Tue, 2/14/12, Christina Salls wrote: > > > > > > From: Christina Salls > > > Subject: [torqueusers] Basic torque config > > > To: "Torque Users Mailing List" , > "Brian Beagan" , "John Cardenas" , > "Jeff Hanson" , "Michael Saxon" , "help > >> GLERL IT Help" , keenandr at msu.edu > > > Date: Tuesday, February 14, 2012, 6:36 AM > > > > > > Hi all, > > > > > > I finally made some progress but am not all the way there yet. > I changed the hostname of the server to admin, which is the hostname > assigned to the interface that the compute nodes are physically connected > to. Now my pbsnodes command shows the nodes as free!! 
> > > > > > [root at wings torque]# pbsnodes -a > > > n001.default.domain > > > state = free > > > np = 1 > > > ntype = cluster > > > status = > rectime=1328910309,varattr=,jobs=,state=free,netload=700143,gres=,loadave=0.02,ncpus=24,physmem=20463136kb,availmem=27835692kb,totmem=28655128kb,idletime=1502,nusers=0,nsessions=? > 0,sessions=? 0,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May > 10 15:42:40 EDT 2011 x86_64,opsys=linux > > > gpus = 0 > > > > > > n002.default.domain > > > state = free > > > np = 1 > > > ntype = cluster > > > status = > rectime=1328910310,varattr=,jobs=,state=free,netload=712138,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31894548kb,totmem=32792076kb,idletime=1510,nusers=0,nsessions=? > 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May > 10 15:42:40 EDT 2011 x86_64,opsys=linux > > > gpus = 0 > > > > > > ....For all 20 nodes. > > > > > > And now when I submit a job, I get a job id back, however, the jobs > stays in the queue state. 
> > > > > > -bash-4.1$ ./example_submit_script_1 > > > Fri Feb 10 15:46:35 CST 2012 > > > Fri Feb 10 15:46:45 CST 2012 > > > -bash-4.1$ ./example_submit_script_1 | qsub > > > 6.admin.default.domain > > > -bash-4.1$ qstat > > > Job id Name User Time Use S > Queue > > > ------------------------- ---------------- --------------- -------- - > ----- > > > 4.wings STDIN salls 0 Q > batch > > > 5.wings STDIN salls 0 Q > batch > > > 6.admin STDIN salls 0 Q > batch > > > > > > I deleted the two jobs that were created when wings was the server in > case they were getting in the way > > > > > > [root at wings torque]# qstat > > > Job id Name User Time Use S > Queue > > > ------------------------- ---------------- --------------- -------- - > ----- > > > 6.admin STDIN salls 0 Q > batch > > > [root at wings torque]# qstat -a > > > > > > admin.default.domain: > > > > Req'd Req'd Elap > > > Job ID Username Queue Jobname SessID NDS > TSK Memory Time S Time > > > -------------------- -------- -------- ---------------- ------ ----- > --- ------ ----- - ----- > > > 6.admin.default. 
salls batch STDIN -- -- > -- -- -- Q -- > > > [root at wings torque]# > > > > > > > > > I don't see anything that seems significant in the logs: > > > > > > Lots of entries like this in the server log: > > > 02/14/2012 08:05:10;0002;PBS_Server;Svr;PBS_Server;Torque Server > Version = 2.5.9, loglevel = 0 > > > 02/14/2012 08:10:10;0002;PBS_Server;Svr;PBS_Server;Torque Server > Version = 2.5.9, loglevel = 0 > > > 02/14/2012 08:15:10;0002;PBS_Server;Svr;PBS_Server;Torque Server > Version = 2.5.9, loglevel = 0 > > > > > > This is the entirety of the sched_log: > > > > > > 02/10/2012 07:06:52;0002; pbs_sched;Svr;Log;Log opened > > > 02/10/2012 07:06:52;0002; pbs_sched;Svr;TokenAct;Account file > /var/spool/torque/sched_priv/accounting/20120210 opened > > > 02/10/2012 07:06:52;0002; pbs_sched;Svr;main;pbs_sched startup pid > 12576 > > > 02/10/2012 07:09:14;0080; pbs_sched;Svr;main;brk point 6848512 > > > 02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log opened > > > 02/10/2012 15:45:04;0001; pbs_sched;Svr;pbs_sched;LOG_ERROR::Address > already in use (98) in main, bind > > > 02/10/2012 15:45:04;0002; pbs_sched;Svr;die;abnormal termination > > > 02/10/2012 15:45:04;0002; pbs_sched;Svr;Log;Log closed > > > > > > mom logs on the compute nodes have the same multiple entries: > > > > > > 02/14/2012 08:03:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = > 2.5.9, loglevel = 0 > > > 02/14/2012 08:08:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = > 2.5.9, loglevel = 0 > > > 02/14/2012 08:13:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = > 2.5.9, loglevel = 0 > > > 02/14/2012 08:18:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = > 2.5.9, loglevel = 0 > > > 02/14/2012 08:23:00;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = > 2.5.9, loglevel = 0 > > > > > > ps looks like this: > > > > > > -bash-4.1$ ps -ef | grep pbs > > > root 12576 1 0 Feb10 ? 00:00:00 pbs_sched > > > salls 12727 26862 0 08:19 pts/0 00:00:00 grep pbs > > > root 25810 1 0 Feb10 ? 
00:00:25 pbs_server -H > admin.default.domain > > > > > > The server and queue settings are as follows: > > > > > > Qmgr: list server > > > Server admin.default.domain > > > server_state = Active > > > scheduling = True > > > total_jobs = 1 > > > state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 > Exiting:0 > > > acl_hosts = admin.default.domain,wings.glerl.noaa.gov > > > default_queue = batch > > > log_events = 511 > > > mail_from = adm > > > scheduler_iteration = 600 > > > node_check_rate = 150 > > > tcp_timeout = 6 > > > mom_job_sync = True > > > pbs_version = 2.5.9 > > > keep_completed = 300 > > > next_job_number = 7 > > > net_counter = 1 0 0 > > > > > > Qmgr: list queue batch > > > Queue batch > > > queue_type = Execution > > > Priority = 100 > > > total_jobs = 1 > > > state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 > Exiting:0 > > > max_running = 300 > > > mtime = Thu Feb 9 18:22:33 2012 > > > enabled = True > > > started = True > > > > > > Do I need to create a routing queue? It seems like I am missing a > basic element here. > > > > > > Thanks in advance, > > > > > > Christina > > > > > > > > > > > > -- > > > Christina A. Salls > > > GLERL Computer Group > > > help.glerl at noaa.gov > > > Help Desk x2127 > > > Christina.Salls at noaa.gov > > > Voice Mail 734-741-2446 > > > > > > > > > > > > -----Inline Attachment Follows----- > > > > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > > > -- > > Christina A. 
Salls > > GLERL Computer Group > > help.glerl at noaa.gov > > Help Desk x2127 > > Christina.Salls at noaa.gov > > Voice Mail 734-741-2446 > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120215/e539fde0/attachment-0001.html From r.m.krug at gmail.com Wed Feb 15 09:38:26 2012 From: r.m.krug at gmail.com (Rainer M Krug) Date: Wed, 15 Feb 2012 17:38:26 +0100 Subject: [torqueusers] Specifying nodes which can be used in array job In-Reply-To: <4F38CBB7.8000107@gmail.com> References: <4F34CC19.6070805@gmail.com> <35ADCB45-7995-488E-A336-515F89ED63BD@nyu.edu> <4F38CBB7.8000107@gmail.com> Message-ID: <4F3BDF82.2060102@gmail.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 OK - the feature is added, but I don't get it to only use the nodes with this feature set. I have the following simple script named test.sub: #PBS -j oe #PBS -m abe #PBS -M Rainer at krugs.de #PBS -V #PBS -t 1-10 #PBS -l feature='Rnodeff' echo "---------------------" and I submit it via qsub test.sub and it always runs, whatever I put into the feature line, even if the feature property "Rnodeff" does not exist. It always uses all nodes, even when I use "Rnode" which is set only for a subset of nodes. I even setup torque on my desktop and set the features to try it. Any ideas? Rainer qsub On 13/02/12 09:37, Rainer M Krug wrote: > Thanks a lot - this definitely helps. 
I will get in contact with > our admin to add the features to the nodes. > > Cheers, > > Rainer > > On 12/02/12 16:53, Sreedhar Manchu wrote: >> Hi Rainer, > >> Like Ken wrote it is possible with feature property. I use this >> feature heavily to place jobs on specific nodes. > >> To add feature to nodes > >> for i in {0..5}; do qmgr -c "set node node0$i properties += >> arrays"; done > >> Here feature is arrays. You can replace that with whatever you >> like. > >> Once you've done this you can get array jobs placed on these >> nodes by requesting this feature in qsub such as > >>>>> qsub the_script.sub -t 1-10 -l feature='arrays' > >> This would put your jobs on the nodes that have property arrays. >> In this case the nodes are 0 to 5. > >> In my case I wrote a qsub wrapper which goes through the pbs >> scripts and command line and adds this feature line such as #PBS >> -l feature= to the script so that they are placed >> on right nodes. This comes very handy especially when you have >> nodes with diiferent amounts of memory under the same queue. > >> If your scheduler is moab you can do really cool stuff using this >> feature property. > >> Hope this helps. > >> Sreedhar. > > > >> On 10-Feb-2012, at 2:49 AM, Rainer M Krug > > wrote: > >> On 09/02/12 23:39, Ken Nielson wrote: >>>>> >>>>> >>>>> ----- Original Message ----- >>>>>> From: "Rainer M Krug" >>>>> > To: >>>>>> torqueusers at supercluster.org >>>>>> Sent: Thursday, >>>>>> February 9, 2012 2:16:07 AM Subject: [torqueusers] >>>>>> Specifying nodes which can be used in array job >>>>>> >>>>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >>>>>> >>>>>> Hi >>>>>> >>>>>> assuming I have cluster of 10 nodes (node01, ... >>>>>> node10), of which I am not the administrator. >>>>>> >>>>>> Some nodes are setup slightly different, so that a >>>>>> certain job only runs on nodes node01 to node05. 
>>>>>> >>>>>> So I would like to submit an array job and specify "only >>>>>> use the node01, node02, node03, node04 or node05 to run >>>>>> the each individual job". >>>>>> >>>>>> How can I do that? I know that I can use -l to specify >>>>>> resource requirements, but if I specify nodes=..., >>>>>> *each* job will allocate *all* nodes for the job, which >>>>>> is not what I want - each individual job should run on >>>>>> one of the nodes. >>>>>> >>>>>> so: >>>>>> >>>>>> qsub the_script.sub -t 1-10 >>>>>> >>>>>> and how do I specify the nodes? >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Rainer >>>>> >>>>> Rainer, >>>>> >>>>> Are there feature (properties) in the nodes files of those >>>>> hosts which would allow you to specify a feature on the >>>>> qsub line? > >> No - unfortunately not. > >>>>> >>>>> Ken > >>> >>> _______________________________________________ torqueusers >>> mailing list torqueusers at supercluster.org >>> >>> http://www.supercluster.org/mailman/listinfo/torqueusers > > >> _______________________________________________ torqueusers >> mailing list torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk8734IACgkQoYgNqgF2egoY3QCfVqXzOLwjZ+DLqgVQdMU0eLjr TVwAn0dQs8G0y+JTIc+lRtP/0oT7GCyo =vpZF -----END PGP SIGNATURE-----
From gus at ldeo.columbia.edu Wed Feb 15 09:42:48 2012 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Wed, 15 Feb 2012 11:42:48 -0500 Subject: [torqueusers] Showing feature properties of all nodes? In-Reply-To: References: Message-ID: <1752669A-7C03-4C16-8D62-295FA3C68136@ldeo.columbia.edu> On Feb 15, 2012, at 8:30 AM, Rainer M Krug wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi > > possibly simple question: How can I see the feature properties of all > nodes? I an just a normal user.
> > Thanks, > > Rainer pbsnodes > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk87s4EACgkQoYgNqgF2egoT0wCZAZlQ9P7x9+6w4CrnXHmJADTf > GGwAn0LoVoqhQKxLZ3mtATmUK4JGScyt > =8nVF > -----END PGP SIGNATURE----- > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From vner75 at gmail.com Wed Feb 15 09:49:19 2012 From: vner75 at gmail.com (Vahe nr) Date: Wed, 15 Feb 2012 20:49:19 +0400 Subject: [torqueusers] Fwd: Job execution problem In-Reply-To: <20120215164418.GD1933@atlas.science-computing.de> References: <20120215150410.GB27492@atuin.science-computing.de> <20120215164418.GD1933@atlas.science-computing.de> Message-ID: ---------- Forwarded message ---------- From: Michael Arndt Date: Wed, Feb 15, 2012 at 8:44 PM Subject: Re: [torqueusers] Job execution problem To: Vahe nr Hi Vahe you need to resend you last mail separately to the list by pressing reply you answered via PM to me, since i answered you intentionally of list. 
As far as your messages suggest, this is not a problem related to your job only. The messages suggest that pbs_mom, the daemon that runs your job on the exec node, does not talk "well" with the PBS server's master processes. The best way to resolve the problem is: read the mom logs on the exec node; read the sched and server logs on the master; check the connection with pbs_iff. It is a PBS config issue, not a job problem. Micha On Wed, Feb 15, 2012 at 08:31:34PM +0400, Vahe nr wrote: > Dear all > The job is always remains on Q state, when I am trying to run it with qrun > command I am getting the following error: > qrun: Execution server rejected request MSG=cannot send job to mom, > state=PRERUN 220.ce.seua-cluster.grid.am > > Cheers > > On Wed, Feb 15, 2012 at 8:03 PM, Vahe nr wrote: > > > Hi Michael > > The PBS has the same version on node and master, and the host name is > > right. I will try to use pbs_iff and let see what I will explore! > > > > Cheers > > > > On Wed, Feb 15, 2012 at 7:54 PM, Vahe nr wrote: > > > >> Hi Michael > >> *Thanks for your replay, I will check what you have suggested and let > >> you know, I hope it will help.* > >> * > >> * > >> *Cheers* > >> > >> On Wed, Feb 15, 2012 at 7:04 PM, Michael Arndt < > >> m.arndt at science-computing.de> wrote: > >> > >>> Hello Vahe, > >>> > >>> offlist: > >>> > >>> -checks: > >>> > >>> -is the PBS Version really the same on the nodes / exec hosts like for > >>> the pbs master > >>> > >>> -is the name shown for a node with pnsnodes node the same that is > >>> shown by an ssh node from your pbsmaster > >>> ( in other words: is name resolution DNS / NIS / hosts whatever the same > >>> when the PBS Master ask like what the node believes for hostnames of > >>> itself and > >>> master ?
> >>> > >>> > >>> -google for pbs_iff > >>> The Hits will show you how to use pbs_iff to test the connectivity from > >>> node to master > >>> > >>> last but not least the PBS Sched_logs on Server and Mom Logs on exec host > >>> will show info aboit the problem > >>> > >>> > >>> Micha > >>> > >>> -- > >>> Vorstand/Board of Management: > >>> Dr. Bernd Finkbeiner, Michael Heinrichs, > >>> Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech > >>> Vorsitzender des Aufsichtsrats/ > >>> Chairman of the Supervisory Board: > >>> Philippe Miltin > >>> Sitz/Registered Office: Tuebingen > >>> Registergericht/Registration Court: Stuttgart > >>> Registernummer/Commercial Register No.: HRB 382196 > >>> > >>> > >> > > -- Vorstand/Board of Management: Dr. Bernd Finkbeiner, Michael Heinrichs, Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech Vorsitzender des Aufsichtsrats/ Chairman of the Supervisory Board: Philippe Miltin Sitz/Registered Office: Tuebingen Registergericht/Registration Court: Stuttgart Registernummer/Commercial Register No.: HRB 382196 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120215/31a19a4d/attachment-0001.html From knielson at adaptivecomputing.com Wed Feb 15 09:50:59 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Wed, 15 Feb 2012 09:50:59 -0700 (MST) Subject: [torqueusers] Specifying nodes which can be used in array job In-Reply-To: <4F3BDF82.2060102@gmail.com> Message-ID: <356a6ea7-6c45-44c8-9d3e-31b7ec348f7c@mail> ----- Original Message ----- > From: "Rainer M Krug" > To: torqueusers at supercluster.org > Sent: Wednesday, February 15, 2012 9:38:26 AM > Subject: Re: [torqueusers] Specifying nodes which can be used in array job > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > OK - the feature is added, but I don't get it to only use the nodes > with this feature set. 
> > I have the following simple script named test.sub: > > > #PBS -j oe > #PBS -m abe > #PBS -M Rainer at krugs.de > #PBS -V > #PBS -t 1-10 > #PBS -l feature='Rnodeff' feature is not a recognized keyword in TORQUE. One way (there may be others) to get the feature you want is to use the following syntax: -l nodes=2:Rnodeff Ken From sm4082 at nyu.edu Wed Feb 15 09:53:37 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Wed, 15 Feb 2012 11:53:37 -0500 Subject: [torqueusers] Specifying nodes which can be used in array job In-Reply-To: <356a6ea7-6c45-44c8-9d3e-31b7ec348f7c@mail> References: <356a6ea7-6c45-44c8-9d3e-31b7ec348f7c@mail> Message-ID: My bad. We use Moab and it works perfectly with -l feature="feature name". I guess this way is only for Moab. Thanks, Ken for the clarification. Sreedhar. On Feb 15, 2012, at 11:50 AM, Ken Nielson wrote: > ----- Original Message ----- >> From: "Rainer M Krug" >> To: torqueusers at supercluster.org >> Sent: Wednesday, February 15, 2012 9:38:26 AM >> Subject: Re: [torqueusers] Specifying nodes which can be used in array job >> >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> OK - the feature is added, but I don't get it to only use the nodes >> with this feature set. >> >> I have the following simple script named test.sub: >> >> >> #PBS -j oe >> #PBS -m abe >> #PBS -M Rainer at krugs.de >> #PBS -V >> #PBS -t 1-10 >> #PBS -l feature='Rnodeff' > > feature is not a recognized keyword in TORQUE. 
One way (there may be others) to get the feature you want is to use the following syntax: > -l nodes=2:Rnodeff > > Ken > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From r.m.krug at gmail.com Wed Feb 15 10:22:39 2012 From: r.m.krug at gmail.com (Rainer M Krug) Date: Wed, 15 Feb 2012 18:22:39 +0100 Subject: [torqueusers] Specifying nodes which can be used in array job In-Reply-To: <356a6ea7-6c45-44c8-9d3e-31b7ec348f7c@mail> References: <4F3BDF82.2060102@gmail.com> <356a6ea7-6c45-44c8-9d3e-31b7ec348f7c@mail> Message-ID: <4F3BE9DF.2010808@gmail.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 15/02/12 17:50, Ken Nielson wrote: > ----- Original Message ----- >> From: "Rainer M Krug" To: >> torqueusers at supercluster.org Sent: Wednesday, February 15, 2012 >> 9:38:26 AM Subject: Re: [torqueusers] Specifying nodes which can >> be used in array job >> >> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >> >> OK - the feature is added, but I don't get it to only use the >> nodes with this feature set. >> >> I have the following simple script named test.sub: >> >> >> #PBS -j oe #PBS -m abe #PBS -M Rainer at krugs.de #PBS -V #PBS -t >> 1-10 #PBS -l feature='Rnodeff' > > feature is not a recognized keyword in TORQUE. One way (there may > be others) to get the feature you want is to use the following > syntax: -l nodes=2:Rnodeff Thanks a lot Ken - it is working now. I actually thought we have Moab as scheduler. 
Cheers, Rainer > > Ken -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk876d4ACgkQoYgNqgF2egpu6wCeMkmg2/7e2HqdR/Q9eXIUKGRo tdAAn3xAoLNlYx/6s7JBacsTHcLCrlD2 =mm04 -----END PGP SIGNATURE----- From gus at ldeo.columbia.edu Wed Feb 15 10:35:21 2012 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Wed, 15 Feb 2012 12:35:21 -0500 Subject: [torqueusers] Specifying nodes which can be used in array job In-Reply-To: <4F3BDF82.2060102@gmail.com> References: <4F34CC19.6070805@gmail.com> <35ADCB45-7995-488E-A336-515F89ED63BD@nyu.edu> <4F38CBB7.8000107@gmail.com> <4F3BDF82.2060102@gmail.com> Message-ID: On Feb 15, 2012, at 11:38 AM, Rainer M Krug wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > OK - the feature is added, but I don't get it to only use the nodes > with this feature set. > > I have the following simple script named test.sub: > > > #PBS -j oe > #PBS -m abe > #PBS -M Rainer at krugs.de > #PBS -V > #PBS -t 1-10 > #PBS -l feature='Rnodeff' > echo "---------------------" > > > and I submit it via > > qsub test.sub > > and it always runs, whatever I put into the feature line, even if the > feature property "Rnodeff" does not exist. It always uses all nodes, > even when I use "Rnode" which is set only for a subset of nodes. > > > I even setup torque on my desktop and set the features to try it. > > There is one way to do what I guess you want to do [i.e. launch jobs on nodes with different properties, right?] by setting up specific queues for each node property. You may need maui or moab for this, I am not sure if it works with pbs_sched. Say, create a queue qRnodeff the usual way. Then add something like this: qmgr -c 'set queue qRnodeff resources_default.neednodes = Rnodeff' where Rnodeff is a property of some of your nodes listed in the $Torque/pbs_server/nodes file. On your maui.cfg add this line: ENABLEMULTIREQJOBS TRUE Restart pbs_server and maui. 
On your PBS job script just direct the job to the appropriate queue: #PBS -q qRnodeff no need for "#PBS -l Rnodeff" Anyway, for these non-default configurations, it is worth reading the Torque and Maui Admin guides: http://www.adaptivecomputing.com/resources/docs/ I hope this helps. Gus Correa > Any ideas? > > Rainer > > qsub > On 13/02/12 09:37, Rainer M Krug wrote: >> Thanks a lot - this definitely helps. I will get in contact with >> our admin to add the features to the nodes. >> >> Cheers, >> >> Rainer >> >> On 12/02/12 16:53, Sreedhar Manchu wrote: >>> Hi Rainer, >> >>> Like Ken wrote it is possible with feature property. I use this >>> feature heavily to place jobs on specific nodes. >> >>> To add feature to nodes >> >>> for i in {0..5}; do qmgr -c "set node node0$i properties += >>> arrays"; done >> >>> Here feature is arrays. You can replace that with whatever you >>> like. >> >>> Once you've done this you can get array jobs placed on these >>> nodes by requesting this feature in qsub such as >> >>>>>> qsub the_script.sub -t 1-10 -l feature='arrays' >> >>> This would put your jobs on the nodes that have property arrays. >>> In this case the nodes are 0 to 5. >> >>> In my case I wrote a qsub wrapper which goes through the pbs >>> scripts and command line and adds this feature line such as #PBS >>> -l feature= to the script so that they are placed >>> on right nodes. This comes very handy especially when you have >>> nodes with diiferent amounts of memory under the same queue. >> >>> If your scheduler is moab you can do really cool stuff using this >>> feature property. >> >>> Hope this helps. >> >>> Sreedhar. 
>> >> >> >>> On 10-Feb-2012, at 2:49 AM, Rainer M Krug >> > wrote: >> >>> On 09/02/12 23:39, Ken Nielson wrote: >>>>>> >>>>>> >>>>>> ----- Original Message ----- >>>>>>> From: "Rainer M Krug" >>>>>> > To: >>>>>>> torqueusers at supercluster.org >>>>>>> Sent: Thursday, >>>>>>> February 9, 2012 2:16:07 AM Subject: [torqueusers] >>>>>>> Specifying nodes which can be used in array job >>>>>>> >>>>>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >>>>>>> >>>>>>> Hi >>>>>>> >>>>>>> assuming I have cluster of 10 nodes (node01, ... >>>>>>> node10), of which I am not the administrator. >>>>>>> >>>>>>> Some nodes are setup slightly different, so that a >>>>>>> certain job only runs on nodes node01 to node05. >>>>>>> >>>>>>> So I would like to submit an array job and specify "only >>>>>>> use the node01, node02, node03, node04 or node05 to run >>>>>>> the each individual job". >>>>>>> >>>>>>> How can I do that? I know that I can use -l to specify >>>>>>> resource requirements, but if I specify nodes=..., >>>>>>> *each* job will allocate *all* nodes for the job, which >>>>>>> is not what I want - each individual job should run on >>>>>>> one of the nodes. >>>>>>> >>>>>>> so: >>>>>>> >>>>>>> qsub the_script.sub -t 1-10 >>>>>>> >>>>>>> and how do I specify the nodes? >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Rainer >>>>>> >>>>>> Rainer, >>>>>> >>>>>> Are there feature (properties) in the nodes files of those >>>>>> hosts which would allow you to specify a feature on the >>>>>> qsub line? >> >>> No - unfortunately not. 
>> >>>>>> >>>>>> Ken >> >>>> >>>> _______________________________________________ torqueusers >>>> mailing list torqueusers at supercluster.org >>>> >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> >>> _______________________________________________ torqueusers >>> mailing list torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk8734IACgkQoYgNqgF2egoY3QCfVqXzOLwjZ+DLqgVQdMU0eLjr > TVwAn0dQs8G0y+JTIc+lRtP/0oT7GCyo > =vpZF > -----END PGP SIGNATURE----- > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From gus at ldeo.columbia.edu Wed Feb 15 10:42:10 2012 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Wed, 15 Feb 2012 12:42:10 -0500 Subject: [torqueusers] Specifying nodes which can be used in array job In-Reply-To: References: <4F34CC19.6070805@gmail.com> <35ADCB45-7995-488E-A336-515F89ED63BD@nyu.edu> <4F38CBB7.8000107@gmail.com> <4F3BDF82.2060102@gmail.com> Message-ID: <99AB06E5-9B37-4A78-87F3-98B19229A0B2@ldeo.columbia.edu> Rainer: Somehow I receive all your postings to the Torqueusers list in duplicate, one from R.M.Krug [capitals] another from r.m.krug [lowercase] at gmail dot com. I don't know why, and it is not a big deal either. Did you register twice perhaps? Gus Correa On Feb 15, 2012, at 12:35 PM, Gustavo Correa wrote: > > On Feb 15, 2012, at 11:38 AM, Rainer M Krug wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> OK - the feature is added, but I don't get it to only use the nodes >> with this feature set. 
>> >> I have the following simple script named test.sub: >> >> >> #PBS -j oe >> #PBS -m abe >> #PBS -M Rainer at krugs.de >> #PBS -V >> #PBS -t 1-10 >> #PBS -l feature='Rnodeff' >> echo "---------------------" >> >> >> and I submit it via >> >> qsub test.sub >> >> and it always runs, whatever I put into the feature line, even if the >> feature property "Rnodeff" does not exist. It always uses all nodes, >> even when I use "Rnode" which is set only for a subset of nodes. >> >> >> I even setup torque on my desktop and set the features to try it. >> >> > > There is one way to do what I guess you want to do [i.e. launch jobs on nodes with different > properties, right?] by setting up specific queues for each node property. > You may need maui or moab for this, I am not sure if it works with pbs_sched. > > Say, create a queue qRnodeff the usual way. > Then add something like this: > > qmgr -c 'set queue qRnodeff resources_default.neednodes = Rnodeff' > > where Rnodeff is a property of some of your nodes listed in the $Torque/pbs_server/nodes file. > > On your maui.cfg add this line: > ENABLEMULTIREQJOBS TRUE > > Restart pbs_server and maui. > > On your PBS job script just direct the job to the appropriate queue: > #PBS -q qRnodeff > > no need for "#PBS -l Rnodeff" > > Anyway, for these non-default configurations, it is worth reading the > Torque and Maui Admin guides: > http://www.adaptivecomputing.com/resources/docs/ > > I hope this helps. > > Gus Correa > > >> Any ideas? >> >> Rainer >> >> qsub >> On 13/02/12 09:37, Rainer M Krug wrote: >>> Thanks a lot - this definitely helps. I will get in contact with >>> our admin to add the features to the nodes. >>> >>> Cheers, >>> >>> Rainer >>> >>> On 12/02/12 16:53, Sreedhar Manchu wrote: >>>> Hi Rainer, >>> >>>> Like Ken wrote it is possible with feature property. I use this >>>> feature heavily to place jobs on specific nodes. 
>>> >>>> To add feature to nodes >>> >>>> for i in {0..5}; do qmgr -c "set node node0$i properties += >>>> arrays"; done >>> >>>> Here feature is arrays. You can replace that with whatever you >>>> like. >>> >>>> Once you've done this you can get array jobs placed on these >>>> nodes by requesting this feature in qsub such as >>> >>>>>>> qsub the_script.sub -t 1-10 -l feature='arrays' >>> >>>> This would put your jobs on the nodes that have property arrays. >>>> In this case the nodes are 0 to 5. >>> >>>> In my case I wrote a qsub wrapper which goes through the pbs >>>> scripts and command line and adds this feature line such as #PBS >>>> -l feature= to the script so that they are placed >>>> on right nodes. This comes very handy especially when you have >>>> nodes with diiferent amounts of memory under the same queue. >>> >>>> If your scheduler is moab you can do really cool stuff using this >>>> feature property. >>> >>>> Hope this helps. >>> >>>> Sreedhar. >>> >>> >>> >>>> On 10-Feb-2012, at 2:49 AM, Rainer M Krug >>> > wrote: >>> >>>> On 09/02/12 23:39, Ken Nielson wrote: >>>>>>> >>>>>>> >>>>>>> ----- Original Message ----- >>>>>>>> From: "Rainer M Krug" >>>>>>> > To: >>>>>>>> torqueusers at supercluster.org >>>>>>>> Sent: Thursday, >>>>>>>> February 9, 2012 2:16:07 AM Subject: [torqueusers] >>>>>>>> Specifying nodes which can be used in array job >>>>>>>> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >>>>>>>> >>>>>>>> Hi >>>>>>>> >>>>>>>> assuming I have cluster of 10 nodes (node01, ... >>>>>>>> node10), of which I am not the administrator. >>>>>>>> >>>>>>>> Some nodes are setup slightly different, so that a >>>>>>>> certain job only runs on nodes node01 to node05. >>>>>>>> >>>>>>>> So I would like to submit an array job and specify "only >>>>>>>> use the node01, node02, node03, node04 or node05 to run >>>>>>>> the each individual job". >>>>>>>> >>>>>>>> How can I do that? 
I know that I can use -l to specify >>>>>>>> resource requirements, but if I specify nodes=..., >>>>>>>> *each* job will allocate *all* nodes for the job, which >>>>>>>> is not what I want - each individual job should run on >>>>>>>> one of the nodes. >>>>>>>> >>>>>>>> so: >>>>>>>> >>>>>>>> qsub the_script.sub -t 1-10 >>>>>>>> >>>>>>>> and how do I specify the nodes? >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Rainer >>>>>>> >>>>>>> Rainer, >>>>>>> >>>>>>> Are there feature (properties) in the nodes files of those >>>>>>> hosts which would allow you to specify a feature on the >>>>>>> qsub line? >>> >>>> No - unfortunately not. >>> >>>>>>> >>>>>>> Ken >>> >>>>> >>>>> _______________________________________________ torqueusers >>>>> mailing list torqueusers at supercluster.org >>>>> >>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >>> >>>> _______________________________________________ torqueusers >>>> mailing list torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1.4.11 (GNU/Linux) >> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ >> >> iEYEARECAAYFAk8734IACgkQoYgNqgF2egoY3QCfVqXzOLwjZ+DLqgVQdMU0eLjr >> TVwAn0dQs8G0y+JTIc+lRtP/0oT7GCyo >> =vpZF >> -----END PGP SIGNATURE----- >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From knielson at adaptivecomputing.com Wed Feb 15 11:06:37 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Wed, 15 Feb 2012 11:06:37 -0700 (MST) Subject: [torqueusers] New TORQUE 4.0 beta available In-Reply-To: Message-ID: Hi all, We are getting closer. 
There is a TORQUE 4.0 beta available for download at
http://www.adaptivecomputing.com/resources/downloads/torque/4.0-beta/torque-4.0.0-snap.201202151044.tar.gz

Since the February 10 beta snapshot we have the following fixes:

Added code to im_join_job_as_sister to set the pjob->ji_radix to 2 for
leaf nodes in a job radix. This is used later for IM_SPAWN_TASK to get
the correct address of an intermediate MOM node in start_process.
TRQ-768. Add the -w option to allow pbs_moms to wait until they get the
mom hierarchy file from pbs_server to send their first update. change
moab_array_compatible server parameter so it defaults to true. TRQ-761
pbs_o variables moved to upper case to support previously working
scripts. Bug TRQ-765 Change send_update_within_ten to send_update_soon
and make it update interval / 3 Creating a new branch for TORQUE 4.0.
The new branch is 4.0-fixes another fix for where pbs_mom would not
automatically set gpu mode. Commit a version of Jon's patch to fix a
segfault in the log locking messages. It was an off-by-one buffer error

Documentation for TORQUE 4.0 can be found at
http://www.adaptivecomputing.com/resources/docs/. If you find bugs in
the documentation we would like to have those reported as well.

Please download and try this version and let us know if you find any
issues.

Regards

Ken

From siegert at sfu.ca Wed Feb 15 16:10:31 2012
From: siegert at sfu.ca (Martin Siegert)
Date: Wed, 15 Feb 2012 15:10:31 -0800
Subject: [torqueusers] vmem and pvmem
Message-ID: <20120215231031.GA956@stikine.sfu.ca>

Hi,

I am struggling with the implementation of the vmem and pvmem resources
(in principle the exact same concerns apply to mem and pmem).
Let's say I set resources_default.pvmem = 1gb in qmgr.
Now a user submits a job requesting -l procs=1,vmem=2gb and the job
fails because it exceeds the pvmem default. Apparently torque treats
vmem and pvmem as two independent resources, which (in particular for
1p jobs) is not very reasonable.
Similarly, if I set resources_default.vmem, jobs that request pvmem
fail even if the specified
(amount of pvmem)*(no. of requested processors) > vmem default.

How do people deal with this issue?

As far as I can tell moab only uses pmem and pvmem, i.e., moab converts
vmem to pvmem = vmem/procct and mem to pmem = mem/procct. Correct?
Shouldn't torque do the same?

I am worried about shared memory jobs though - jobs where pvmem is not
really relevant since all processes share the same memory and vmem is
not simply the sum of the process memory usage (at least you cannot add
up amounts displayed by ps, etc.). But I do not know whether torque
handles this correctly anyway, does it?

For now I modified the torque_submitfilter posted by Gareth
http://www.clusterresources.com/pipermail/torquedev/2011-March/003479.html
(thanks Gareth!) to add a qsub option -l pvmem=... in those cases when
the user requests vmem, but this appears to be an ugly workaround.
Shouldn't there be a better way?

Cheers,
Martin

--
Martin Siegert
Head, Research Computing
Simon Fraser University
Burnaby, British Columbia

From djohnson at osc.edu Wed Feb 15 16:15:08 2012
From: djohnson at osc.edu (Doug Johnson)
Date: Wed, 15 Feb 2012 18:15:08 -0500
Subject: [torqueusers] Performance of non-GPU codes on GPU nodes reduced by nvidia-smi overhead
Message-ID:

Hi,

Has anyone noticed the overhead when enabling GPU support in torque?
The nvidia-smi process requires about 4 cpu seconds for each
invocation. When executing a non-GPU code that uses all the cores
this results in a bit of oversubscription of the cores. Since
nvidia-smi is executed every 30 seconds to collect card state this
results in a measurable decrease in performance.

As a workaround I've enabled 'persistence mode' for the card. When
not in use, the card is apparently not initialized. With persistence
mode enabled the cpu time to execute the command is reduced to ~0.02.
This will also help with the execution time of short kernels, as the
card will be ready to go.

Do other people run with persistence mode enabled? Are there any
downsides?

Doug

PS. I think if X were running this would not be an issue.

From akohlmey at cmm.chem.upenn.edu Wed Feb 15 16:22:00 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Wed, 15 Feb 2012 18:22:00 -0500
Subject: [torqueusers] Performance of non-GPU codes on GPU nodes reduced by nvidia-smi overhead
In-Reply-To:
References:
Message-ID:

On Wed, Feb 15, 2012 at 6:15 PM, Doug Johnson wrote:
> Hi,
>
> Has anyone noticed the overhead when enabling GPU support in torque?
> The nvidia-smi process requires about 4 cpu seconds for each
> invocation. When executing a non-GPU code that uses all the cores
> this results in a bit of oversubscription of the cores. Since
> nvidia-smi is executed every 30 seconds to collect card state this
> results in a measurable decrease in performance.
>
> As a workaround I've enabled 'persistence mode' for the card. When
> not in use, the card is apparently not initialized. With persistence
> mode enabled the cpu time to execute the command is reduced to ~0.02.
> This will also help with the execution time of short kernels, as the
> card will be ready to go.

this doesn't affect applications usually, since they will open a GPU
"context" and keep it open until the application ends. applications
that require less time as total runtime are pointless to run on the GPU.

> Do other people run with persistence mode enabled? Are there any
> downsides?

yes. this is the way to go. before nvidia was allowing that, i had an
nvidia-smi process doing a (very infrequent) log in the background.
this carries over other stuff, too. check out:
http://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness

it doesn't happen on desktops due to running the X server, which
also holds a GPU context.

cheers,
    axel.

>
> Doug
>
> PS.
I think if X were running this would not be an issue.

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From dbeer at adaptivecomputing.com Wed Feb 15 16:22:09 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 15 Feb 2012 16:22:09 -0700
Subject: [torqueusers] Performance of non-GPU codes on GPU nodes reduced by nvidia-smi overhead
In-Reply-To:
References:
Message-ID:

Doug,

Have you tried using the --with-nvml-include= option in configure?
This has pbs_mom use the nvidia API for these calls, and should speed
things up a bit. The path should be the path to the nvml.h file and is
usually:
/usr/local/cuda/CUDAToolsSDK/NVML/

David

On Wed, Feb 15, 2012 at 4:15 PM, Doug Johnson wrote:
> Hi,
>
> Has anyone noticed the overhead when enabling GPU support in torque?
> The nvidia-smi process requires about 4 cpu seconds for each
> invocation. When executing a non-GPU code that uses all the cores
> this results in a bit of oversubscription of the cores. Since
> nvidia-smi is executed every 30 seconds to collect card state this
> results in a measurable decrease in performance.
>
> As a workaround I've enabled 'persistence mode' for the card. When
> not in use, the card is apparently not initialized. With persistence
> mode enabled the cpu time to execute the command is reduced to ~0.02.
> This will also help with the execution time of short kernels, as the
> card will be ready to go.
>
> Do other people run with persistence mode enabled? Are there any
> downsides?
>
> Doug
>
> PS. I think if X were running this would not be an issue.
--
David Beer | Software Engineer
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120215/1b2f8ab0/attachment.html

From djohnson at osc.edu Wed Feb 15 16:54:29 2012
From: djohnson at osc.edu (Doug Johnson)
Date: Wed, 15 Feb 2012 18:54:29 -0500
Subject: [torqueusers] Performance of non-GPU codes on GPU nodes reduced by nvidia-smi overhead
In-Reply-To:
References:
Message-ID:

Hi David,

I was going to send a separate email about '--with-nvml-include' once
I had more time to look at the problem. It seems that nvml.h no
longer exists in the newer versions of the CUDA SDK. We have version
4.1.28 of both the gpucomputingsdk and cudatoolkit, there is no nvml.h
and enabling this option in torque results in failure to build. I
haven't had a chance to take a look at older versions or the release
notes for descriptions of when this changed.

Is it safe to assume that if we were able to use this code, a context
to the cards would be kept open by the mom?

Doug

At Wed, 15 Feb 2012 16:22:09 -0700,
David Beer wrote:
>
> Doug,
>
> Have you tried using the --with-nvml-include= option in configure? This has pbs_mom use the
> nvidia API for these calls, and should speed things up a bit. The path should be the path to the nvml.h
> file and is usually:
> /usr/local/cuda/CUDAToolsSDK/NVML/
>
> David
>
> On Wed, Feb 15, 2012 at 4:15 PM, Doug Johnson wrote:
>
> Hi,
>
> Has anyone noticed the overhead when enabling GPU support in torque?
> The nvidia-smi process requires about 4 cpu seconds for each
> invocation. When executing a non-GPU code that uses all the cores
> this results in a bit of oversubscription of the cores.
Since
> nvidia-smi is executed every 30 seconds to collect card state this
> results in a measurable decrease in performance.
>
> As a workaround I've enabled 'persistence mode' for the card. When
> not in use, the card is apparently not initialized. With persistence
> mode enabled the cpu time to execute the command is reduced to ~0.02.
> This will also help with the execution time of short kernels, as the
> card will be ready to go.
>
> Do other people run with persistence mode enabled? Are there any
> downsides?
>
> Doug
>
> PS. I think if X were running this would not be an issue.
>
> --
> David Beer | Software Engineer
> Adaptive Computing

From akohlmey at cmm.chem.upenn.edu Wed Feb 15 16:56:36 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Wed, 15 Feb 2012 18:56:36 -0500
Subject: [torqueusers] Performance of non-GPU codes on GPU nodes reduced by nvidia-smi overhead
In-Reply-To:
References:
Message-ID:

On Wed, Feb 15, 2012 at 6:54 PM, Doug Johnson wrote:
> Hi David,
>
> I was going to send a separate email about '--with-nvml-include' once
> I had more time to look at the problem. It seems that nvml.h no
> longer exists in the newer versions of the CUDA SDK. We have version

http://developer.nvidia.com/nvidia-management-library-NVML

axel.

> 4.1.28 of both the gpucomputingsdk and cudatoolkit, there is no nvml.h
> and enabling this option in torque results in failure to build. I
> haven't had a chance to take a look at older versions or the release
> notes for descriptions of when this changed.
>
> Is it safe to assume that if we were able to use this code, a context
> to the cards would be kept open by the mom?
>
> Doug
>
> At Wed, 15 Feb 2012 16:22:09 -0700,
> David Beer wrote:
>>
>> Doug,
>>
>> Have you tried using the --with-nvml-include= option in configure? This has pbs_mom use the
>> nvidia API for these calls, and should speed things up a bit. The path should be the path to the nvml.h
>> file and is usually:
>> /usr/local/cuda/CUDAToolsSDK/NVML/
>>
>> David
>>
>> On Wed, Feb 15, 2012 at 4:15 PM, Doug Johnson wrote:
>>
>>     Hi,
>>
>>     Has anyone noticed the overhead when enabling GPU support in torque?
>>     The nvidia-smi process requires about 4 cpu seconds for each
>>     invocation. When executing a non-GPU code that uses all the cores
>>     this results in a bit of oversubscription of the cores. Since
>>     nvidia-smi is executed every 30 seconds to collect card state this
>>     results in a measurable decrease in performance.
>>
>>     As a workaround I've enabled 'persistence mode' for the card. When
>>     not in use, the card is apparently not initialized. With persistence
>>     mode enabled the cpu time to execute the command is reduced to ~0.02.
>>     This will also help with the execution time of short kernels, as the
>>     card will be ready to go.
>>
>>     Do other people run with persistence mode enabled? Are there any
>>     downsides?
>>
>>     Doug
>>
>>     PS. I think if X were running this would not be an issue.
>>
>> --
>> David Beer | Software Engineer
>> Adaptive Computing
>>
--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From ngsbioinformatics at gmail.com Wed Feb 15 21:48:28 2012
From: ngsbioinformatics at gmail.com (Ryan Golhar)
Date: Wed, 15 Feb 2012 23:48:28 -0500
Subject: [torqueusers] request memory not followed by scheduler
Message-ID:

Hi - I have a cluster with several nodes where each node has 16 GB of
RAM. I'm submitting jobs that require 8gb of memory on whatever node
they are running on. The job script contains the directive:

#PBS -l mem=8g

I noticed that the second directive doesn't affect the way torque
distributes jobs. For instance, if all the nodes in my cluster are
taken up except for 1 node that is free, I can run the job 4 times and
instead of seeing 2 jobs running and 2 waiting in the queue, all 4 jobs
get executed on that free node.

How do I get torque to take the memory requested into account?

Ryan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120215/48424429/attachment.html

From R.M.Krug at gmail.com Thu Feb 16 02:33:01 2012
From: R.M.Krug at gmail.com (Rainer M Krug)
Date: Thu, 16 Feb 2012 10:33:01 +0100
Subject: [torqueusers] Specifying nodes which can be used in array job
In-Reply-To: <356a6ea7-6c45-44c8-9d3e-31b7ec348f7c@mail>
References: <4F3BDF82.2060102@gmail.com> <356a6ea7-6c45-44c8-9d3e-31b7ec348f7c@mail>
Message-ID:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 15/02/12 17:50, Ken Nielson wrote:
> ----- Original Message -----
>> From: "Rainer M Krug"
>> To: torqueusers at supercluster.org
>> Sent: Wednesday, February 15, 2012 9:38:26 AM
>> Subject: Re: [torqueusers] Specifying nodes which can be used in array job
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> OK - the feature is added, but I don't get it to only use the
>> nodes with this feature set.
>>
>> I have the following simple script named test.sub:
>>
>>
>> #PBS -j oe
>> #PBS -m abe
>> #PBS -M Rainer at krugs.de
>> #PBS -V
>> #PBS -t 1-10
>> #PBS -l feature='Rnodeff'
>
> feature is not a recognized keyword in TORQUE. One way (there may
> be others) to get the feature you want is to use the following
> syntax: -l nodes=2:Rnodeff

As I posted already, this is working nicely now. But could you please
explain where the "2:" comes from? I don't like to simply copy code
without understanding it.
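
For readers with the same question: in TORQUE's "-l nodes=" resource string the leading number is the count of nodes requested, and each colon-separated word after it is a property those nodes must carry, so nodes=2:Rnodeff asks for two nodes tagged Rnodeff. The sketch below only assembles and prints such strings and nothing is submitted; node_spec is a hypothetical helper invented for this illustration, not a TORQUE command, and the count of 1 for the array job is an assumption based on the original request that each sub-job run on a single node.

```shell
#!/bin/sh
# node_spec is a hypothetical helper (not part of TORQUE). It only builds
# the string "nodes=<count>:<property>" in the form used with qsub -l.
node_spec() {
    count=$1      # number of nodes the job asks for
    property=$2   # node property as listed in the server's nodes file
    printf 'nodes=%s:%s\n' "$count" "$property"
}

# Ken's example: two nodes, each carrying the Rnodeff property.
node_spec 2 Rnodeff

# Assumed variant for an array job whose sub-jobs each need one node.
# The command line is only printed here, not executed.
echo "qsub -t 1-10 -l $(node_spec 1 Rnodeff) test.sub"
```

In a job script the same request would be written as a "#PBS -l nodes=1:Rnodeff" directive in place of the "feature=" line.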
Rainer > > Ken -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk88zUwACgkQoYgNqgF2egpuwACfbRRQYID4B2UCTK5ICrexuzJn TMoAnA64onjQmO/z1hAMV8cDdVjvE+0f =16pw -----END PGP SIGNATURE----- From r.m.krug at gmail.com Thu Feb 16 02:35:49 2012 From: r.m.krug at gmail.com (Rainer M Krug) Date: Thu, 16 Feb 2012 10:35:49 +0100 Subject: [torqueusers] Specifying nodes which can be used in array job In-Reply-To: <99AB06E5-9B37-4A78-87F3-98B19229A0B2@ldeo.columbia.edu> References: <4F34CC19.6070805@gmail.com> <35ADCB45-7995-488E-A336-515F89ED63BD@nyu.edu> <4F38CBB7.8000107@gmail.com> <4F3BDF82.2060102@gmail.com> <99AB06E5-9B37-4A78-87F3-98B19229A0B2@ldeo.columbia.edu> Message-ID: <4F3CCDF5.90404@gmail.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 15/02/12 18:42, Gustavo Correa wrote: > Rainer: Somehow I receive all your postings to the Torqueusers > list in duplicate, one from R.M.Krug [capitals] another from > r.m.krug [lowercase] at gmail dot com. I don't know why, and it is > not a big deal either. Did you register twice perhaps? not that I know of - but it might be that I hit reply-to-all, which sends to the mailing list and you directly? This one is only via mailinglist - do you also get it twice? Rainer > > Gus Correa > > > On Feb 15, 2012, at 12:35 PM, Gustavo Correa wrote: > >> >> On Feb 15, 2012, at 11:38 AM, Rainer M Krug wrote: >> > OK - the feature is added, but I don't get it to only use the > nodes with this feature set. > > I have the following simple script named test.sub: > > > #PBS -j oe #PBS -m abe #PBS -M Rainer at krugs.de #PBS -V #PBS -t > 1-10 #PBS -l feature='Rnodeff' echo "---------------------" > > > and I submit it via > > qsub test.sub > > and it always runs, whatever I put into the feature line, even if > the feature property "Rnodeff" does not exist. It always uses all > nodes, even when I use "Rnode" which is set only for a subset of > nodes. 
> > > I even setup torque on my desktop and set the features to try it. > > >>> >>> There is one way to do what I guess you want to do [i.e. launch >>> jobs on nodes with different properties, right?] by setting up >>> specific queues for each node property. You may need maui or >>> moab for this, I am not sure if it works with pbs_sched. >>> >>> Say, create a queue qRnodeff the usual way. Then add something >>> like this: >>> >>> qmgr -c 'set queue qRnodeff resources_default.neednodes = >>> Rnodeff' >>> >>> where Rnodeff is a property of some of your nodes listed in the >>> $Torque/pbs_server/nodes file. >>> >>> On your maui.cfg add this line: ENABLEMULTIREQJOBS TRUE >>> >>> Restart pbs_server and maui. >>> >>> On your PBS job script just direct the job to the appropriate >>> queue: #PBS -q qRnodeff >>> >>> no need for "#PBS -l Rnodeff" >>> >>> Anyway, for these non-default configurations, it is worth >>> reading the Torque and Maui Admin guides: >>> http://www.adaptivecomputing.com/resources/docs/ >>> >>> I hope this helps. >>> >>> Gus Correa >>> >>> > Any ideas? > > Rainer > > qsub On 13/02/12 09:37, Rainer M Krug wrote: >>>>> Thanks a lot - this definitely helps. I will get in contact >>>>> with our admin to add the features to the nodes. >>>>> >>>>> Cheers, >>>>> >>>>> Rainer >>>>> >>>>> On 12/02/12 16:53, Sreedhar Manchu wrote: >>>>>> Hi Rainer, >>>>> >>>>>> Like Ken wrote it is possible with feature property. I >>>>>> use this feature heavily to place jobs on specific >>>>>> nodes. >>>>> >>>>>> To add feature to nodes >>>>> >>>>>> for i in {0..5}; do qmgr -c "set node node0$i properties >>>>>> += arrays"; done >>>>> >>>>>> Here feature is arrays. You can replace that with >>>>>> whatever you like. 
>>>>> >>>>>> Once you've done this you can get array jobs placed on >>>>>> these nodes by requesting this feature in qsub such as >>>>> >>>>>>>>> qsub the_script.sub -t 1-10 -l feature='arrays' >>>>> >>>>>> This would put your jobs on the nodes that have property >>>>>> arrays. In this case the nodes are 0 to 5. >>>>> >>>>>> In my case I wrote a qsub wrapper which goes through the >>>>>> pbs scripts and command line and adds this feature line >>>>>> such as #PBS -l feature= to the script so >>>>>> that they are placed on right nodes. This comes very >>>>>> handy especially when you have nodes with diiferent >>>>>> amounts of memory under the same queue. >>>>> >>>>>> If your scheduler is moab you can do really cool stuff >>>>>> using this feature property. >>>>> >>>>>> Hope this helps. >>>>> >>>>>> Sreedhar. >>>>> >>>>> >>>>> >>>>>> On 10-Feb-2012, at 2:49 AM, Rainer M Krug >>>>>> > wrote: >>>>> >>>>>> On 09/02/12 23:39, Ken Nielson wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> ----- Original Message ----- >>>>>>>>>> From: "Rainer M Krug" >>>>>>>>> > To: >>>>>>>>>> torqueusers at supercluster.org >>>>>>>>>> Sent: >>>>>>>>>> Thursday, February 9, 2012 2:16:07 AM Subject: >>>>>>>>>> [torqueusers] Specifying nodes which can be used >>>>>>>>>> in array job >>>>>>>>>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >>>>>>>>>> >>>>>>>>>> Hi >>>>>>>>>> >>>>>>>>>> assuming I have cluster of 10 nodes (node01, ... >>>>>>>>>> node10), of which I am not the administrator. >>>>>>>>>> >>>>>>>>>> Some nodes are setup slightly different, so that >>>>>>>>>> a certain job only runs on nodes node01 to >>>>>>>>>> node05. >>>>>>>>>> >>>>>>>>>> So I would like to submit an array job and >>>>>>>>>> specify "only use the node01, node02, node03, >>>>>>>>>> node04 or node05 to run the each individual >>>>>>>>>> job". >>>>>>>>>> >>>>>>>>>> How can I do that? 
I know that I can use -l to >>>>>>>>>> specify resource requirements, but if I specify >>>>>>>>>> nodes=..., *each* job will allocate *all* nodes >>>>>>>>>> for the job, which is not what I want - each >>>>>>>>>> individual job should run on one of the nodes. >>>>>>>>>> >>>>>>>>>> so: >>>>>>>>>> >>>>>>>>>> qsub the_script.sub -t 1-10 >>>>>>>>>> >>>>>>>>>> and how do I specify the nodes? >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Rainer >>>>>>>>> >>>>>>>>> Rainer, >>>>>>>>> >>>>>>>>> Are there feature (properties) in the nodes files >>>>>>>>> of those hosts which would allow you to specify a >>>>>>>>> feature on the qsub line? >>>>> >>>>>> No - unfortunately not. >>>>> >>>>>>>>> >>>>>>>>> Ken >>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> torqueusers mailing list torqueusers at supercluster.org >>>>>>> >>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>>> >>>>> >>>>>> >>>>>>> _______________________________________________ torqueusers >>>>>> mailing list torqueusers at supercluster.org >>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>>> >>> >>> _______________________________________________ torqueusers >>> mailing list torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> _______________________________________________ torqueusers >> mailing list torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk88zfUACgkQoYgNqgF2egpEtACgiqRtn07Qmqhsyrbfig6HESuR gwMAn2Uho2jAHn5SN/6IdHEErol7BHkJ =B1Fu -----END PGP SIGNATURE----- From R.M.Krug at gmail.com Thu Feb 16 02:35:49 2012 From: R.M.Krug at gmail.com (Rainer M Krug) Date: Thu, 16 Feb 2012 10:35:49 +0100 Subject: [torqueusers] Specifying nodes which can be used in array job In-Reply-To: 
<99AB06E5-9B37-4A78-87F3-98B19229A0B2@ldeo.columbia.edu> References: <4F34CC19.6070805@gmail.com> <35ADCB45-7995-488E-A336-515F89ED63BD@nyu.edu> <4F38CBB7.8000107@gmail.com> <4F3BDF82.2060102@gmail.com> <99AB06E5-9B37-4A78-87F3-98B19229A0B2@ldeo.columbia.edu> Message-ID: <4F3CCDF5.90404@gmail.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 15/02/12 18:42, Gustavo Correa wrote: > Rainer: Somehow I receive all your postings to the Torqueusers > list in duplicate, one from R.M.Krug [capitals] another from > r.m.krug [lowercase] at gmail dot com. I don't know why, and it is > not a big deal either. Did you register twice perhaps? not that I know of - but it might be that I hit reply-to-all, which sends to the mailing list and you directly? This one is only via mailinglist - do you also get it twice? Rainer > > Gus Correa > > > On Feb 15, 2012, at 12:35 PM, Gustavo Correa wrote: > >> >> On Feb 15, 2012, at 11:38 AM, Rainer M Krug wrote: >> > OK - the feature is added, but I don't get it to only use the > nodes with this feature set. > > I have the following simple script named test.sub: > > > #PBS -j oe #PBS -m abe #PBS -M Rainer at krugs.de #PBS -V #PBS -t > 1-10 #PBS -l feature='Rnodeff' echo "---------------------" > > > and I submit it via > > qsub test.sub > > and it always runs, whatever I put into the feature line, even if > the feature property "Rnodeff" does not exist. It always uses all > nodes, even when I use "Rnode" which is set only for a subset of > nodes. > > > I even setup torque on my desktop and set the features to try it. > > >>> >>> There is one way to do what I guess you want to do [i.e. launch >>> jobs on nodes with different properties, right?] by setting up >>> specific queues for each node property. You may need maui or >>> moab for this, I am not sure if it works with pbs_sched. >>> >>> Say, create a queue qRnodeff the usual way. 
Then add something >>> like this: >>> >>> qmgr -c 'set queue qRnodeff resources_default.neednodes = >>> Rnodeff' >>> >>> where Rnodeff is a property of some of your nodes listed in the >>> $Torque/pbs_server/nodes file. >>> >>> On your maui.cfg add this line: ENABLEMULTIREQJOBS TRUE >>> >>> Restart pbs_server and maui. >>> >>> On your PBS job script just direct the job to the appropriate >>> queue: #PBS -q qRnodeff >>> >>> no need for "#PBS -l Rnodeff" >>> >>> Anyway, for these non-default configurations, it is worth >>> reading the Torque and Maui Admin guides: >>> http://www.adaptivecomputing.com/resources/docs/ >>> >>> I hope this helps. >>> >>> Gus Correa >>> >>> > Any ideas? > > Rainer > > qsub On 13/02/12 09:37, Rainer M Krug wrote: >>>>> Thanks a lot - this definitely helps. I will get in contact >>>>> with our admin to add the features to the nodes. >>>>> >>>>> Cheers, >>>>> >>>>> Rainer >>>>> >>>>> On 12/02/12 16:53, Sreedhar Manchu wrote: >>>>>> Hi Rainer, >>>>> >>>>>> Like Ken wrote it is possible with feature property. I >>>>>> use this feature heavily to place jobs on specific >>>>>> nodes. >>>>> >>>>>> To add feature to nodes >>>>> >>>>>> for i in {0..5}; do qmgr -c "set node node0$i properties >>>>>> += arrays"; done >>>>> >>>>>> Here feature is arrays. You can replace that with >>>>>> whatever you like. >>>>> >>>>>> Once you've done this you can get array jobs placed on >>>>>> these nodes by requesting this feature in qsub such as >>>>> >>>>>>>>> qsub the_script.sub -t 1-10 -l feature='arrays' >>>>> >>>>>> This would put your jobs on the nodes that have property >>>>>> arrays. In this case the nodes are 0 to 5. >>>>> >>>>>> In my case I wrote a qsub wrapper which goes through the >>>>>> pbs scripts and command line and adds this feature line >>>>>> such as #PBS -l feature= to the script so >>>>>> that they are placed on right nodes. 
This comes very >>>>>> handy especially when you have nodes with diiferent >>>>>> amounts of memory under the same queue. >>>>> >>>>>> If your scheduler is moab you can do really cool stuff >>>>>> using this feature property. >>>>> >>>>>> Hope this helps. >>>>> >>>>>> Sreedhar. >>>>> >>>>> >>>>> >>>>>> On 10-Feb-2012, at 2:49 AM, Rainer M Krug >>>>>> > wrote: >>>>> >>>>>> On 09/02/12 23:39, Ken Nielson wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> ----- Original Message ----- >>>>>>>>>> From: "Rainer M Krug" >>>>>>>>> > To: >>>>>>>>>> torqueusers at supercluster.org >>>>>>>>>> Sent: >>>>>>>>>> Thursday, February 9, 2012 2:16:07 AM Subject: >>>>>>>>>> [torqueusers] Specifying nodes which can be used >>>>>>>>>> in array job >>>>>>>>>> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >>>>>>>>>> >>>>>>>>>> Hi >>>>>>>>>> >>>>>>>>>> assuming I have cluster of 10 nodes (node01, ... >>>>>>>>>> node10), of which I am not the administrator. >>>>>>>>>> >>>>>>>>>> Some nodes are setup slightly different, so that >>>>>>>>>> a certain job only runs on nodes node01 to >>>>>>>>>> node05. >>>>>>>>>> >>>>>>>>>> So I would like to submit an array job and >>>>>>>>>> specify "only use the node01, node02, node03, >>>>>>>>>> node04 or node05 to run the each individual >>>>>>>>>> job". >>>>>>>>>> >>>>>>>>>> How can I do that? I know that I can use -l to >>>>>>>>>> specify resource requirements, but if I specify >>>>>>>>>> nodes=..., *each* job will allocate *all* nodes >>>>>>>>>> for the job, which is not what I want - each >>>>>>>>>> individual job should run on one of the nodes. >>>>>>>>>> >>>>>>>>>> so: >>>>>>>>>> >>>>>>>>>> qsub the_script.sub -t 1-10 >>>>>>>>>> >>>>>>>>>> and how do I specify the nodes? >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Rainer >>>>>>>>> >>>>>>>>> Rainer, >>>>>>>>> >>>>>>>>> Are there feature (properties) in the nodes files >>>>>>>>> of those hosts which would allow you to specify a >>>>>>>>> feature on the qsub line? 
>>>>>
>>>>>> No - unfortunately not.
>>>>>
>>>>>>>>> Ken
>>>>>
>>>>>>> _______________________________________________
>>>>>>> torqueusers mailing list
>>>>>>> torqueusers at supercluster.org
>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk88zfUACgkQoYgNqgF2egpEtACgiqRtn07Qmqhsyrbfig6HESuR
gwMAn2Uho2jAHn5SN/6IdHEErol7BHkJ
=B1Fu
-----END PGP SIGNATURE-----

From l.flis at cyf-kr.edu.pl Thu Feb 16 02:52:38 2012
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Thu, 16 Feb 2012 10:52:38 +0100
Subject: [torqueusers] torque tmpdir on Lustre filesystem
Message-ID: <4F3CD1E6.3050309@cyf-kr.edu.pl>

Hello,

I would like to ask Lustre users whether anyone has the pbs_mom variable $tmpdir set to a directory located on a Lustre filesystem.

In our case (Lustre 2.1.0 and Lustre 1.8.6) this kind of setup results in Torque being unable to create the temporary directory. Not every attempt fails, but a significant number do:
Feb 14 13:56:02 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18555591.batch.grid
Feb 14 13:56:23 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18555647.batch.grid
Feb 14 14:37:35 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18557701.batch.grid
Feb 14 14:38:17 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in TMakeTmpDir, Unable to make job transient directory: /mnt/lustre/scratch/jobs/18557716.batch.grid

This is not a bug in Torque but a problem with Lustre itself. I'm just curious whether we're the only site in the world facing it.

Here is a bug report in the Whamcloud JIRA with more details:
http://jira.whamcloud.com/browse/LU-1101

Cheers,
--
Lukasz Flis
ACC Cyfronet AGH, Cracow, POLAND

From gus at ldeo.columbia.edu Thu Feb 16 08:39:26 2012
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Thu, 16 Feb 2012 10:39:26 -0500
Subject: [torqueusers] Specifying nodes which can be used in array job
In-Reply-To: <4F3CCDF5.90404@gmail.com>
References: <4F34CC19.6070805@gmail.com> <35ADCB45-7995-488E-A336-515F89ED63BD@nyu.edu> <4F38CBB7.8000107@gmail.com> <4F3BDF82.2060102@gmail.com> <99AB06E5-9B37-4A78-87F3-98B19229A0B2@ldeo.columbia.edu> <4F3CCDF5.90404@gmail.com>
Message-ID:

On Feb 16, 2012, at 4:35 AM, Rainer M Krug wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 15/02/12 18:42, Gustavo Correa wrote:
>> Rainer: Somehow I receive all your postings to the Torqueusers
>> list in duplicate, one from R.M.Krug [capitals] another from
>> r.m.krug [lowercase] at gmail dot com. I don't know why, and it is
>> not a big deal either. Did you register twice perhaps?
>
> not that I know of - but it might be that I hit reply-to-all, which
> sends to the mailing list and you directly?
> > This one is only via mailinglist - do you also get it twice? > > Rainer > I received it twice. Same thing with the capitals/lowercase email addresses. It doesn't bother me, just wondering if it goes twice to all list subscribers. Gus Correa >> >> Gus Correa >> >> >> On Feb 15, 2012, at 12:35 PM, Gustavo Correa wrote: >> >>> >>> On Feb 15, 2012, at 11:38 AM, Rainer M Krug wrote: >>> >> OK - the feature is added, but I don't get it to only use the >> nodes with this feature set. >> >> I have the following simple script named test.sub: >> >> >> #PBS -j oe #PBS -m abe #PBS -M Rainer at krugs.de #PBS -V #PBS -t >> 1-10 #PBS -l feature='Rnodeff' echo "---------------------" >> >> >> and I submit it via >> >> qsub test.sub >> >> and it always runs, whatever I put into the feature line, even if >> the feature property "Rnodeff" does not exist. It always uses all >> nodes, even when I use "Rnode" which is set only for a subset of >> nodes. >> >> >> I even setup torque on my desktop and set the features to try it. >> >> >>>> >>>> There is one way to do what I guess you want to do [i.e. launch >>>> jobs on nodes with different properties, right?] by setting up >>>> specific queues for each node property. You may need maui or >>>> moab for this, I am not sure if it works with pbs_sched. >>>> >>>> Say, create a queue qRnodeff the usual way. Then add something >>>> like this: >>>> >>>> qmgr -c 'set queue qRnodeff resources_default.neednodes = >>>> Rnodeff' >>>> >>>> where Rnodeff is a property of some of your nodes listed in the >>>> $Torque/pbs_server/nodes file. >>>> >>>> On your maui.cfg add this line: ENABLEMULTIREQJOBS TRUE >>>> >>>> Restart pbs_server and maui. 
>>>> >>>> On your PBS job script just direct the job to the appropriate >>>> queue: #PBS -q qRnodeff >>>> >>>> no need for "#PBS -l Rnodeff" >>>> >>>> Anyway, for these non-default configurations, it is worth >>>> reading the Torque and Maui Admin guides: >>>> http://www.adaptivecomputing.com/resources/docs/ >>>> >>>> I hope this helps. >>>> >>>> Gus Correa >>>> >>>> >> Any ideas? >> >> Rainer >> >> qsub On 13/02/12 09:37, Rainer M Krug wrote: >>>>>> Thanks a lot - this definitely helps. I will get in contact >>>>>> with our admin to add the features to the nodes. >>>>>> >>>>>> Cheers, >>>>>> >>>>>> Rainer >>>>>> >>>>>> On 12/02/12 16:53, Sreedhar Manchu wrote: >>>>>>> Hi Rainer, >>>>>> >>>>>>> Like Ken wrote it is possible with feature property. I >>>>>>> use this feature heavily to place jobs on specific >>>>>>> nodes. >>>>>> >>>>>>> To add feature to nodes >>>>>> >>>>>>> for i in {0..5}; do qmgr -c "set node node0$i properties >>>>>>> += arrays"; done >>>>>> >>>>>>> Here feature is arrays. You can replace that with >>>>>>> whatever you like. >>>>>> >>>>>>> Once you've done this you can get array jobs placed on >>>>>>> these nodes by requesting this feature in qsub such as >>>>>> >>>>>>>>>> qsub the_script.sub -t 1-10 -l feature='arrays' >>>>>> >>>>>>> This would put your jobs on the nodes that have property >>>>>>> arrays. In this case the nodes are 0 to 5. >>>>>> >>>>>>> In my case I wrote a qsub wrapper which goes through the >>>>>>> pbs scripts and command line and adds this feature line >>>>>>> such as #PBS -l feature= to the script so >>>>>>> that they are placed on right nodes. This comes very >>>>>>> handy especially when you have nodes with diiferent >>>>>>> amounts of memory under the same queue. >>>>>> >>>>>>> If your scheduler is moab you can do really cool stuff >>>>>>> using this feature property. >>>>>> >>>>>>> Hope this helps. >>>>>> >>>>>>> Sreedhar. 
>>>>>> >>>>>> >>>>>> >>>>>>> On 10-Feb-2012, at 2:49 AM, Rainer M Krug >>>>>>> > wrote: >>>>>> >>>>>>> On 09/02/12 23:39, Ken Nielson wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> ----- Original Message ----- >>>>>>>>>>> From: "Rainer M Krug" >>>>>>>>>> > To: >>>>>>>>>>> torqueusers at supercluster.org >>>>>>>>>>> Sent: >>>>>>>>>>> Thursday, February 9, 2012 2:16:07 AM Subject: >>>>>>>>>>> [torqueusers] Specifying nodes which can be used >>>>>>>>>>> in array job >>>>>>>>>>> >>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 >>>>>>>>>>> >>>>>>>>>>> Hi >>>>>>>>>>> >>>>>>>>>>> assuming I have cluster of 10 nodes (node01, ... >>>>>>>>>>> node10), of which I am not the administrator. >>>>>>>>>>> >>>>>>>>>>> Some nodes are setup slightly different, so that >>>>>>>>>>> a certain job only runs on nodes node01 to >>>>>>>>>>> node05. >>>>>>>>>>> >>>>>>>>>>> So I would like to submit an array job and >>>>>>>>>>> specify "only use the node01, node02, node03, >>>>>>>>>>> node04 or node05 to run the each individual >>>>>>>>>>> job". >>>>>>>>>>> >>>>>>>>>>> How can I do that? I know that I can use -l to >>>>>>>>>>> specify resource requirements, but if I specify >>>>>>>>>>> nodes=..., *each* job will allocate *all* nodes >>>>>>>>>>> for the job, which is not what I want - each >>>>>>>>>>> individual job should run on one of the nodes. >>>>>>>>>>> >>>>>>>>>>> so: >>>>>>>>>>> >>>>>>>>>>> qsub the_script.sub -t 1-10 >>>>>>>>>>> >>>>>>>>>>> and how do I specify the nodes? >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> >>>>>>>>>>> Rainer >>>>>>>>>> >>>>>>>>>> Rainer, >>>>>>>>>> >>>>>>>>>> Are there feature (properties) in the nodes files >>>>>>>>>> of those hosts which would allow you to specify a >>>>>>>>>> feature on the qsub line? >>>>>> >>>>>>> No - unfortunately not. 
>>>>>> >>>>>>>>>> >>>>>>>>>> Ken >>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> torqueusers mailing list torqueusers at supercluster.org >>>>>>>> >>>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>>>> >>>>>> >>>>>>> >>>>>>>> > _______________________________________________ torqueusers >>>>>>> mailing list torqueusers at supercluster.org >>>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>>>> >>>> >>>> _______________________________________________ torqueusers >>>> mailing list torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >>> _______________________________________________ torqueusers >>> mailing list torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.11 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAk88zfUACgkQoYgNqgF2egpEtACgiqRtn07Qmqhsyrbfig6HESuR > gwMAn2Uho2jAHn5SN/6IdHEErol7BHkJ > =B1Fu > -----END PGP SIGNATURE----- > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From christina.salls at noaa.gov Thu Feb 16 13:10:17 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Thu, 16 Feb 2012 15:10:17 -0500 Subject: [torqueusers] jobs stuck in queue until I force execution with qrun Message-ID: Hi all, My situation has improved but I am still not there. I can submit a job successfully, but it will stay in the queue until I force execution with qrun. eg. 
-bash-4.1$ qsub ./example_submit_script_1
22.admin.default.domain
-bash-4.1$ qstat -a

admin.default.domain:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
22.admin.default     salls    batch    ExampleJob           --     1   1     -- 00:01 Q    --

[root at wings ~]# qrun 22
[root at wings ~]# qstat -a

admin.default.domain:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
22.admin.default     salls    batch    ExampleJob        30429     1   1     -- 00:01 R    --

[root at wings ~]# qstat -a

admin.default.domain:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
22.admin.default     salls    batch    ExampleJob        30429     1   1     -- 00:01 C 00:00
[root at wings ~]#

This is what tracejob output looks like:

[root at wings ~]# tracejob 22
/var/spool/torque/mom_logs/20120216: No such file or directory
/var/spool/torque/sched_logs/20120216: No matching job records located

Job: 22.admin.default.domain

02/16/2012 13:46:51  S    enqueuing into batch, state 1 hop 1
02/16/2012 13:46:51  S    Job Queued at request of salls at admin.default.domain, owner = salls at admin.default.domain,
                          job name = ExampleJob, queue = batch
02/16/2012 13:46:51  A    queue=batch
02/16/2012 13:53:53  S    Job Run at request of root at admin.default.domain
02/16/2012 13:53:53  S    Not sending email: User does not want mail of this type.
02/16/2012 13:53:53  A    user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611
                          etime=1329421611 start=1329422033 owner=salls at admin.default.domain
                          exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=00:01:00
02/16/2012 13:54:03  S    Not sending email: User does not want mail of this type.
02/16/2012 13:54:03  S    Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
                          resources_used.walltime=00:00:10
02/16/2012 13:54:03  A    user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611
                          etime=1329421611 start=1329422033 owner=salls at admin.default.domain
                          exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=00:01:00 session=30429 end=1329422043
                          Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
                          resources_used.walltime=00:00:10

This is what the output files look like:

-bash-4.1$ more ExampleJob.o22
Thu Feb 16 13:53:53 CST 2012
Thu Feb 16 13:54:03 CST 2012
-bash-4.1$ more ExampleJob.e22
-bash-4.1$

This is my basic server config:

[root at wings ~]# qmgr
Max open servers: 10239
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = admin.default.domain
set server acl_hosts += wings.glerl.noaa.gov
set server managers = root at wings.glerl.noaa.gov
set server managers += salls at wings.glerl.noaa.gov
set server operators = root at wings.glerl.noaa.gov
set server operators += salls at wings.glerl.noaa.gov
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 23

Processes running on server:

root     32086     1  0 13:23 ?        00:00:00 /usr/local/sbin/pbs_server -d /var/spool/torque -H admin.default.domain
root     32173     1  0 13:23 ?        00:00:00 /usr/local/sbin/pbs_sched -d /var/spool/torque

My sched_config file looks like this. I left the default values as is.

[root at wings sched_priv]# more sched_config

# This is the config file for the scheduling policy
# FORMAT: option: value prime_option
# option - the name of what we are changing defined in config.h
# value - can be boolean/string/numeric depending on the option
# prime_option - can be prime/non_prime/all ONLY FOR SOME OPTIONS

# Round Robin -
# run a job from each queue before running second job from the
# first queue.

round_robin: False all

# By Queue -
# run jobs by queues.
# If it is not set, the scheduler will look at all the jobs on
# on the server as one large queue, and ignore the queues set
# by the administrator
# PRIME OPTION

by_queue: True prime
by_queue: True non_prime

# Strict Fifo -
# run jobs in strict fifo order. If one job can not run
# move onto the next queue and do not run any more jobs
# out of that queue even if some jobs in the queue could
# be run.
# If it is not set, it could very easily starve the large
# resource using jobs.
# PRIME OPTION

strict_fifo: false ALL

#
# fair_share - schedule jobs based on usage and share values
# PRIME OPTION
#
fair_share: false ALL

# Help Starving Jobs -
# Jobs which have been waiting a long time will
# be considered starving. Once a job is considered
# starving, the scheduler will not run any jobs
# until it can run all of the starving jobs.
# PRIME OPTION

help_starving_jobs true ALL

#
# sort_queues - sort queues by the priority attribute
# PRIME OPTION
#
sort_queues true ALL

#
# load_balancing - load balance between timesharing nodes
# PRIME OPTION
#
load_balancing: false ALL

# sort_by:
# key:
# to sort the jobs on one key, specify it by sort_by
# If multiple sorts are necessary, set sory_by to multi_sort
# specify the keys in order of sorting

# if round_robin or by_queue is set, the jobs will be sorted in their
# respective queues. If not the entire server will be sorted.

# different sorts - defined in globals.c
# no_sort shortest_job_first longest_job_first smallest_memory_first
# largest_memory_first high_priority_first low_priority_first multi_sort
# fair_share large_walltime_first short_walltime_first
#
# PRIME OPTION
sort_by: shortest_job_first ALL

# filter out prolific debug messages
# 256 are DEBUG2 messages
# NO PRIME OPTION
log_filter: 256

# all queues starting with this value are dedicated time queues
# i.e. dedtime or dedicatedtime would be dedtime queues
# NO PRIME OPTION
dedicated_prefix: ded

# ignored queues
# you can specify up to 16 queues to be ignored by the scheduler
#ignore_queue: queue_name

# this defines how long before a job is considered starving. If a job has
# been queued for this long, it will be considered starving
# NO PRIME OPTION
max_starve: 24:00:00

# The following three config values are meaningless with fair share turned off

# half_life - the half life of usage for fair share
# NO PRIME OPTION
half_life: 24:00:00

# unknown_shares - the number of shares for the "unknown" group
# NO PRIME OPTION
unknown_shares: 10

# sync_time - the amount of time between syncing the usage information to disk
# NO PRIME OPTION
sync_time: 1:00:00

Any idea what I need to do?

Thanks,

Christina

--
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446

From djohnson at osc.edu Thu Feb 16 13:49:04 2012
From: djohnson at osc.edu (Doug Johnson)
Date: Thu, 16 Feb 2012 15:49:04 -0500
Subject: [torqueusers] Performance of non-GPU codes on GPU nodes reduced by nvidia-smi overhead
In-Reply-To:
References:
Message-ID:

Axel, thanks for the clarification.

David, can you update the documentation to clarify that the Tesla Deployment Kit is needed for nvml.h? The TDK is not linked to from the normal CUDA download pages, and is a bit obscure.

However, when this option is enabled (at least in torque-2.5.10), pbs_mom will immediately exit if the node does not have a GPU. Clusters that have a mix of GPU and non-GPU nodes are common. Could we do something like the following instead?

--- mom_server.c~ 2012-01-12 16:34:39.000000000 -0500
+++ mom_server.c 2012-02-16 14:51:17.480860518 -0500
@@ -1255,7 +1255,7 @@
   rc = nvmlInit();

-  if (rc == NVML_SUCCESS)
+  if (rc == NVML_SUCCESS || rc == NVML_ERROR_DRIVER_NOT_LOADED)
     return (TRUE);

   log_nvml_error (rc, NULL, id);

This would allow systems without GPUs to start the same mom as the GPU nodes. Ideally the API would also have an error such as NVML_ERROR_NO_DEVICE that would be returned if no nvidia devices existed in the system (check for PCI devices, don't rely on driver initialization failure as that's ambiguous).

Doug

At Wed, 15 Feb 2012 18:56:36 -0500, Axel Kohlmeyer wrote:
>
> On Wed, Feb 15, 2012 at 6:54 PM, Doug Johnson wrote:
> > Hi David,
> >
> > I was going to send a separate email about '--with-nvml-include' once
> > I had more time to look at the problem.  It seems that nvml.h no
> > longer exists in the newer versions of the CUDA SDK.  We have version
>
> http://developer.nvidia.com/nvidia-management-library-NVML
>
> axel.
>
> > 4.1.28 of both the gpucomputingsdk and cudatoolkit, there is no nvml.h
> > and enabling this option in torque results in failure to build.  I
> > Haven't had a chance to take a look at older versions or the release
> > notes for descriptions of when this changed.
> >
> > Is it safe to assume that if we were able to use this code, a context
> > to the cards would be kept open by the mom?
> >
> > Doug
> >
> > At Wed, 15 Feb 2012 16:22:09 -0700,
> > David Beer wrote:
> >>
> >> Doug,
> >>
> >> Have you tried using the --with-nvml-include= option in configure?
> >> This has pbs_mom use the
> >> nvidia API for these calls, and should speed things up a bit. The path should be the path to the nvml.h
> >> file and is usually:
> >> /usr/local/cuda/CUDAToolsSDK/NVML/
> >>
> >> David
> >>
> >> On Wed, Feb 15, 2012 at 4:15 PM, Doug Johnson wrote:
> >>
> >>     Hi,
> >>
> >>     Has anyone noticed the overhead when enabling GPU support in torque?
> >>     The nvidia-smi process requires about 4 cpu seconds for each
> >>     invocation.  When executing a non-GPU code that uses all the cores
> >>     this results in a bit of oversubscription of the cores.  Since
> >>     nvidia-smi is executed every 30 seconds to collect card state this
> >>     results in a measurable decrease in performance.
> >>
> >>     As a workaround I've enabled 'persistence mode' for the card.  When
> >>     not in use, the card is apparently not initialized.  With persistence
> >>     mode enabled the cpu time to execute the command is reduced to ~0.02.
> >>     This will also help with the execution time of short kernels, as the
> >>     card will be ready to go.
> >>
> >>     Do other people run with persistence mode enabled?  Are there any
> >>     downsides?
> >>
> >>     Doug
> >>
> >>     PS. I think if X were running this would not be an issue.
> >>
> >> --
> >> David Beer | Software Engineer
> >> Adaptive Computing
> >
>
> --
> Dr. Axel Kohlmeyer    akohlmey at gmail.com
> http://sites.google.com/site/akohlmey/
>
> Institute for Computational Molecular Science
> Temple University, Philadelphia PA, USA.

From gus at ldeo.columbia.edu Thu Feb 16 13:55:29 2012
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Thu, 16 Feb 2012 15:55:29 -0500
Subject: [torqueusers] jobs stuck in queue until I force execution with qrun
In-Reply-To:
References:
Message-ID: <637EFDC6-324B-4134-B57C-E63886B136CB@ldeo.columbia.edu>

Hi Christina

This is just a vague thought, not sure if in the right direction.

I am a bit confused about the domain being admin.default.domain
Is this the server name in $TORQUE/server_name on the head node?
Is it something else, perhaps the head node FQDN Internet address?

How about this line in the compute nodes' $TORQUE/mom_priv/config file:
$pbsserver .....
What is the server name that appears there?

These items were a source of confusion for me long ago.
I don't even remember anymore
what the mistake was and how it was fixed, but maybe there is something here.

Also, is there any hint of the problem in the $TORQUE/mom_logs files in the compute nodes?
How about the /var/log/messages on the compute nodes, any smoking gun there?

Can the compute nodes resolve the Torque server name [easy way via /etc/hosts]?
Can the Torque server resolve the compute nodes' names [say in /etc/hosts]?
Is there a firewall between the server and the compute nodes?

Maybe the Torque Admin Guide, Ch. 1 [overview/installation/configuration]
and Ch 11 [troubleshooting] can help:

http://www.adaptivecomputing.com/resources/docs/

I hope this helps,
Gus Correa

On Feb 16, 2012, at 3:10 PM, Christina Salls wrote:

> Hi all,
>
> My situation has improved but I am still not there.
I can submit a job successfully, but it will stay in the queue until I force execution with qrun. > > eg. > > -bash-4.1$ qsub ./example_submit_script_1 > 22.admin.default.domain > -bash-4.1$ qstat -a > > admin.default.domain: > Req'd Req'd Elap > Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- > 22.admin.default salls batch ExampleJob -- 1 1 -- 00:01 Q -- > > .[root at wings ~]# qrun 22 > [root at wings ~]# qstat -a > > admin.default.domain: > Req'd Req'd Elap > Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- > 22.admin.default salls batch ExampleJob 30429 1 1 -- 00:01 R -- > > [root at wings ~]# qstat -a > > admin.default.domain: > Req'd Req'd Elap > Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- > 22.admin.default salls batch ExampleJob 30429 1 1 -- 00:01 C 00:00 > [root at wings ~]# > > > This is what tracejob output looks like: > > [root at wings ~]# tracejob 22 > /var/spool/torque/mom_logs/20120216: No such file or directory > /var/spool/torque/sched_logs/20120216: No matching job records located > > Job: 22.admin.default.domain > > 02/16/2012 13:46:51 S enqueuing into batch, state 1 hop 1 > 02/16/2012 13:46:51 S Job Queued at request of salls at admin.default.domain, owner = salls at admin.default.domain, > job name = ExampleJob, queue = batch > 02/16/2012 13:46:51 A queue=batch > 02/16/2012 13:53:53 S Job Run at request of root at admin.default.domain > 02/16/2012 13:53:53 S Not sending email: User does not want mail of this type. 
> 02/16/2012 13:53:53 A user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611 > etime=1329421611 start=1329422033 owner=salls at admin.default.domain > exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1 > Resource_List.nodes=1 Resource_List.walltime=00:01:00 > 02/16/2012 13:54:03 S Not sending email: User does not want mail of this type. > 02/16/2012 13:54:03 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb > resources_used.walltime=00:00:10 > 02/16/2012 13:54:03 A user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611 > etime=1329421611 start=1329422033 owner=salls at admin.default.domain > exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1 > Resource_List.nodes=1 Resource_List.walltime=00:01:00 session=30429 end=1329422043 > Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb > resources_used.walltime=00:00:10 > > > This is what the output files look like: > > -bash-4.1$ more ExampleJob.o22 > Thu Feb 16 13:53:53 CST 2012 > Thu Feb 16 13:54:03 CST 2012 > -bash-4.1$ more ExampleJob.e22 > -bash-4.1$ > > This is my basic server config: > > [root at wings ~]# qmgr > Max open servers: 10239 > Qmgr: print server > # > # Create queues and set their attributes. > # > # > # Create and define queue batch > # > create queue batch > set queue batch queue_type = Execution > set queue batch resources_default.nodes = 1 > set queue batch resources_default.walltime = 01:00:00 > set queue batch enabled = True > set queue batch started = True > # > # Set server attributes. 
> # > set server scheduling = True > set server acl_hosts = admin.default.domain > set server acl_hosts += wings.glerl.noaa.gov > set server managers = root at wings.glerl.noaa.gov > set server managers += salls at wings.glerl.noaa.gov > set server operators = root at wings.glerl.noaa.gov > set server operators += salls at wings.glerl.noaa.gov > set server default_queue = batch > set server log_events = 511 > set server mail_from = adm > set server scheduler_iteration = 600 > set server node_check_rate = 150 > set server tcp_timeout = 6 > set server mom_job_sync = True > set server keep_completed = 300 > set server next_job_number = 23 > > Processes running on server: > > root 32086 1 0 13:23 ? 00:00:00 /usr/local/sbin/pbs_server -d /var/spool/torque -H admin.default.domain > root 32173 1 0 13:23 ? 00:00:00 /usr/local/sbin/pbs_sched -d /var/spool/torque > > > My sched_config file looks like this. I left the default values as is. > > [root at wings sched_priv]# more sched_config > > > # This is the config file for the scheduling policy > # FORMAT: option: value prime_option > # option - the name of what we are changing defined in config.h > # value - can be boolean/string/numeric depending on the option > # prime_option - can be prime/non_prime/all ONLY FOR SOME OPTIONS > > # Round Robin - > # run a job from each queue before running second job from the > # first queue. > > round_robin: False all > > > # By Queue - > # run jobs by queues. > # If it is not set, the scheduler will look at all the jobs on > # on the server as one large queue, and ignore the queues set > # by the administrator > # PRIME OPTION > > by_queue: True prime > by_queue: True non_prime > > > # Strict Fifo - > # run jobs in strict fifo order. If one job can not run > # move onto the next queue and do not run any more jobs > # out of that queue even if some jobs in the queue could > # be run. > # If it is not set, it could very easily starve the large > # resource using jobs. 
> # PRIME OPTION > > strict_fifo: false ALL > > # > # fair_share - schedule jobs based on usage and share values > # PRIME OPTION > # > fair_share: false ALL > > # Help Starving Jobs - > # Jobs which have been waiting a long time will > # be considered starving. Once a job is considered > # starving, the scheduler will not run any jobs > # until it can run all of the starving jobs. > # PRIME OPTION > > help_starving_jobs true ALL > > # > # sort_queues - sort queues by the priority attribute > # PRIME OPTION > # > sort_queues true ALL > > # > # load_balancing - load balance between timesharing nodes > # PRIME OPTION > # > load_balancing: false ALL > > # sort_by: > # key: > # to sort the jobs on one key, specify it by sort_by > # If multiple sorts are necessary, set sory_by to multi_sort > # specify the keys in order of sorting > > # if round_robin or by_queue is set, the jobs will be sorted in their > # respective queues. If not the entire server will be sorted. > > # different sorts - defined in globals.c > # no_sort shortest_job_first longest_job_first smallest_memory_first > # largest_memory_first high_priority_first low_priority_first multi_sort > # fair_share large_walltime_first short_walltime_first > # > # PRIME OPTION > sort_by: shortest_job_first ALL > > # filter out prolific debug messages > # 256 are DEBUG2 messages > # NO PRIME OPTION > log_filter: 256 > > # all queues starting with this value are dedicated time queues > # i.e. dedtime or dedicatedtime would be dedtime queues > # NO PRIME OPTION > dedicated_prefix: ded > > # ignored queues > # you can specify up to 16 queues to be ignored by the scheduler > #ignore_queue: queue_name > > # this defines how long before a job is considered starving. 
If a job has > # been queued for this long, it will be considered starving > # NO PRIME OPTION > max_starve: 24:00:00 > > # The following three config values are meaningless with fair share turned off > > # half_life - the half life of usage for fair share > # NO PRIME OPTION > half_life: 24:00:00 > > # unknown_shares - the number of shares for the "unknown" group > # NO PRIME OPTION > unknown_shares: 10 > > # sync_time - the amount of time between syncing the usage information to disk > # NO PRIME OPTION > sync_time: 1:00:00 > > > Any idea what I need to do? > > Thanks, > > Christina > > > -- > Christina A. Salls > GLERL Computer Group > help.glerl at noaa.gov > Help Desk x2127 > Christina.Salls at noaa.gov > Voice Mail 734-741-2446 > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From gus at ldeo.columbia.edu Thu Feb 16 14:05:30 2012 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Thu, 16 Feb 2012 16:05:30 -0500 Subject: [torqueusers] jobs stuck in queue until I force execution with qrun In-Reply-To: <637EFDC6-324B-4134-B57C-E63886B136CB@ldeo.columbia.edu> References: <637EFDC6-324B-4134-B57C-E63886B136CB@ldeo.columbia.edu> Message-ID: <733F01B5-7438-411A-8382-F1AE72D1E930@ldeo.columbia.edu> PS - For some diagnostic, you could also try '$TORQUE/bin/pbsnodes' on the server, and '$TORQUE/sbin/momctl -d 3' on the compute nodes. Gus Correa
From christina.salls at noaa.gov Thu Feb 16 15:04:18 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Thu, 16 Feb 2012 17:04:18 -0500 Subject: [torqueusers] jobs stuck in queue until I force execution with qrun In-Reply-To: <637EFDC6-324B-4134-B57C-E63886B136CB@ldeo.columbia.edu> References: <637EFDC6-324B-4134-B57C-E63886B136CB@ldeo.columbia.edu> Message-ID: On Thu, Feb 16, 2012 at 3:55 PM, Gustavo Correa wrote: > Hi Christina > > This is just a vague thought, not sure if in the right direction. > > I am a bit confused about the domain being admin.default.domain > Is this the sever name in $TORQUE/server_name on the head node? > Yes, this is the name server's second interface, on the private network to the compute nodes, and it is the name in the Torque/server_name file on the head node and compute nodes.
[root at wings torque]# more server_name
admin.default.domain
[root at n001 torque]# more server_name
admin.default.domain

> Is it something else, perhaps the head node FQDN Internet address?
>
> How about this line in the compute nodes' $TORQUE/mom_priv/config file:
> $pbsserver .....
> What is the server name that appears there?

oh oh!! There is no /var/spool/torque/mom_priv/config file!! What should that look like?

> These items were a source of confusion for me long ago.
> I don't even remember anymore
> what was the mistake and how it was fixed, but maybe there is something here.
> Also, is there any hint of the problem in the $TORQUE/mom_logs files in the compute nodes?

02/16/2012 13:23:18;0002; pbs_mom;Svr;im_eof;End of File from addr 10.0.10.1:15001
02/16/2012 13:23:18;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server admin.default.domain
02/16/2012 13:23:29;0001; pbs_mom;Svr;mom_server_valid_message_source;duplicate connection from 10.0.10.1:1023 - closing original connection
02/16/2012 13:24:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/16/2012 13:29:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/16/2012 13:34:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/16/2012 13:39:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/16/2012 13:44:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/16/2012 13:49:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/16/2012 13:53:53;0001; pbs_mom;Job;TMomFinalizeJob3;job 22.admin.default.domain started, pid = 30429
02/16/2012 13:54:03;0080; pbs_mom;Job;22.admin.default.domain;scan_for_terminated: job 22.admin.default.domain task 1 terminated, sid=30429
02/16/2012 13:54:03;0008; pbs_mom;Job;22.admin.default.domain;job was terminated
02/16/2012 13:54:03;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
02/16/2012 13:54:03;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
02/16/2012 13:54:03;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
02/16/2012 13:54:03;0080; pbs_mom;Job;22.admin.default.domain;obit sent to server
02/16/2012 13:54:03;0080; pbs_mom;Job;22.admin.default.domain;removed job script
02/16/2012 13:54:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/16/2012 13:59:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/16/2012 14:04:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0
02/16/2012 14:09:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0

> How about the /var/log/messages on the compute nodes, any smoking gun there?

AHA!! This might be a clue!!

Feb 16 03:06:03 n001 rpc.idmapd[2776]: nss_getpwnam: name 'root at glerl.noaa.gov' does not map into domain 'default.domain'
Feb 16 10:14:54 n001 rpc.idmapd[2776]: nss_getpwnam: name 'root at glerl.noaa.gov' does not map into domain 'default.domain'
Feb 16 15:49:23 n001 rpc.idmapd[2776]: nss_getpwnam: name 'root at glerl.noaa.gov' does not map into domain 'default.domain'
[root at n001 mom_logs]#

> Can the compute nodes resolve the Torque server name [easy way via /etc/hosts]?

yes

From /etc/hosts file

# Management Entries
10.0.10.1 admin.default.domain admin loghost
192.168.20.1 admin-ib.default.domain admin-ib loghost-ib

# Ethernet Node Entries
10.0.1.1 n001.default.domain n001
10.0.1.2 n002.default.domain n002
10.0.1.3 n003.default.domain n003
.........

> Can the Torque server resolve the compute nodes' names [ say in /etc/hosts]?
> yes From the /etc/hosts file on the server # Management Entries 10.0.10.1 admin.default.domain admin loghost 192.168.20.1 admin-ib.default.domain admin-ib loghost-ib # Ethernet Node Entries 10.0.1.1 n001.default.domain n001 10.0.1.2 n002.default.domain n002 10.0.1.3 n003.default.domain n003 10.0.1.4 n004.default.domain n004 > Is there a firewall between the server and the compute nodes? > no firewall enabled. > > Maybe the Torque Admin Guide, Ch. 1 [overview/installation/configuration] > and Ch 11 [troubleshooting] can help: > > http://www.adaptivecomputing.com/resources/docs/ > > I hope this helps, > Thanks Gus!! I will review the Admin Guide. It is what I used to do the setup but I have been changing things right and left! I have also read the troubleshooting guide to no avail. Back to the drawing board. -- Christina A. Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446
From christina.salls at noaa.gov Thu Feb 16 15:19:20 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Thu, 16 Feb 2012 17:19:20 -0500 Subject: [torqueusers] jobs stuck in queue until I force execution with qrun In-Reply-To: <733F01B5-7438-411A-8382-F1AE72D1E930@ldeo.columbia.edu> References: <637EFDC6-324B-4134-B57C-E63886B136CB@ldeo.columbia.edu> <733F01B5-7438-411A-8382-F1AE72D1E930@ldeo.columbia.edu> Message-ID: On Thu, Feb 16, 2012 at 4:05 PM, Gustavo Correa wrote: > PS - For some diagnostic, you could also try '$TORQUE/bin/pbsnodes' on the > server, > [root at wings ~]# pbsnodes n001.default.domain state = free np = 1 ntype = cluster status = rectime=1329430696,varattr=,jobs=,state=free,netload=42970654,gres=,loadave=0.03,ncpus=24,physmem=20463136kb,availmem=27788364kb,totmem=28655128kb,idletime=177266,nusers=1,nsessions=1,sessions=17382,uname=Linux n001 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 n002.default.domain state = free np = 1 ntype = cluster status = rectime=1329430653,varattr=,jobs=,state=free,netload=41152440,gres=,loadave=0.00,ncpus=24,physmem=24600084kb,availmem=31877036kb,totmem=32792076kb,idletime=177252,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux n002 2.6.32-131.0.15.el6.x86_64 #1 SMP Tue May 10 15:42:40 EDT 2011 x86_64,opsys=linux gpus = 0 These look good, right? > and '$TORQUE/sbin/momctl -d 3' on the compute nodes.
> [root at n001 sbin]# momctl -d 3 Host: n001/n001.default.domain Version: 2.5.9 PID: 3598 Server[0]: admin.default.domain (10.0.10.1:1023) Init Msgs Received: 2 hellos/2 cluster-addrs Init Msgs Sent: 6 hellos Last Msg From Server: 8595 seconds (DeleteJob) Last Msg To Server: 32 seconds HomeDirectory: /var/spool/torque/mom_priv stdout/stderr spool directory: '/var/spool/torque/spool/' (23252610 blocks available) NOTE: syslog enabled MOM active: 176853 seconds Check Poll Time: 45 seconds Server Update Interval: 45 seconds LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) Communication Model: RPP MemLocked: TRUE (mlock) TCP Timeout: 20 seconds Prolog: /var/spool/torque/mom_priv/prologue (disabled) Alarm Time: 0 of 10 seconds Trusted Client List: 10.0.1.20,10.0.1.19,10.0.1.18,10.0.1.17,10.0.1.16,10.0.1.15,10.0.1.14,10.0.1.13,10.0.1.12,10.0.1.11,10.0.1.10,10.0.1.9,10.0.1.8,10.0.1.7,10.0.1.6,10.0.1.5,10.0.1.4,10.0.1.3,10.0.1.2,10.0.10.1,10.0.1.1,127.0.0.1 Copy Command: /usr/bin/scp -rpB NOTE: no local jobs detected diagnostics complete > Gus Correa >
From jjc at iastate.edu Thu Feb 16 16:05:28 2012 From: jjc at iastate.edu (Coyle, James J [ITACD]) Date: Thu, 16 Feb 2012 23:05:28 +0000 Subject: [torqueusers] Showing feature properties of all nodes? In-Reply-To: References: Message-ID: <242421BFAF465844BE24EB90BB97E2210196AD8E@ITSDAG1D.its.iastate.edu> pbsnodes -a >-----Original Message----- >From: torqueusers-bounces at supercluster.org [mailto:torqueusers- >bounces at supercluster.org] On Behalf Of Rainer M Krug >Sent: Wednesday, February 15, 2012 7:31 AM >To: torqueusers at supercluster.org >Subject: [torqueusers] Showing feature properties of all nodes? > >-----BEGIN PGP SIGNED MESSAGE----- >Hash: SHA1 > >Hi > >possibly simple question: How can I see the feature properties of >all nodes?
I an just a normal user. > >Thanks, > >Rainer >-----BEGIN PGP SIGNATURE----- >Version: GnuPG v1.4.11 (GNU/Linux) >Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > >iEYEARECAAYFAk87s4EACgkQoYgNqgF2egoT0wCZAZlQ9P7x9+6w4CrnXHmJADTf >GGwAn0LoVoqhQKxLZ3mtATmUK4JGScyt >=8nVF >-----END PGP SIGNATURE----- > >_______________________________________________ >torqueusers mailing list >torqueusers at supercluster.org >http://www.supercluster.org/mailman/listinfo/torqueusers From gus at ldeo.columbia.edu Thu Feb 16 17:51:34 2012 From: gus at ldeo.columbia.edu (Gustavo Correa) Date: Thu, 16 Feb 2012 19:51:34 -0500 Subject: [torqueusers] jobs stuck in queue until I force execution with qrun In-Reply-To: References: <637EFDC6-324B-4134-B57C-E63886B136CB@ldeo.columbia.edu> Message-ID: Hi Christina On Feb 16, 2012, at 5:04 PM, Christina Salls wrote: > > > On Thu, Feb 16, 2012 at 3:55 PM, Gustavo Correa wrote: > Hi Christina > > This is just a vague thought, not sure if in the right direction. > > I am a bit confused about the domain being admin.default.domain > Is this the sever name in $TORQUE/server_name on the head node? > > Yes, this is the name server's second interface, on the private network to the compute nodes, and it is the name in the Torque/server_name file on the head node and compute nodes. > > [root at wings torque]# more server_name > admin.default.domain > [root at n001 torque]# more server_name > admin.default.domain > > Is it something else, perhaps the head node FQDN Internet address? > > > > How about this line in the compute nodes' $TORQUE/mom_priv/config file: > $pbsserver ..... > What is the server name that appears there? > > oh oh!! There is no /var/spool/torque/mom_priv/config file!! What should that look like? 
> Something typical: $pbsserver name_of_server_in_the_local_subnet [probably 'admin' for you] $usecp *:/home /home [the second line is for shared / NFS mounted directories, to copy files with cp rather than scp, one line per directory/filesystem] > These items were a source of confusion for me long ago. > I don't even remember anymore > what was the mistake and how it was fixed, but maybe there is something here. > > Also, is there any hint of the problem in the $TORQUE/mom_logs files in the compute nodes? > > 02/16/2012 13:23:18;0002; pbs_mom;Svr;im_eof;End of File from addr 10.0.10.1:15001 > 02/16/2012 13:23:18;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server admin.default.domain > 02/16/2012 13:23:29;0001; pbs_mom;Svr;mom_server_valid_message_source;duplicate connection from 10.0.10.1:1023 - cl > osing original connection > 02/16/2012 13:24:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0 > 02/16/2012 13:29:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0 > 02/16/2012 13:34:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0 > 02/16/2012 13:39:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0 > 02/16/2012 13:44:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0 > 02/16/2012 13:49:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0 > 02/16/2012 13:53:53;0001; pbs_mom;Job;TMomFinalizeJob3;job 22.admin.default.domain started, pid = 30429 > 02/16/2012 13:54:03;0080; pbs_mom;Job;22.admin.default.domain;scan_for_terminated: job 22.admin.default.domain task > 1 terminated, sid=30429 > 02/16/2012 13:54:03;0008; pbs_mom;Job;22.admin.default.domain;job was terminated > 02/16/2012 13:54:03;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 02/16/2012 13:54:03;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 02/16/2012 13:54:03;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from 
job stat > 02/16/2012 13:54:03;0080; pbs_mom;Job;22.admin.default.domain;obit sent to server > 02/16/2012 13:54:03;0080; pbs_mom;Job;22.admin.default.domain;removed job script > 02/16/2012 13:54:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0 > 02/16/2012 13:59:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0 > 02/16/2012 14:04:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0 > 02/16/2012 14:09:46;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.5.9, loglevel = 0 > > How about the /var/log/messages on the compute nodes, any smoking gun there? > > AHA!! This might be a clue!! > > Feb 16 03:06:03 n001 rpc.idmapd[2776]: nss_getpwnam: name 'root at glerl.noaa.gov' does not map into domain 'default.domain' > Feb 16 10:14:54 n001 rpc.idmapd[2776]: nss_getpwnam: name 'root at glerl.noaa.gov' does not map into domain 'default.domain' > Feb 16 15:49:23 n001 rpc.idmapd[2776]: nss_getpwnam: name 'root at glerl.noaa.gov' does not map into domain 'default.domain' > [root at n001 mom_logs]# > > Here I use the FQDN as the server name in $TORQUE/server_name on the head node [but in mom_priv/config the server local name]. My guess is that the password issue above may be because you are using the server local name in server_name instead. It may be worth trying glerl.noaa.gov as the server name, at least for a test: restart everything, try ... > Can the compute nodes resolve the Torque server name [easy way via /etc/hosts]? > > yes > > From /etc/hosts file > > # Management Entries > > 10.0.10.1 admin.default.domain admin loghost > 192.168.20.1 admin-ib.default.domain admin-ib loghost-ib > > # Ethernet Node Entries > > 10.0.1.1 n001.default.domain n001 > 10.0.1.2 n002.default.domain n002 > 10.0.1.3 n003.default.domain n003 > ......... > Can the Torque server resolve the compute nodes' names [ say in /etc/hosts]? > > yes Sure, /etc/hosts looks right.
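A check like this could also be done mechanically by comparing the /etc/hosts text from the server with the one from a node. The following is only a sketch and not something posted in the thread: the function names parse_hosts and mismatches are hypothetical, SAMPLE is a subset of the entries quoted in this message, and the code compares the file contents only (it does not query the resolver).

```python
def parse_hosts(text):
    """Map every host name and alias in /etc/hosts-style text to its IP address."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name, ip)  # keep the first mapping seen for a name
    return table


def mismatches(server_hosts, node_hosts, names):
    """Return the names that are missing on the server or differ between the two files."""
    server = parse_hosts(server_hosts)
    node = parse_hosts(node_hosts)
    return [n for n in names if server.get(n) is None or server.get(n) != node.get(n)]


# Hypothetical sample: a subset of the entries quoted in this message.
SAMPLE = """
# Management Entries
10.0.10.1 admin.default.domain admin loghost
# Ethernet Node Entries
10.0.1.1 n001.default.domain n001
10.0.1.2 n002.default.domain n002
"""

if __name__ == "__main__":
    # Identical files on both machines give an empty list of mismatches.
    print(mismatches(SAMPLE, SAMPLE, ["admin.default.domain", "n001", "n002"]))
```

In practice the two strings would be read from the server's and a node's /etc/hosts; an empty list means the names of interest map to the same addresses in both files.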
> > From the /etc/hosts file on the server > > # Management Entries > > 10.0.10.1 admin.default.domain admin loghost > 192.168.20.1 admin-ib.default.domain admin-ib loghost-ib > > # Ethernet Node Entries > > 10.0.1.1 n001.default.domain n001 > 10.0.1.2 n002.default.domain n002 > 10.0.1.3 n003.default.domain n003 > 10.0.1.4 n004.default.domain n004 > Again looks right > > Is there a firewall between the server and the compute nodes? > > no firewall enabled. > Except between the server and the Internet, I presume. :) [NOAA asks me tons of RSA tokens and passwords to get to any computer there ... :) ] > Maybe the Torque Admin Guide, Ch. 1 [overview/installation/configuration] > and Ch 11 [troubleshooting] can help: > > http://www.adaptivecomputing.com/resources/docs/ > > I hope this helps, > > Thanks Gus!! I will review the Admin Guide. It is what I used to do the setup but I have been changing things right and left! > I have also read the troubleshooting guide to no avail. Back to the drawing board. > That is not perfect, but it is quite useful documentation. Gus Correa > Gus Correa > > On Feb 16, 2012, at 3:10 PM, Christina Salls wrote: > > > Hi all, > > > > My situation has improved but I am still not there. I can submit a job successfully, but it will stay in the queue until I force execution with qrun. > > > > eg. 
> > > > -bash-4.1$ qsub ./example_submit_script_1 > > 22.admin.default.domain > > -bash-4.1$ qstat -a > > > > admin.default.domain: > > Req'd Req'd Elap > > Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time > > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- > > 22.admin.default salls batch ExampleJob -- 1 1 -- 00:01 Q -- > > > > .[root at wings ~]# qrun 22 > > [root at wings ~]# qstat -a > > > > admin.default.domain: > > Req'd Req'd Elap > > Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time > > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- > > 22.admin.default salls batch ExampleJob 30429 1 1 -- 00:01 R -- > > > > [root at wings ~]# qstat -a > > > > admin.default.domain: > > Req'd Req'd Elap > > Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time > > -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - ----- > > 22.admin.default salls batch ExampleJob 30429 1 1 -- 00:01 C 00:00 > > [root at wings ~]# > > > > > > This is what tracejob output looks like: > > > > [root at wings ~]# tracejob 22 > > /var/spool/torque/mom_logs/20120216: No such file or directory > > /var/spool/torque/sched_logs/20120216: No matching job records located > > > > Job: 22.admin.default.domain > > > > 02/16/2012 13:46:51 S enqueuing into batch, state 1 hop 1 > > 02/16/2012 13:46:51 S Job Queued at request of salls at admin.default.domain, owner = salls at admin.default.domain, > > job name = ExampleJob, queue = batch > > 02/16/2012 13:46:51 A queue=batch > > 02/16/2012 13:53:53 S Job Run at request of root at admin.default.domain > > 02/16/2012 13:53:53 S Not sending email: User does not want mail of this type. 
> > 02/16/2012 13:53:53 A user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611 > > etime=1329421611 start=1329422033 owner=salls at admin.default.domain > > exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1 > > Resource_List.nodes=1 Resource_List.walltime=00:01:00 > > 02/16/2012 13:54:03 S Not sending email: User does not want mail of this type. > > 02/16/2012 13:54:03 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb > > resources_used.walltime=00:00:10 > > 02/16/2012 13:54:03 A user=salls group=man jobname=ExampleJob queue=batch ctime=1329421611 qtime=1329421611 > > etime=1329421611 start=1329422033 owner=salls at admin.default.domain > > exec_host=n001.default.domain/0 Resource_List.neednodes=1 Resource_List.nodect=1 > > Resource_List.nodes=1 Resource_List.walltime=00:01:00 session=30429 end=1329422043 > > Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb > > resources_used.walltime=00:00:10 > > > > > > This is what the output files look like: > > > > -bash-4.1$ more ExampleJob.o22 > > Thu Feb 16 13:53:53 CST 2012 > > Thu Feb 16 13:54:03 CST 2012 > > -bash-4.1$ more ExampleJob.e22 > > -bash-4.1$ > > > > This is my basic server config: > > > > [root at wings ~]# qmgr > > Max open servers: 10239 > > Qmgr: print server > > # > > # Create queues and set their attributes. > > # > > # > > # Create and define queue batch > > # > > create queue batch > > set queue batch queue_type = Execution > > set queue batch resources_default.nodes = 1 > > set queue batch resources_default.walltime = 01:00:00 > > set queue batch enabled = True > > set queue batch started = True > > # > > # Set server attributes. 
> > # > > set server scheduling = True > > set server acl_hosts = admin.default.domain > > set server acl_hosts += wings.glerl.noaa.gov > > set server managers = root at wings.glerl.noaa.gov > > set server managers += salls at wings.glerl.noaa.gov > > set server operators = root at wings.glerl.noaa.gov > > set server operators += salls at wings.glerl.noaa.gov > > set server default_queue = batch > > set server log_events = 511 > > set server mail_from = adm > > set server scheduler_iteration = 600 > > set server node_check_rate = 150 > > set server tcp_timeout = 6 > > set server mom_job_sync = True > > set server keep_completed = 300 > > set server next_job_number = 23 > > > > Processes running on server: > > > > root 32086 1 0 13:23 ? 00:00:00 /usr/local/sbin/pbs_server -d /var/spool/torque -H admin.default.domain > > root 32173 1 0 13:23 ? 00:00:00 /usr/local/sbin/pbs_sched -d /var/spool/torque > > > > > > My sched_config file looks like this. I left the default values as is. > > > > [root at wings sched_priv]# more sched_config > > > > > > # This is the config file for the scheduling policy > > # FORMAT: option: value prime_option > > # option - the name of what we are changing defined in config.h > > # value - can be boolean/string/numeric depending on the option > > # prime_option - can be prime/non_prime/all ONLY FOR SOME OPTIONS > > > > # Round Robin - > > # run a job from each queue before running second job from the > > # first queue. > > > > round_robin: False all > > > > > > # By Queue - > > # run jobs by queues. > > # If it is not set, the scheduler will look at all the jobs on > > # on the server as one large queue, and ignore the queues set > > # by the administrator > > # PRIME OPTION > > > > by_queue: True prime > > by_queue: True non_prime > > > > > > # Strict Fifo - > > # run jobs in strict fifo order. 
If one job can not run > > # move onto the next queue and do not run any more jobs > > # out of that queue even if some jobs in the queue could > > # be run. > > # If it is not set, it could very easily starve the large > > # resource using jobs. > > # PRIME OPTION > > > > strict_fifo: false ALL > > > > # > > # fair_share - schedule jobs based on usage and share values > > # PRIME OPTION > > # > > fair_share: false ALL > > > > # Help Starving Jobs - > > # Jobs which have been waiting a long time will > > # be considered starving. Once a job is considered > > # starving, the scheduler will not run any jobs > > # until it can run all of the starving jobs. > > # PRIME OPTION > > > > help_starving_jobs true ALL > > > > # > > # sort_queues - sort queues by the priority attribute > > # PRIME OPTION > > # > > sort_queues true ALL > > > > # > > # load_balancing - load balance between timesharing nodes > > # PRIME OPTION > > # > > load_balancing: false ALL > > > > # sort_by: > > # key: > > # to sort the jobs on one key, specify it by sort_by > > # If multiple sorts are necessary, set sory_by to multi_sort > > # specify the keys in order of sorting > > > > # if round_robin or by_queue is set, the jobs will be sorted in their > > # respective queues. If not the entire server will be sorted. > > > > # different sorts - defined in globals.c > > # no_sort shortest_job_first longest_job_first smallest_memory_first > > # largest_memory_first high_priority_first low_priority_first multi_sort > > # fair_share large_walltime_first short_walltime_first > > # > > # PRIME OPTION > > sort_by: shortest_job_first ALL > > > > # filter out prolific debug messages > > # 256 are DEBUG2 messages > > # NO PRIME OPTION > > log_filter: 256 > > > > # all queues starting with this value are dedicated time queues > > # i.e. 
dedtime or dedicatedtime would be dedtime queues > > # NO PRIME OPTION > > dedicated_prefix: ded > > > > # ignored queues > > # you can specify up to 16 queues to be ignored by the scheduler > > #ignore_queue: queue_name > > > > # this defines how long before a job is considered starving. If a job has > > # been queued for this long, it will be considered starving > > # NO PRIME OPTION > > max_starve: 24:00:00 > > > > # The following three config values are meaningless with fair share turned off > > > > # half_life - the half life of usage for fair share > > # NO PRIME OPTION > > half_life: 24:00:00 > > > > # unknown_shares - the number of shares for the "unknown" group > > # NO PRIME OPTION > > unknown_shares: 10 > > > > # sync_time - the amount of time between syncing the usage information to disk > > # NO PRIME OPTION > > sync_time: 1:00:00 > > > > > > Any idea what I need to do? > > > > Thanks, > > > > Christina > > > > > > -- > > Christina A. Salls > > GLERL Computer Group > > help.glerl at noaa.gov > > Help Desk x2127 > > Christina.Salls at noaa.gov > > Voice Mail 734-741-2446 > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > -- > Christina A. Salls > GLERL Computer Group > help.glerl at noaa.gov > Help Desk x2127 > Christina.Salls at noaa.gov > Voice Mail 734-741-2446 > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From cholam20 at yahoo.co.in Thu Feb 16 18:47:22 2012 From: cholam20 at yahoo.co.in (revathi ganesh) Date: Fri, 17 Feb 2012 07:17:22 +0530 (IST) Subject: [torqueusers] Hey How are you?. 
Message-ID: <1329443242.3982.androidMobile@web137305.mail.in.yahoo.com>


-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120217/1083a0b1/attachment.html

From dbeer at adaptivecomputing.com Fri Feb 17 13:10:41 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 17 Feb 2012 13:10:41 -0700
Subject: [torqueusers] Performance of non-GPU codes on GPU nodes reduced by nvidia-smi overhead
In-Reply-To:
References:
Message-ID:

Doug,

I have created a ticket for our documentation team to note that the TDK is where nvml.h can be found.

We also thank you for the patch. I believe there is some more work that needs to be done beyond just this change, but we will look to get that done very soon. I think it would be ideal to allow people to use the same binary for both GPU-enabled and non-GPU-enabled nodes.

David

On Thu, Feb 16, 2012 at 1:49 PM, Doug Johnson wrote:
> Axel, thanks for the clarification. David, can you update the
> documentation to clarify that the Tesla Deployment Kit is needed
> for nvml.h? The TDK is not linked to from the normal CUDA download
> pages, and is a bit obscure.
>
> However, when this option is enabled (at least in torque-2.5.10),
> pbs_mom will immediately exit if the node does not have a GPU.
> Clusters that have a mix of GPU and non-GPU nodes are common. Could
> we do something like the following instead?
>
> --- mom_server.c~       2012-01-12 16:34:39.000000000 -0500
> +++ mom_server.c        2012-02-16 14:51:17.480860518 -0500
> @@ -1255,7 +1255,7 @@
>
>    rc = nvmlInit();
>
> -  if (rc == NVML_SUCCESS)
> +  if (rc == NVML_SUCCESS || rc == NVML_ERROR_DRIVER_NOT_LOADED)
>      return (TRUE);
>
>    log_nvml_error (rc, NULL, id);
>
> This would allow systems without GPUs to start the same mom as the GPU
> nodes. Ideally the API would also have an error such as
> NVML_ERROR_NO_DEVICE that would be returned if no nvidia devices
> existed in the system (check for PCI devices; don't rely on driver
> initialization failure, as that's ambiguous).
>
> Doug
>
>
> At Wed, 15 Feb 2012 18:56:36 -0500,
> Axel Kohlmeyer wrote:
> >
> > On Wed, Feb 15, 2012 at 6:54 PM, Doug Johnson wrote:
> > > Hi David,
> > >
> > > I was going to send a separate email about '--with-nvml-include' once
> > > I had more time to look at the problem. It seems that nvml.h no
> > > longer exists in the newer versions of the CUDA SDK. We have version
> >
> > http://developer.nvidia.com/nvidia-management-library-NVML
> >
> > axel.
> >
> > > 4.1.28 of both the gpucomputingsdk and cudatoolkit, there is no nvml.h
> > > and enabling this option in torque results in failure to build. I
> > > haven't had a chance to take a look at older versions or the release
> > > notes for descriptions of when this changed.
> > >
> > > Is it safe to assume that if we were able to use this code, a context
> > > to the cards would be kept open by the mom?
> > >
> > > Doug
> > >
> > > At Wed, 15 Feb 2012 16:22:09 -0700,
> > > David Beer wrote:
> > >>
> > >> Doug,
> > >>
> > >> Have you tried using the --with-nvml-include= option in configure?
> > >> This has pbs_mom use the nvidia API for these calls, and should
> > >> speed things up a bit. The path should be the path to the nvml.h
> > >> file and is usually:
> > >> /usr/local/cuda/CUDAToolsSDK/NVML/
> > >>
> > >> David
> > >>
> > >> On Wed, Feb 15, 2012 at 4:15 PM, Doug Johnson wrote:
> > >>
> > >> Hi,
> > >>
> > >> Has anyone noticed the overhead when enabling GPU support in torque?
> > >> The nvidia-smi process requires about 4 cpu seconds for each
> > >> invocation. When executing a non-GPU code that uses all the cores
> > >> this results in a bit of oversubscription of the cores. Since
> > >> nvidia-smi is executed every 30 seconds to collect card state this
> > >> results in a measurable decrease in performance.
> > >>
> > >> As a workaround I've enabled 'persistence mode' for the card. When
> > >> not in use, the card is apparently not initialized. With persistence
> > >> mode enabled the cpu time to execute the command is reduced to ~0.02.
> > >> This will also help with the execution time of short kernels, as the
> > >> card will be ready to go.
> > >>
> > >> Do other people run with persistence mode enabled? Are there any
> > >> downsides?
> > >>
> > >> Doug
> > >>
> > >> PS. I think if X were running this would not be an issue.
> > >>
> > >> --
> > >> David Beer | Software Engineer
> > >> Adaptive Computing
> >
> > --
> > Dr. Axel Kohlmeyer akohlmey at gmail.com
> > http://sites.google.com/site/akohlmey/
> >
> > Institute for Computational Molecular Science
> > Temple University, Philadelphia PA, USA.
> > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- David Beer | Software Engineer Adaptive Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120217/6c9f3e05/attachment.html From christina.salls at noaa.gov Fri Feb 17 14:07:47 2012 From: christina.salls at noaa.gov (Christina Salls) Date: Fri, 17 Feb 2012 16:07:47 -0500 Subject: [torqueusers] Scheduler bound to ETHO IP port Message-ID: Hi all, I have been experiencing a problem with jobs staying in my default queue until I force execution with a qrun. It turns out that the reason is that my torque server is configured on my second ethernet interface which is connected to my compute nodes. The problem is that the scheduler is bound to the 1st interface port. [root at wings server_logs]# ps -ef | grep pbs root 1268 1 0 13:56 ? 00:00:00 /usr/local/sbin/pbs_server -d /var/spool/torque -H admin.default.domain root 14768 1 0 14:25 ? 
00:00:00 /usr/local/sbin/pbs_sched -d /var/spool/torque root 21956 16623 0 14:41 pts/25 00:00:00 grep pbs [root at wings server_logs]# lsof -p 14768 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME pbs_sched 14768 root cwd DIR 8,98 4096 6032970 /var/spool/torque/sched_priv pbs_sched 14768 root rtd DIR 8,98 4096 2 / pbs_sched 14768 root txt REG 8,98 268782 3421344 /usr/local/sbin/pbs_sched pbs_sched 14768 root mem REG 8,98 156872 3276802 /lib64/ ld-2.12.so pbs_sched 14768 root mem REG 8,98 1979000 3276803 /lib64/ libc-2.12.so pbs_sched 14768 root mem REG 8,98 65928 3277205 /lib64/ libnss_files-2.12.so pbs_sched 14768 root mem REG 8,98 791107 3418524 /usr/local/lib/libtorque.so.2.0.0 pbs_sched 14768 root 0r CHR 1,3 0t0 3772 /dev/null pbs_sched 14768 root 1w REG 8,98 0 6033331 /var/spool/torque/sched_priv/sched_out pbs_sched 14768 root 2w REG 8,98 0 6033331 /var/spool/torque/sched_priv/sched_out pbs_sched 14768 root 3w REG 8,98 2699 6033359 /var/spool/torque/sched_logs/20120217 pbs_sched 14768 root 4u IPv4 801882953 0t0 TCP wings.glerl.noaa.gov:15004 (LISTEN) pbs_sched 14768 root 5wW REG 8,98 7 6033329 /var/spool/torque/sched_priv/sched.lock pbs_sched 14768 root 6r REG 8,98 4374 6032952 /var/spool/torque/sched_priv/resource_group pbs_sched 14768 root 7w REG 8,98 0 6033360 /var/spool/torque/sched_priv/accounting/20120217 [root at wings server_logs]# cd .. 
[root at wings torque]# ls aux checkpoint job_logs mom_logs mom_priv pbs_environment sched_logs sched_priv server_logs server_name server_priv spool undelivered [root at wings torque]# lsof -n -p 14768 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME pbs_sched 14768 root cwd DIR 8,98 4096 6032970 /var/spool/torque/sched_priv pbs_sched 14768 root rtd DIR 8,98 4096 2 / pbs_sched 14768 root txt REG 8,98 268782 3421344 /usr/local/sbin/pbs_sched pbs_sched 14768 root mem REG 8,98 156872 3276802 /lib64/ ld-2.12.so pbs_sched 14768 root mem REG 8,98 1979000 3276803 /lib64/ libc-2.12.so pbs_sched 14768 root mem REG 8,98 65928 3277205 /lib64/ libnss_files-2.12.so pbs_sched 14768 root mem REG 8,98 791107 3418524 /usr/local/lib/libtorque.so.2.0.0 pbs_sched 14768 root 0r CHR 1,3 0t0 3772 /dev/null pbs_sched 14768 root 1w REG 8,98 0 6033331 /var/spool/torque/sched_priv/sched_out pbs_sched 14768 root 2w REG 8,98 0 6033331 /var/spool/torque/sched_priv/sched_out pbs_sched 14768 root 3w REG 8,98 2699 6033359 /var/spool/torque/sched_logs/20120217 pbs_sched 14768 root 4u IPv4 801882953 0t0 TCP 192.94.173.9:15004 (LISTEN) pbs_sched 14768 root 5wW REG 8,98 7 6033329 /var/spool/torque/sched_priv/sched.lock pbs_sched 14768 root 6r REG 8,98 4374 6032952 /var/spool/torque/sched_priv/resource_group pbs_sched 14768 root 7w REG 8,98 0 6033360 /var/spool/torque/sched_priv/accounting/20120217 [root at wings torque]# ls aux checkpoint job_logs mom_logs mom_priv pbs_environment sched_logs sched_priv server_logs server_name server_priv spool undelivered [root at wings torque]# cd sched_priv [root at wings sched_priv]# ls accounting dedicated_time holidays resource_group sched_config sched.lock sched_out [root at wings sched_priv]# more sched_config When I used hostname to change the name to the admin.default.domain, and restarted the pbs_sched daemon, everything started working. Any idea how to change the hostname/IP/interface that the scheduler uses? Thanks, Christina -- Christina A. 
Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120217/ec817a79/attachment-0001.html

From djohnson at osc.edu Fri Feb 17 15:12:13 2012
From: djohnson at osc.edu (Doug Johnson)
Date: Fri, 17 Feb 2012 17:12:13 -0500
Subject: [torqueusers] Performance of non-GPU codes on GPU nodes reduced by nvidia-smi overhead
In-Reply-To:
References:
Message-ID:

At Fri, 17 Feb 2012 13:10:41 -0700,
David Beer wrote:
>
> Doug,
>
> I have created a ticket for our documentation team to note that the TDK is where nvml.h can be found.
>
> We also thank you for the patch. I believe there is some more work that needs to be done beyond just
> this change, but we will look to get those done very soon. I think it would be ideal to allow people to
> use the same binary for both GPU enabled and non-GPU enabled nodes.
>

Yeah, conceptual versus actually working. There's no really proper way to do this while the GPU code is inline behind many ifdefs. Only a surprisingly small number of entry points need to be modified. See the attached patch; it allows an NVML-enabled build to run on either a GPU or a non-GPU node. If all the GPU routines were moved into their own file, this could be done cleanly and without a lot of effort.

Doug

PS. Caveat emptor with the patch: I've run two jobs on the nodes, so it's not exactly thought out.

PPS. Should move thread to torque devel, sorry.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: no_gpu.diff
Type: application/octet-stream
Size: 3426 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120217/6dcea759/attachment.obj
-------------- next part --------------

From jjc at iastate.edu Mon Feb 20 09:43:21 2012
From: jjc at iastate.edu (Coyle, James J [ITACD])
Date: Mon, 20 Feb 2012 16:43:21 +0000
Subject: [torqueusers] Scheduler bound to ETHO IP port
In-Reply-To:
References:
Message-ID: <242421BFAF465844BE24EB90BB97E2210197741C@ITSDAG1D.its.iastate.edu>

Christina,

I think that it is common to use two interfaces on the login node, one inward facing on a private subnet and one outward facing, and to place the internal interface name in /var/spool/torque/server_name.

What I always do is use /etc/hosts and insert a line like:

172.16.10.1 loginnode admin admin.default.domain

and copy /etc/hosts to all the compute nodes. You will also want to make sure that files precedes dns on the hosts line in /etc/nsswitch.conf.

Then I can use the internal name.

- Jim C.

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Christina Salls
Sent: Friday, February 17, 2012 3:08 PM
To: Torque Users Mailing List; Michael Saxon; Frank Indiviglio; Craig Tierney; help >> GLERL IT Help; Jeff Hanson; Brian Beagan; John Cardenas
Subject: [torqueusers] Scheduler bound to ETHO IP port

Hi all,

I have been experiencing a problem with jobs staying in my default queue until I force execution with a qrun. It turns out that the reason is that my torque server is configured on my second ethernet interface which is connected to my compute nodes. The problem is that the scheduler is bound to the 1st interface port.

[root at wings server_logs]# ps -ef | grep pbs
root      1268     1  0 13:56 ?        00:00:00 /usr/local/sbin/pbs_server -d /var/spool/torque -H admin.default.domain
root     14768     1  0 14:25 ?
00:00:00 /usr/local/sbin/pbs_sched -d /var/spool/torque root 21956 16623 0 14:41 pts/25 00:00:00 grep pbs [root at wings server_logs]# lsof -p 14768 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME pbs_sched 14768 root cwd DIR 8,98 4096 6032970 /var/spool/torque/sched_priv pbs_sched 14768 root rtd DIR 8,98 4096 2 / pbs_sched 14768 root txt REG 8,98 268782 3421344 /usr/local/sbin/pbs_sched pbs_sched 14768 root mem REG 8,98 156872 3276802 /lib64/ld-2.12.so pbs_sched 14768 root mem REG 8,98 1979000 3276803 /lib64/libc-2.12.so pbs_sched 14768 root mem REG 8,98 65928 3277205 /lib64/libnss_files-2.12.so pbs_sched 14768 root mem REG 8,98 791107 3418524 /usr/local/lib/libtorque.so.2.0.0 pbs_sched 14768 root 0r CHR 1,3 0t0 3772 /dev/null pbs_sched 14768 root 1w REG 8,98 0 6033331 /var/spool/torque/sched_priv/sched_out pbs_sched 14768 root 2w REG 8,98 0 6033331 /var/spool/torque/sched_priv/sched_out pbs_sched 14768 root 3w REG 8,98 2699 6033359 /var/spool/torque/sched_logs/20120217 pbs_sched 14768 root 4u IPv4 801882953 0t0 TCP wings.glerl.noaa.gov:15004 (LISTEN) pbs_sched 14768 root 5wW REG 8,98 7 6033329 /var/spool/torque/sched_priv/sched.lock pbs_sched 14768 root 6r REG 8,98 4374 6032952 /var/spool/torque/sched_priv/resource_group pbs_sched 14768 root 7w REG 8,98 0 6033360 /var/spool/torque/sched_priv/accounting/20120217 [root at wings server_logs]# cd .. 
[root at wings torque]# ls aux checkpoint job_logs mom_logs mom_priv pbs_environment sched_logs sched_priv server_logs server_name server_priv spool undelivered [root at wings torque]# lsof -n -p 14768 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME pbs_sched 14768 root cwd DIR 8,98 4096 6032970 /var/spool/torque/sched_priv pbs_sched 14768 root rtd DIR 8,98 4096 2 / pbs_sched 14768 root txt REG 8,98 268782 3421344 /usr/local/sbin/pbs_sched pbs_sched 14768 root mem REG 8,98 156872 3276802 /lib64/ld-2.12.so pbs_sched 14768 root mem REG 8,98 1979000 3276803 /lib64/libc-2.12.so pbs_sched 14768 root mem REG 8,98 65928 3277205 /lib64/libnss_files-2.12.so pbs_sched 14768 root mem REG 8,98 791107 3418524 /usr/local/lib/libtorque.so.2.0.0 pbs_sched 14768 root 0r CHR 1,3 0t0 3772 /dev/null pbs_sched 14768 root 1w REG 8,98 0 6033331 /var/spool/torque/sched_priv/sched_out pbs_sched 14768 root 2w REG 8,98 0 6033331 /var/spool/torque/sched_priv/sched_out pbs_sched 14768 root 3w REG 8,98 2699 6033359 /var/spool/torque/sched_logs/20120217 pbs_sched 14768 root 4u IPv4 801882953 0t0 TCP 192.94.173.9:15004 (LISTEN) pbs_sched 14768 root 5wW REG 8,98 7 6033329 /var/spool/torque/sched_priv/sched.lock pbs_sched 14768 root 6r REG 8,98 4374 6032952 /var/spool/torque/sched_priv/resource_group pbs_sched 14768 root 7w REG 8,98 0 6033360 /var/spool/torque/sched_priv/accounting/20120217 [root at wings torque]# ls aux checkpoint job_logs mom_logs mom_priv pbs_environment sched_logs sched_priv server_logs server_name server_priv spool undelivered [root at wings torque]# cd sched_priv [root at wings sched_priv]# ls accounting dedicated_time holidays resource_group sched_config sched.lock sched_out [root at wings sched_priv]# more sched_config When I used hostname to change the name to the admin.default.domain, and restarted the pbs_sched daemon, everything started working. Any idea how to change the hostname/IP/interface that the scheduler uses? Thanks, Christina -- Christina A. 
Salls GLERL Computer Group help.glerl at noaa.gov Help Desk x2127 Christina.Salls at noaa.gov Voice Mail 734-741-2446 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120220/359255cd/attachment-0001.html From cholam20 at yahoo.co.in Tue Feb 21 05:05:39 2012 From: cholam20 at yahoo.co.in (revathi ganesh) Date: Tue, 21 Feb 2012 17:35:39 +0530 (IST) Subject: [torqueusers] Fwd: no regrets after doing this venture... Message-ID: <1329825939.33728.androidMobile@web137304.mail.in.yahoo.com>


-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120221/171cc9b9/attachment.html

From rmckay at adaptivecomputing.com Fri Feb 17 15:54:01 2012
From: rmckay at adaptivecomputing.com (Rick McKay)
Date: Fri, 17 Feb 2012 15:54:01 -0700 (MST)
Subject: [torqueusers] Scheduler bound to ETHO IP port
In-Reply-To:
Message-ID: <654f80c7-2933-404e-9674-935553843683@mail>

Christina,

I think you're looking for this:

From 2.5.9 CHANGELOG file:

e - Added new option to torque.cfg name TRQ_IFNAME. This allows the user to designate a preferred outbound interface for TORQUE requests. The interface is the name of the NIC interface, for example eth0.

Reference that parameter and QSUBHOST in Appendix K.

--Rick

Rick McKay | Technical Support Engineer
rmckay at adaptivecomputing.com
Direct: (801) 717-3395 | Toll free: 1-888-221-2008 x3395
Adaptive Computing | www.adaptivecomputing.com

----- Original Message -----
From: "Christina Salls"
To: "Torque Users Mailing List", "Michael Saxon", "Frank Indiviglio", "Craig Tierney", "help >> GLERL IT Help", "Jeff Hanson", "Brian Beagan", "John Cardenas"
Sent: Friday, February 17, 2012 2:07:47 PM
Subject: [torqueusers] Scheduler bound to ETHO IP port

Hi all,

I have been experiencing a problem with jobs staying in my default queue until I force execution with a qrun. It turns out that the reason is that my torque server is configured on my second ethernet interface which is connected to my compute nodes. The problem is that the scheduler is bound to the 1st interface port.

[root at wings server_logs]# ps -ef | grep pbs
root      1268     1  0 13:56 ?        00:00:00 /usr/local/sbin/pbs_server -d /var/spool/torque -H admin.default.domain
root     14768     1  0 14:25 ?
00:00:00 /usr/local/sbin/pbs_sched -d /var/spool/torque root 21956 16623 0 14:41 pts/25 00:00:00 grep pbs [root at wings server_logs]# lsof -p 14768 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME pbs_sched 14768 root cwd DIR 8,98 4096 6032970 /var/spool/torque/sched_priv pbs_sched 14768 root rtd DIR 8,98 4096 2 / pbs_sched 14768 root txt REG 8,98 268782 3421344 /usr/local/sbin/pbs_sched pbs_sched 14768 root mem REG 8,98 156872 3276802 /lib64/ ld-2.12.so pbs_sched 14768 root mem REG 8,98 1979000 3276803 /lib64/ libc-2.12.so pbs_sched 14768 root mem REG 8,98 65928 3277205 /lib64/ libnss_files-2.12.so pbs_sched 14768 root mem REG 8,98 791107 3418524 /usr/local/lib/libtorque.so.2.0.0 pbs_sched 14768 root 0r CHR 1,3 0t0 3772 /dev/null pbs_sched 14768 root 1w REG 8,98 0 6033331 /var/spool/torque/sched_priv/sched_out pbs_sched 14768 root 2w REG 8,98 0 6033331 /var/spool/torque/sched_priv/sched_out pbs_sched 14768 root 3w REG 8,98 2699 6033359 /var/spool/torque/sched_logs/20120217 pbs_sched 14768 root 4u IPv4 801882953 0t0 TCP wings.glerl.noaa.gov:15004 (LISTEN) pbs_sched 14768 root 5wW REG 8,98 7 6033329 /var/spool/torque/sched_priv/sched.lock pbs_sched 14768 root 6r REG 8,98 4374 6032952 /var/spool/torque/sched_priv/resource_group pbs_sched 14768 root 7w REG 8,98 0 6033360 /var/spool/torque/sched_priv/accounting/20120217 [root at wings server_logs]# cd .. 
[root at wings torque]# ls
aux  checkpoint  job_logs  mom_logs  mom_priv  pbs_environment  sched_logs  sched_priv  server_logs  server_name  server_priv  spool  undelivered
[root at wings torque]# lsof -n -p 14768
COMMAND     PID USER   FD   TYPE    DEVICE SIZE/OFF    NODE NAME
pbs_sched 14768 root  cwd    DIR      8,98     4096 6032970 /var/spool/torque/sched_priv
pbs_sched 14768 root  rtd    DIR      8,98     4096       2 /
pbs_sched 14768 root  txt    REG      8,98   268782 3421344 /usr/local/sbin/pbs_sched
pbs_sched 14768 root  mem    REG      8,98   156872 3276802 /lib64/ld-2.12.so
pbs_sched 14768 root  mem    REG      8,98  1979000 3276803 /lib64/libc-2.12.so
pbs_sched 14768 root  mem    REG      8,98    65928 3277205 /lib64/libnss_files-2.12.so
pbs_sched 14768 root  mem    REG      8,98   791107 3418524 /usr/local/lib/libtorque.so.2.0.0
pbs_sched 14768 root    0r   CHR       1,3      0t0    3772 /dev/null
pbs_sched 14768 root    1w   REG      8,98        0 6033331 /var/spool/torque/sched_priv/sched_out
pbs_sched 14768 root    2w   REG      8,98        0 6033331 /var/spool/torque/sched_priv/sched_out
pbs_sched 14768 root    3w   REG      8,98     2699 6033359 /var/spool/torque/sched_logs/20120217
pbs_sched 14768 root    4u  IPv4 801882953      0t0     TCP 192.94.173.9:15004 (LISTEN)
pbs_sched 14768 root    5wW  REG      8,98        7 6033329 /var/spool/torque/sched_priv/sched.lock
pbs_sched 14768 root    6r   REG      8,98     4374 6032952 /var/spool/torque/sched_priv/resource_group
pbs_sched 14768 root    7w   REG      8,98        0 6033360 /var/spool/torque/sched_priv/accounting/20120217
[root at wings torque]# ls
aux  checkpoint  job_logs  mom_logs  mom_priv  pbs_environment  sched_logs  sched_priv  server_logs  server_name  server_priv  spool  undelivered
[root at wings torque]# cd sched_priv
[root at wings sched_priv]# ls
accounting  dedicated_time  holidays  resource_group  sched_config  sched.lock  sched_out
[root at wings sched_priv]# more sched_config

When I used hostname to change the name to admin.default.domain and
restarted the pbs_sched daemon, everything started working.

Any idea how to change the hostname/IP/interface that the scheduler uses?

Thanks,

Christina

--
Christina A. Salls
GLERL Computer Group
help.glerl at noaa.gov
Help Desk x2127
Christina.Salls at noaa.gov
Voice Mail 734-741-2446

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

From l.flis at cyf-kr.edu.pl Wed Feb 22 05:19:55 2012
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Wed, 22 Feb 2012 13:19:55 +0100
Subject: [torqueusers] torque 2.5.10 - interactive jobs startup
Message-ID: <4F44DD6B.3000209@cyf-kr.edu.pl>

Hi,

Torque: 2.5.10

On busy clusters where many jobs share the same node, we observe that
some interactive jobs get interrupted during startup. To the user, the
problem shows up as this message:

qsub: Job apparently deleted

The corresponding pbs_mom log entry indicates an interrupted system call
during read() on a socket in function rcvttype:

Feb 19 11:00:35 n6-2-32.local pbs_mom[]: LOG_ERROR::Interrupted system call (4) in TMomFinalizeChild, cannot get termtype

It looks like the read is interrupted by SIGCHLD or SIGALRM (pbs_mom
defines a 5 second limit for rcvtermtype() to return, which might not be
long enough on busy systems).

I didn't have time to write the fix for it, but it should be trivial.

Cheers
--
Lukasz Flis

From dbeer at adaptivecomputing.com Wed Feb 22 11:17:03 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 22 Feb 2012 11:17:03 -0700
Subject: [torqueusers] TORQUE 4.0.0 Update
Message-ID:

All,

Thanks for all of the help with beta testing and for the input that
assisted us in preparing TORQUE 4.0 for release. Just to keep everyone
updated on the progress, we want to let you all know that QA has approved
TORQUE, and everyone should look for the official release announcement,
together with the Moab HPC Suite 7.0, on March 13th.
--
David Beer | Software Engineer
Adaptive Computing

From sm4082 at nyu.edu Wed Feb 22 11:22:16 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Wed, 22 Feb 2012 13:22:16 -0500
Subject: [torqueusers] PBS_NODEFILE: Undefined variable.
Message-ID:

Hi,

Recently, I upgraded Torque to version 2.5.10. Since then we have been
seeing this error: "PBS_NODEFILE: Undefined variable.". If we restart
pbs_mom, everything works fine. Does anyone have any idea what's causing
this behavior?

Please let me know if you need any information that could help in
figuring out the problem.

Thanks,
Sreedhar.

From knielson at adaptivecomputing.com Wed Feb 22 12:58:10 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Wed, 22 Feb 2012 12:58:10 -0700 (MST)
Subject: [torqueusers] PBS_NODEFILE: Undefined variable.
In-Reply-To:
Message-ID: <8f3783b1-33f8-4ee2-bd28-3cc3f016957c@mail>

----- Original Message -----
> From: "Sreedhar Manchu"
> To: "Torque Users Mailing List"
> Sent: Wednesday, February 22, 2012 11:22:16 AM
> Subject: [torqueusers] PBS_NODEFILE: Undefined variable.
>
> Hi,
>
> Recently, I have upgraded Torque to it's 2.5.10 version. Since then
> we have been seeing this error "PBS_NODEFILE: Undefined variable.".
> If we restart pbs mom then everything works fine. Does anyone have
> any idea what's causing this behavior?
>
> Please let me know if you need any information that could help in
> figuring out the problem.
>
> Thanks,
> Sreedhar.

Sreedhar,

Is the error showing up at the console or in the log file?
Ken

From sm4082 at nyu.edu Wed Feb 22 20:14:00 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Wed, 22 Feb 2012 22:14:00 -0500
Subject: [torqueusers] PBS_NODEFILE: Undefined variable.
In-Reply-To: <8f3783b1-33f8-4ee2-bd28-3cc3f016957c@mail>
References: <8f3783b1-33f8-4ee2-bd28-3cc3f016957c@mail>
Message-ID: <77FB6ECB-DFC6-482E-859F-0135D8B1CD1D@nyu.edu>

Hi Ken,

First, thank you for your response. We see the problem with MPI jobs.
When people use the PBS_NODEFILE variable to define the hosts for mpiexec,
we see this error. It doesn't happen all the time. I ran some simple test
jobs to check whether the variable is visible on some nodes, and it was
fine. The problem is that it happens randomly. For whatever reason, the
PBS_NODEFILE variable is not getting initialized or defined in the job
environment.

Whenever it happened we restarted pbs_moms and it was ok.
At this point I'm not sure whether it's happening repeatedly on the same
nodes. Now I'm waiting to see whether it happens on the same nodes again.

We see this error in .err files whenever people try to access
PBS_NODEFILE. If we restart pbs_mom once the job fails with this error,
another job with the same script works fine without any errors.

For example, one user runs his job, and at the end of his script he
submits another job with this statement:

    ssh -x login-0-0 "cd /home ; qsub run-openmpi.sh"

We don't allow users to submit jobs from compute nodes. We have a submit
filter in place on login nodes. This user effectively resubmits the same
job from the running job. This means it ran well the first time, but
failed the next time because of the above-mentioned error. Another user
gets the exact error in the subject line when she tries to access
PBS_NODEFILE as var=$PBS_NODEFILE.

The first user tries to run his job with this statement:

    /share/apps/openmpi/1.4.3/intel/bin/mpiexec \
        -n 144 -hostfile $PBS_NODEFILE \
        env OMP_NUM_THREADS=1 \
        /home/mitgcmuv

We see this error in the .err file. As you can see, since $PBS_NODEFILE
is not defined, mpiexec takes env as the hostfile and fails.

--------------------------------------------------------------------------
Open RTE was unable to open the hostfile:
    env
Check to make sure the path and filename are correct.
--------------------------------------------------------------------------

Once we restarted pbs_mom, the same script worked fine. I'm not sure
what's causing this. I don't see anything wrong in either the torque logs
or the syslogs.

Today I took all nodes offline and plan to restart pbs_mom on all of them,
hoping this will fix the issue for good, though I doubt it will. As we're
not sure whether it is happening on the same nodes, restarting all nodes
might give us an idea on this as well.

Please let us know if you have any thoughts on what might be happening in
our case.

Thank you once again for your response and time.

Regards,
Sreedhar.
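
The failure mode described above (an empty $PBS_NODEFILE causing mpiexec to treat the next word as the hostfile) could be caught earlier with a guard at the top of the job script. The following is only a sketch under stated assumptions: it assumes a bash job script, and require_nodefile is a hypothetical helper name, not something TORQUE or Open MPI provides. It does not address why pbs_mom fails to set the variable.

```shell
#!/bin/bash
# Hypothetical helper (not part of TORQUE): stop early with a clear
# message when PBS_NODEFILE is missing or does not point to a readable file.
require_nodefile() {
    # "${PBS_NODEFILE:-}" expands to an empty string when the variable is unset.
    if [ -z "${PBS_NODEFILE:-}" ]; then
        echo "PBS_NODEFILE is not set in the job environment" >&2
        return 1
    fi
    if [ ! -r "$PBS_NODEFILE" ]; then
        echo "PBS_NODEFILE is not a readable file: $PBS_NODEFILE" >&2
        return 1
    fi
    return 0
}

# Intended use inside a job script (left commented so the sketch has no side effects):
#   require_nodefile || exit 1
#   mpiexec -n 144 -hostfile "$PBS_NODEFILE" env OMP_NUM_THREADS=1 /home/mitgcmuv
```

Quoting "$PBS_NODEFILE" on the mpiexec line also means an empty value is passed as an empty argument instead of letting env slide into the hostfile position.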
On Feb 22, 2012, at 2:58 PM, Ken Nielson wrote:

> ----- Original Message -----
>> From: "Sreedhar Manchu"
>> To: "Torque Users Mailing List"
>> Sent: Wednesday, February 22, 2012 11:22:16 AM
>> Subject: [torqueusers] PBS_NODEFILE: Undefined variable.
>>
>> Hi,
>>
>> Recently, I have upgraded Torque to it's 2.5.10 version. Since then
>> we have been seeing this error "PBS_NODEFILE: Undefined variable.".
>> If we restart pbs mom then everything works fine. Does anyone have
>> any idea what's causing this behavior?
>>
>> Please let me know if you need any information that could help in
>> figuring out the problem.
>>
>> Thanks,
>> Sreedhar.
>
> Sreedhar,
>
> Is the error showing up at the console or in the log file?
>
> Ken
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

---
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110

From sm4082 at nyu.edu Thu Feb 23 07:35:47 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Thu, 23 Feb 2012 09:35:47 -0500
Subject: [torqueusers] PBS_NODEFILE: Undefined variable.
In-Reply-To: <8f3783b1-33f8-4ee2-bd28-3cc3f016957c@mail>
References: <8f3783b1-33f8-4ee2-bd28-3cc3f016957c@mail>
Message-ID:

Hi Ken,

This morning I did a few more test runs and found that it's the #PBS -V
directive in the script that's causing problems. Whenever I include it, I
see this error. I think it's trying to look for this variable in the
user's exported environment rather than in the variables defined by
Torque itself. Everything works fine if I take -V out of the script. But
the problem is that some users need it in their scripts to export not only
user-defined variables but also system variables such as LD_LIBRARY_PATH.
Is there a way to fix it?

Now I'll do some more tests to see whether we have the same problem with
other PBS variables such as PBS_O_WORKDIR.

Please let me know if you have any suggestions.
I will send another email once I do more testing.

Thanks
Sreedhar.

--
Sent from my phone. Please excuse my brevity and any typos.

On Feb 22, 2012, at 14:58, Ken Nielson wrote:

> ----- Original Message -----
>> From: "Sreedhar Manchu"
>> To: "Torque Users Mailing List"
>> Sent: Wednesday, February 22, 2012 11:22:16 AM
>> Subject: [torqueusers] PBS_NODEFILE: Undefined variable.
>>
>> Hi,
>>
>> Recently, I have upgraded Torque to it's 2.5.10 version. Since then
>> we have been seeing this error "PBS_NODEFILE: Undefined variable.".
>> If we restart pbs mom then everything works fine. Does anyone have
>> any idea what's causing this behavior?
>>
>> Please let me know if you need any information that could help in
>> figuring out the problem.
>>
>> Thanks,
>> Sreedhar.
>
> Sreedhar,
>
> Is the error showing up at the console or in the log file?
>
> Ken
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From sm4082 at nyu.edu Thu Feb 23 10:02:42 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Thu, 23 Feb 2012 12:02:42 -0500
Subject: [torqueusers] PBS_NODEFILE: Undefined variable.
In-Reply-To: <77FB6ECB-DFC6-482E-859F-0135D8B1CD1D@nyu.edu>
References: <8f3783b1-33f8-4ee2-bd28-3cc3f016957c@mail> <77FB6ECB-DFC6-482E-859F-0135D8B1CD1D@nyu.edu>
Message-ID: <07B91FF0-9FC6-491D-AB7E-DAFA8C429119@nyu.edu>

Ken,

I have restarted pbs_mom and now it works with #PBS -V in the script. Not
sure what exactly is happening.

Sreedhar.

On Feb 22, 2012, at 10:14 PM, Sreedhar Manchu wrote:

> Hi Ken,
>
> First, thank you for your response. We see the problem with mpi jobs.
> When people use PBS_NODEFILE variable to define the hosts for mpiexec to
> run we are seeing this error. It doesn't happen all the time. I tried to
> test some simple jobs to see whether I can see this variable on some
> nodes and it was ok. Problem is that it's happening randomly.
For whatever reasons this variable PBS_NODEFILE is getting initialized or defined in the job environment. > > Whenever it happened we restarted pbs_moms and it was ok. At this point I'm not sure whether it's happening repeatedly on the same nodes. Now I'm waiting to see whether it happens on the same nodes again. > > We see this error in .err files whenever people try to access PBS_NODEFILE. If we restart pbs_mom once the job fails with this error, another job with the same script works fine with out any errors. > > For example, one user runs his job and at the end of his script he submits another job with this statement. > > ssh -x login-0-0 "cd /home ; qsub run-openmpi.sh" > > We don't allow users to submit jobs from compute nodes. We have a submit filter in place on login nodes. This user pretty much submits the same job from the running job. Which means it ran well first time, but failed to do so because of above mentioned error. Another user gets exact error as in the subject when she tried to access PBS_NODEFILE as var=$PBS_NODEFILE. > > First user tries to run his job with this statement. > > /share/apps/openmpi/1.4.3/intel/bin/mpiexec \ > -n 144 -hostfile $PBS_NODEFILE \ > env OMP_NUM_THREADS=1 \ > /home/mitgcmuv > > > We see this error in the .err file. As you can see, since it couldn't find the $PBS_NODEFILE it thinks of env as a hostfile and fails. > -------------------------------------------------------------------------- > Open RTE was unable to open the hostfile: > env > Check to make sure the path and filename are correct. > -------------------------------------------------------------------------- > > Once we restarted the pbs_mom same script worked fine. I'm not sure what's causing this. I don't see anything wrong either in torque logs or syslogs. > > Today I made all nodes offline and plan to restart pbs_mom on all nodes hoping this would fix the issue forever, even though I doubt it might not. 
As we're not sure whether it is happening on the same nodes, restarting all nodes might give us an idea on this as well. > > Please let us know if you have any thoughts on what might be happening with our case. > > Thank you once again for your response and time. > > Regards, > Sreedhar. > > On Feb 22, 2012, at 2:58 PM, Ken Nielson wrote: > >> ----- Original Message ----- >>> From: "Sreedhar Manchu" >>> To: "Torque Users Mailing List" >>> Sent: Wednesday, February 22, 2012 11:22:16 AM >>> Subject: [torqueusers] PBS_NODEFILE: Undefined variable. >>> >>> Hi, >>> >>> Recently, I have upgraded Torque to it's 2.5.10 version. Since then >>> we have been seeing this error "PBS_NODEFILE: Undefined variable.". >>> If we restart pbs mom then everything works fine. Does anyone have >>> any idea what's causing this behavior? >>> >>> Please let me know if you need any information that could help in >>> figuring out the problem. >>> >>> Thanks, >>> Sreedhar. >> >> Sreedhar, >> >> Is the error showing up at the console or in the log file? >> >> Ken >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > > --- > Sreedhar Manchu > HPC Support Specialist > New York University > 251 Mercer Street > New York, NY 10012-1110 > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From lance at quantumbioinc.com Thu Feb 23 10:05:11 2012 From: lance at quantumbioinc.com (Lance Westerhoff) Date: Thu, 23 Feb 2012 12:05:11 -0500 Subject: [torqueusers] procs= not working as documented (or understood?) 
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74E33@exvic-mbx04.nexus.csiro.au>
References: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE@quantumbioinc.com> <48B32496-0D68-4355-8FA8-65A71CFEB7BB@quantumbioinc.com> <007DECE986B47F4EABF823C1FBB19C620102CDD74E33@exvic-mbx04.nexus.csiro.au>
Message-ID: <15C56CC3-B801-4000-96A6-B778EF0D9384@quantumbioinc.com>

Hi Gareth-

I've tried with and without JOBNODEMATCHPOLICY set, and still no luck.

Whenever I try the following request, I get the error shown, even though
I have 144 processors in the cluster. So somehow torque/maui is seeing my
nodes as collections of 8 processors.

$ qsub -I -lnodes=46
qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes

I found someone else who has set up their user documentation to show
precisely the setting that I want, so it sounds like it can be done.
Specifically, as noted on the following link, "#PBS -l nodes=1 asks PBS
for one CPU. This means that when your jobs starts you will have exclusive
access to one CPU. But if you want something like 4 nodes each with
exactly 2 CPUs (total of 8 CPUs), then you would use something like -l
nodes=4:ppn=2. Instead, if you just want any 8 CPUs in the cluster, you
would request like just -l nodes=8."
http://bmi.cchmc.org/resources/software/torque/examples

But I can't find a lot in the email archive to help. Part of the problem
is finding the right terms to search for, though, so I'll keep searching.

Thanks for any further information that can be provided.

-Lance

On Feb 14, 2012, at 6:27 PM, wrote:

>> -----Original Message-----
>> From: Lance Westerhoff [mailto:lance at quantumbioinc.com]
>> Sent: Wednesday, 15 February 2012 5:12 AM
>> To: Torque Users Mailing List
>> Subject: Re: [torqueusers] procs= not working as documented (or
>> understood?)
>>
>>
>> Hello All-
>>
>> We're still having trouble with this feature, and we are starting to
>> shop around for a torque/maui replacement in order to be able to use
>> it.
Before we do that however, I wanted to see if anyone has any >> thoughts on how to address the problem within torque/maui. Perhaps I >> simply don't understand the feature. The versions of torque and maui we >> are using are. >> >> torque-3.0.2 >> maui-3.2.6p21 >> >> Yes, we have tried newer versions of maui, but then the option doesn't >> work at all. >> >> Here is the scenario (I also included the conversation from November >> below for more information). >> >> Conceptually, our software is almost infinitely scalable in the sense >> that there is very little overhead associated with interprocess >> communication. Therefore, we do not require that all of the processes >> reside on a small number of nodes. In fact, we can stretch the >> processors to any and all nodes in the cluster with ~zero loss in >> performance. So we can literally have one node that has a single >> process running and another node that has 8 processes running. Since we >> have that level of scalability, we don't want to have to lock ourselves >> into having to request resources using the "nodes=X:ppn=Y" style since >> this style requires that nodes open up or drain in order to use them. >> Since our users have a big mixture of single and multi-processor jobs, >> waiting for node drain can really waste a lot of resources. >> >> I saw the "procs=#" the Requesting Resources table (see >> http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml#resou >> rces for more). It *appears* that this option should be able to allow >> the user to request simply X*Y processors and the scheduler should be >> able to schedule them any way it can fit. So using the following #PBS >> note, we should be able to request 40 processors: >> >> #PBS -l procs=40 >> >> Instead, we see that the scheduler seems to take this information, read >> it, and basically disregard it. 
The reason I know it reads it is >> because if I ask for say 40 processors and 40 processors are available >> in the cluster, it works as expected and all is right with the world. >> Where it gets a bit more choppy is when I ask for 40 processors and >> only 1 processor is available. The job doesn't wait in the queue for >> the remaining 39 processors to open up, and instead PBS simply just >> starts the job on that processor. I can't see how that is anything but >> a bug. If the user is asking for 40 processors, why isn't the scheduler >> waiting for all 40 processors to open up? >> >> I'll also post this to the maui list so I apologize if you receive it >> twice. I'm just not sure if this is a problem with torque, maui, or >> both. If answering this question will require additional information, >> please ask. We are at our wits end here. >> >> Thanks! >> >> -Lance > > Hi Lance, > > It is more-or-less equivalent to request -l nodes=40 and -l procs=40 _if_ > you are using maui _and_ you don't set JOBNODEMATCHPOLICY to EXACTNODE > > You may need to 'fake' having a large number of nodes to make this work. > > There are old mailing list items describing such a setup and how to 'fake' > the number of nodes. We've never actually done this at our site so I'm > unsure on details. I don't like it :-) but it may suit/help you. > > Gareth > > >> >> >> >> >> On Nov 18, 2011, at 11:12 AM, Lance Westerhoff wrote: >> >>> >>> Hi Steve- >>> >>> Here you go. Here is the top few lines of the job script. I have then >> provided the output you requested long with the maui.cfg. If you need >> anything further, certainly please let me know. >>> >>> Thanks for your help! 
>>> >>> =============== >>> >>> + head job.pbs >>> >>> #!/bin/bash >>> #PBS -S /bin/bash >>> #PBS -l procs=100 >>> #PBS -l pmem=700mb >>> #PBS -l walltime=744:00:00 >>> #PBS -j oe >>> #PBS -q batch >>> >>> Report run on Fri Nov 18 10:49:38 EST 2011 >>> + pbsnodes --version >>> version: 3.0.2 >>> + diagnose --version >>> maui client version 3.2.6p21 >>> + checkjob 371010 >>> >>> >>> checking job 371010 >>> >>> State: Running >>> Creds: user:josh group:games class:batch qos:DEFAULT >>> WallTime: 00:02:35 of 31:00:00:00 >>> SubmitTime: Fri Nov 18 10:46:33 >>> (Time Queued Total: 00:00:01 Eligible: 00:00:01) >>> >>> StartTime: Fri Nov 18 10:46:34 >>> Total Tasks: 1 >>> >>> Req[0] TaskCount: 26 Partition: DEFAULT >>> Network: [NONE] Memory >= 700M Disk >= 0 Swap >= 0 >>> Opsys: [NONE] Arch: [NONE] Features: [NONE] >>> Dedicated Resources Per Task: PROCS: 1 MEM: 700M >>> NodeCount: 10 >>> Allocated Nodes: >>> [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3] >>> [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2] >>> [compute-0-13:2][compute-0-14:2] >>> >>> >>> IWD: [NONE] Executable: [NONE] >>> Bypass: 0 StartCount: 1 >>> PartitionMask: [ALL] >>> Flags: RESTARTABLE >>> >>> Reservation '371010' (-00:02:09 -> 30:23:57:51 Duration: >> 31:00:00:00) >>> PE: 26.00 StartPriority: 4716 >>> >>> + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]" >>> SERVERHOST gondor >>> ADMIN1 maui root >>> ADMIN3 ALL >>> RMCFG[base] TYPE=PBS >>> AMCFG[bank] TYPE=NONE >>> RMPOLLINTERVAL 00:01:00 >>> SERVERPORT 42559 >>> SERVERMODE NORMAL >>> LOGFILE maui.log >>> LOGFILEMAXSIZE 10000000 >>> LOGLEVEL 3 >>> QUEUETIMEWEIGHT 1 >>> FSPOLICY DEDICATEDPS >>> FSDEPTH 7 >>> FSINTERVAL 86400 >>> FSDECAY 0.50 >>> FSWEIGHT 200 >>> FSUSERWEIGHT 1 >>> FSGROUPWEIGHT 1000 >>> FSQOSWEIGHT 1000 >>> FSACCOUNTWEIGHT 1 >>> FSCLASSWEIGHT 1000 >>> USERWEIGHT 4 >>> BACKFILLPOLICY FIRSTFIT >>> RESERVATIONPOLICY CURRENTHIGHEST >>> NODEALLOCATIONPOLICY MINRESOURCE >>> RESERVATIONDEPTH 8 >>> 
MAXJOBPERUSERPOLICY OFF >>> MAXJOBPERUSERCOUNT 8 >>> MAXPROCPERUSERPOLICY OFF >>> MAXPROCPERUSERCOUNT 256 >>> MAXPROCSECONDPERUSERPOLICY OFF >>> MAXPROCSECONDPERUSERCOUNT 36864000 >>> MAXJOBQUEUEDPERUSERPOLICY OFF >>> MAXJOBQUEUEDPERUSERCOUNT 2 >>> JOBNODEMATCHPOLICY EXACTNODE >>> NODEACCESSPOLICY SHARED >>> JOBMAXOVERRUN 99:00:00:00 >>> DEFERCOUNT 8192 >>> DEFERTIME 0 >>> CLASSCFG[developer] FSTARGET=40.00+ >>> CLASSCFG[lowprio] PRIORITY=-1000 >>> SRCFG[developer] CLASSLIST=developer >>> SRCFG[developer] ACCESS=dedicated >>> SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri >>> SRCFG[developer] STARTTIME=08:00:00 >>> SRCFG[developer] ENDTIME=18:00:00 >>> SRCFG[developer] TIMELIMIT=2:00:00 >>> SRCFG[developer] RESOURCES=PROCS(8) >>> USERCFG[DEFAULT] FSTARGET=100.0 >>> >>> =============== >>> >>> -Lance >>> >>> >>> On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote: >>> >>>> -----BEGIN PGP SIGNED MESSAGE----- >>>> Hash: SHA1 >>>> >>>> >>>> On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: >>>> >>>>> The request that is placed is for procs=60. Both torque and maui >> see that there are only 53 processors available and instead of letting >> the job sit in the queue and wait for all 60 processors to become >> available, it goes ahead and runs the job with what's available. Now if >> the user could ask for procs=[50-60] where 50 is the minimum number of >> processors to provide and 60 is the maximum, this would be a feature. >> But as it stands, if the user asks for 60 processors and ends up with 2 >> processors, the job just won't scale properly and he may as well kill >> it (when it shouldn't have run anyway). >>>> >>>> Hi Lance, >>>> >>>> Can you post the output of checkjob of an incorrectly >> running job. Let's take a look at what Maui thinks the job is asking >> for. >>>> >>>> Might as well add your maui.cfg file also. >>>> >>>> I've found in the past that procs= is troublesome... >>>> >>>>> >>>>> I'm actually beginning to think the problem may be related to maui. 
>> Perhaps I'll post this same question to the maui list and see what >> comes back. >>>>> >>>>> This problem is infuriating though since without the functionality >> working as it should, using procs=X in torque/maui makes torque/maui >> work more like a submission and run system (not a queuing system). >>>> >>>> Agreed. HPC cluster job management is normally be set it and forget >> it. Anything else other than maintenance/break fixes/new features would >> be ridiculously time consuming. >>>> >>>>> >>>>> -Lance >>>>> >>>>> >>>>>> >>>>>> Message: 3 >>>>>> Date: Thu, 17 Nov 2011 17:29:17 -0800 >>>>>> From: "Brock Palen" >>>>>> Subject: Re: [torqueusers] procs= not working as documented >>>>>> To: "Torque Users Mailing List" >>>>>> Message-ID: >> <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> >>>>>> Content-Type: text/plain; charset="utf-8" >>>>>> >>>>>> Does maui only see one cpu or does mpiexec only see one cpu? >>>>>> >>>>>> >>>>>> >>>>>> Brock Palen >>>>>> (734)936-1985 >>>>>> brockp at umich.edu >>>>>> - Sent from my Palm Pre, please excuse typos >>>>>> On Nov 17, 2011 3:19 PM, Lance Westerhoff >> <lance at quantumbioinc.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> Hello All- >>>>>> >>>>>> >>>>>> >>>>>> It appears that when running with the following specs, the procs= >> option does not actually work as expected. >>>>>> >>>>>> >>>>>> >>>>>> ========================================== >>>>>> >>>>>> >>>>>> >>>>>> #PBS -S /bin/bash >>>>>> >>>>>> #PBS -l procs=60 >>>>>> >>>>>> #PBS -l pmem=700mb >>>>>> >>>>>> #PBS -l walltime=744:00:00 >>>>>> >>>>>> #PBS -j oe >>>>>> >>>>>> #PBS -q batch >>>>>> >>>>>> >>>>>> >>>>>> torque version: tried 3.0.2. 
in v2.5.4, I think the procs option >> worked as documented >>>>>> >>>>>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete >> fail in terms of the procs option and it only asks for a single CPU) >>>>>> >>>>>> >>>>>> >>>>>> ========================================== >>>>>> >>>>>> >>>>>> >>>>>> If there are fewer then 60 processors available in the cluster (in >> this case there were 53 available) the job will go in an take whatever >> is left instead of waiting for all 60 processors to free up. Any >> thoughts as to why this might be happening? Sometimes it doesn't really >> matter and 53 would be almost as good as 60, however if only 2 >> processors are available and the user asks for 60, I would hate for him >> to go in. >>>>>> >>>>>> >>>>>> >>>>>> Thank you for your time! >>>>>> >>>>>> >>>>>> >>>>>> -Lance >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> torqueusers mailing list >>>>> torqueusers at supercluster.org >>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>> >>>> ---------------------- >>>> Steve Crusan >>>> System Administrator >>>> Center for Research Computing >>>> University of Rochester >>>> https://www.crc.rochester.edu/ >>>> >>>> >>>> -----BEGIN PGP SIGNATURE----- >>>> Version: GnuPG/MacGPG2 v2.0.17 (Darwin) >>>> Comment: GPGTools - http://gpgtools.org >>>> >>>> iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ >>>> bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 >>>> cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ >>>> tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 >>>> JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv >>>> Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= >>>> =AGW7 >>>> -----END PGP SIGNATURE----- >>>> _______________________________________________ >>>> torqueusers mailing list >>>> torqueusers at supercluster.org >>>> 
http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From lance at quantumbioinc.com Thu Feb 23 10:26:24 2012
From: lance at quantumbioinc.com (Lance Westerhoff)
Date: Thu, 23 Feb 2012 12:26:24 -0500
Subject: [torqueusers] procs= not working as documented (or understood?)
In-Reply-To: <15C56CC3-B801-4000-96A6-B778EF0D9384@quantumbioinc.com>
References: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE@quantumbioinc.com> <48B32496-0D68-4355-8FA8-65A71CFEB7BB@quantumbioinc.com> <007DECE986B47F4EABF823C1FBB19C620102CDD74E33@exvic-mbx04.nexus.csiro.au> <15C56CC3-B801-4000-96A6-B778EF0D9384@quantumbioinc.com>
Message-ID: <5E24CF12-A4DE-4064-A37B-741E495FE705@quantumbioinc.com>

Hi All-

Ok, I think I have it. The problem was the setting (or lack thereof) of
the server resources_available.nodect variable. I set it to the following,
and things seem to be working as expected:

set server resources_available.nodect=144

(where 144 is the number of processors in my cluster)

I have also commented out the following line in maui.cfg.

# JOBNODEMATCHPOLICY EXACTNODE

Thanks to everyone for their help. Hopefully this problem is officially
behind us!

-Lance

On Feb 23, 2012, at 12:05 PM, Lance Westerhoff wrote:

>
> Hi Gareth-
>
> I've tried with and without the JOBNODEMATCHPOLICY set, and still no luck.
>
> Whenever I try the following, I get the following error. That is even
> though I have 144 processors in the cluster. So somehow torque/maui is
> seeing my nodes as collections of 8 processors.
>
> $ qsub -I -lnodes=46
> qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
>
> I found that someone else that has set up their user documentation to
> show precisely the setting that I want, so it sounds like it can be
> done. Specifically, as noted on the following link, "#PBS -l nodes=1
> asks PBS for one CPU.
This means that when your jobs starts you will have exclusive access to one CPU. But if you want something like 4 nodes each with exactly 2 CPUs (total of 8 CPUs), then you would use something like -l nodes=4:ppn=2. Instead, if you just want any 8 CPUs in the cluster, you would request like just -l nodes=8." http://bmi.cchmc.org/resources/software/torque/examples > > But I can't find a lot in the email archive to help. Part of the problem is finding the right terms to look under though, so I'll keep searching. > > Thanks for any further information that can be provided. > > -Lance > > > On Feb 14, 2012, at 6:27 PM, wrote: > >>> -----Original Message----- >>> From: Lance Westerhoff [mailto:lance at quantumbioinc.com] >>> Sent: Wednesday, 15 February 2012 5:12 AM >>> To: Torque Users Mailing List >>> Subject: Re: [torqueusers] procs= not working as documented (or >>> understood?) >>> >>> >>> Hello All- >>> >>> We're still having trouble with this feature, and we are starting to >>> shop around for a torque/maui replacement in order to be able to use >>> it. Before we do that however, I wanted to see if anyone has any >>> thoughts on how to address the problem within torque/maui. Perhaps I >>> simply don't understand the feature. The versions of torque and maui we >>> are using are. >>> >>> torque-3.0.2 >>> maui-3.2.6p21 >>> >>> Yes, we have tried newer versions of maui, but then the option doesn't >>> work at all. >>> >>> Here is the scenario (I also included the conversation from November >>> below for more information). >>> >>> Conceptually, our software is almost infinitely scalable in the sense >>> that there is very little overhead associated with interprocess >>> communication. Therefore, we do not require that all of the processes >>> reside on a small number of nodes. In fact, we can stretch the >>> processors to any and all nodes in the cluster with ~zero loss in >>> performance. 
So we can literally have one node that has a single >>> process running and another node that has 8 processes running. Since we >>> have that level of scalability, we don't want to have to lock ourselves >>> into having to request resources using the "nodes=X:ppn=Y" style since >>> this style requires that nodes open up or drain in order to use them. >>> Since our users have a big mixture of single and multi-processor jobs, >>> waiting for node drain can really waste a lot of resources. >>> >>> I saw the "procs=#" the Requesting Resources table (see >>> http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml#resou >>> rces for more). It *appears* that this option should be able to allow >>> the user to request simply X*Y processors and the scheduler should be >>> able to schedule them any way it can fit. So using the following #PBS >>> note, we should be able to request 40 processors: >>> >>> #PBS -l procs=40 >>> >>> Instead, we see that the scheduler seems to take this information, read >>> it, and basically disregard it. The reason I know it reads it is >>> because if I ask for say 40 processors and 40 processors are available >>> in the cluster, it works as expected and all is right with the world. >>> Where it gets a bit more choppy is when I ask for 40 processors and >>> only 1 processor is available. The job doesn't wait in the queue for >>> the remaining 39 processors to open up, and instead PBS simply just >>> starts the job on that processor. I can't see how that is anything but >>> a bug. If the user is asking for 40 processors, why isn't the scheduler >>> waiting for all 40 processors to open up? >>> >>> I'll also post this to the maui list so I apologize if you receive it >>> twice. I'm just not sure if this is a problem with torque, maui, or >>> both. If answering this question will require additional information, >>> please ask. We are at our wits end here. >>> >>> Thanks! 
>>> >>> -Lance >> >> Hi Lance, >> >> It is more-or-less equivalent to request -l nodes=40 and -l procs=40 _if_ >> you are using maui _and_ you don't set JOBNODEMATCHPOLICY to EXACTNODE >> >> You may need to 'fake' having a large number of nodes to make this work. >> >> There are old mailing list items describing such a setup and how to 'fake' >> the number of nodes. We've never actually done this at our site so I'm >> unsure on details. I don't like it :-) but it may suit/help you. >> >> Gareth >> >> >>> >>> >>> >>> >>> On Nov 18, 2011, at 11:12 AM, Lance Westerhoff wrote: >>> >>>> >>>> Hi Steve- >>>> >>>> Here you go. Here is the top few lines of the job script. I have then >>> provided the output you requested long with the maui.cfg. If you need >>> anything further, certainly please let me know. >>>> >>>> Thanks for your help! >>>> >>>> =============== >>>> >>>> + head job.pbs >>>> >>>> #!/bin/bash >>>> #PBS -S /bin/bash >>>> #PBS -l procs=100 >>>> #PBS -l pmem=700mb >>>> #PBS -l walltime=744:00:00 >>>> #PBS -j oe >>>> #PBS -q batch >>>> >>>> Report run on Fri Nov 18 10:49:38 EST 2011 >>>> + pbsnodes --version >>>> version: 3.0.2 >>>> + diagnose --version >>>> maui client version 3.2.6p21 >>>> + checkjob 371010 >>>> >>>> >>>> checking job 371010 >>>> >>>> State: Running >>>> Creds: user:josh group:games class:batch qos:DEFAULT >>>> WallTime: 00:02:35 of 31:00:00:00 >>>> SubmitTime: Fri Nov 18 10:46:33 >>>> (Time Queued Total: 00:00:01 Eligible: 00:00:01) >>>> >>>> StartTime: Fri Nov 18 10:46:34 >>>> Total Tasks: 1 >>>> >>>> Req[0] TaskCount: 26 Partition: DEFAULT >>>> Network: [NONE] Memory >= 700M Disk >= 0 Swap >= 0 >>>> Opsys: [NONE] Arch: [NONE] Features: [NONE] >>>> Dedicated Resources Per Task: PROCS: 1 MEM: 700M >>>> NodeCount: 10 >>>> Allocated Nodes: >>>> [compute-0-17:7][compute-0-10:4][compute-0-3:2][compute-0-5:3] >>>> [compute-0-6:1][compute-0-7:2][compute-0-9:1][compute-0-12:2] >>>> [compute-0-13:2][compute-0-14:2] >>>> >>>> >>>> IWD: [NONE] 
Executable: [NONE] >>>> Bypass: 0 StartCount: 1 >>>> PartitionMask: [ALL] >>>> Flags: RESTARTABLE >>>> >>>> Reservation '371010' (-00:02:09 -> 30:23:57:51 Duration: >>> 31:00:00:00) >>>> PE: 26.00 StartPriority: 4716 >>>> >>>> + cat /opt/maui/maui.cfg | grep -v "#" | grep "^[A-Z]" >>>> SERVERHOST gondor >>>> ADMIN1 maui root >>>> ADMIN3 ALL >>>> RMCFG[base] TYPE=PBS >>>> AMCFG[bank] TYPE=NONE >>>> RMPOLLINTERVAL 00:01:00 >>>> SERVERPORT 42559 >>>> SERVERMODE NORMAL >>>> LOGFILE maui.log >>>> LOGFILEMAXSIZE 10000000 >>>> LOGLEVEL 3 >>>> QUEUETIMEWEIGHT 1 >>>> FSPOLICY DEDICATEDPS >>>> FSDEPTH 7 >>>> FSINTERVAL 86400 >>>> FSDECAY 0.50 >>>> FSWEIGHT 200 >>>> FSUSERWEIGHT 1 >>>> FSGROUPWEIGHT 1000 >>>> FSQOSWEIGHT 1000 >>>> FSACCOUNTWEIGHT 1 >>>> FSCLASSWEIGHT 1000 >>>> USERWEIGHT 4 >>>> BACKFILLPOLICY FIRSTFIT >>>> RESERVATIONPOLICY CURRENTHIGHEST >>>> NODEALLOCATIONPOLICY MINRESOURCE >>>> RESERVATIONDEPTH 8 >>>> MAXJOBPERUSERPOLICY OFF >>>> MAXJOBPERUSERCOUNT 8 >>>> MAXPROCPERUSERPOLICY OFF >>>> MAXPROCPERUSERCOUNT 256 >>>> MAXPROCSECONDPERUSERPOLICY OFF >>>> MAXPROCSECONDPERUSERCOUNT 36864000 >>>> MAXJOBQUEUEDPERUSERPOLICY OFF >>>> MAXJOBQUEUEDPERUSERCOUNT 2 >>>> JOBNODEMATCHPOLICY EXACTNODE >>>> NODEACCESSPOLICY SHARED >>>> JOBMAXOVERRUN 99:00:00:00 >>>> DEFERCOUNT 8192 >>>> DEFERTIME 0 >>>> CLASSCFG[developer] FSTARGET=40.00+ >>>> CLASSCFG[lowprio] PRIORITY=-1000 >>>> SRCFG[developer] CLASSLIST=developer >>>> SRCFG[developer] ACCESS=dedicated >>>> SRCFG[developer] DAYS=Mon,Tue,Wed,Thu,Fri >>>> SRCFG[developer] STARTTIME=08:00:00 >>>> SRCFG[developer] ENDTIME=18:00:00 >>>> SRCFG[developer] TIMELIMIT=2:00:00 >>>> SRCFG[developer] RESOURCES=PROCS(8) >>>> USERCFG[DEFAULT] FSTARGET=100.0 >>>> >>>> =============== >>>> >>>> -Lance >>>> >>>> >>>> On Nov 18, 2011, at 9:47 AM, Steve Crusan wrote: >>>> >>>>> -----BEGIN PGP SIGNED MESSAGE----- >>>>> Hash: SHA1 >>>>> >>>>> >>>>> On Nov 18, 2011, at 9:33 AM, Lance Westerhoff wrote: >>>>> >>>>>> The request that is placed is 
for procs=60. Both torque and maui >>> see that there are only 53 processors available and instead of letting >>> the job sit in the queue and wait for all 60 processors to become >>> available, it goes ahead and runs the job with what's available. Now if >>> the user could ask for procs=[50-60] where 50 is the minimum number of >>> processors to provide and 60 is the maximum, this would be a feature. >>> But as it stands, if the user asks for 60 processors and ends up with 2 >>> processors, the job just won't scale properly and he may as well kill >>> it (when it shouldn't have run anyway). >>>>> >>>>> Hi Lance, >>>>> >>>>> Can you post the output of checkjob of an incorrectly >>> running job. Let's take a look at what Maui thinks the job is asking >>> for. >>>>> >>>>> Might as well add your maui.cfg file also. >>>>> >>>>> I've found in the past that procs= is troublesome... >>>>> >>>>>> >>>>>> I'm actually beginning to think the problem may be related to maui. >>> Perhaps I'll post this same question to the maui list and see what >>> comes back. >>>>>> >>>>>> This problem is infuriating though since without the functionality >>> working as it should, using procs=X in torque/maui makes torque/maui >>> work more like a submission and run system (not a queuing system). >>>>> >>>>> Agreed. HPC cluster job management is normally be set it and forget >>> it. Anything else other than maintenance/break fixes/new features would >>> be ridiculously time consuming. >>>>> >>>>>> >>>>>> -Lance >>>>>> >>>>>> >>>>>>> >>>>>>> Message: 3 >>>>>>> Date: Thu, 17 Nov 2011 17:29:17 -0800 >>>>>>> From: "Brock Palen" >>>>>>> Subject: Re: [torqueusers] procs= not working as documented >>>>>>> To: "Torque Users Mailing List" >>>>>>> Message-ID: >>> <20111118012930.C635E83A8026 at mail.adaptivecomputing.com> >>>>>>> Content-Type: text/plain; charset="utf-8" >>>>>>> >>>>>>> Does maui only see one cpu or does mpiexec only see one cpu? 
>>>>>>> >>>>>>> >>>>>>> >>>>>>> Brock Palen >>>>>>> (734)936-1985 >>>>>>> brockp at umich.edu >>>>>>> - Sent from my Palm Pre, please excuse typos >>>>>>> On Nov 17, 2011 3:19 PM, Lance Westerhoff >>> <lance at quantumbioinc.com> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hello All- >>>>>>> >>>>>>> >>>>>>> >>>>>>> It appears that when running with the following specs, the procs= >>> option does not actually work as expected. >>>>>>> >>>>>>> >>>>>>> >>>>>>> ========================================== >>>>>>> >>>>>>> >>>>>>> >>>>>>> #PBS -S /bin/bash >>>>>>> >>>>>>> #PBS -l procs=60 >>>>>>> >>>>>>> #PBS -l pmem=700mb >>>>>>> >>>>>>> #PBS -l walltime=744:00:00 >>>>>>> >>>>>>> #PBS -j oe >>>>>>> >>>>>>> #PBS -q batch >>>>>>> >>>>>>> >>>>>>> >>>>>>> torque version: tried 3.0.2. in v2.5.4, I think the procs option >>> worked as documented >>>>>>> >>>>>>> maui version: 3.2.6p21 (also tried maui 3.3.1 but it is a complete >>> fail in terms of the procs option and it only asks for a single CPU) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ========================================== >>>>>>> >>>>>>> >>>>>>> >>>>>>> If there are fewer then 60 processors available in the cluster (in >>> this case there were 53 available) the job will go in an take whatever >>> is left instead of waiting for all 60 processors to free up. Any >>> thoughts as to why this might be happening? Sometimes it doesn't really >>> matter and 53 would be almost as good as 60, however if only 2 >>> processors are available and the user asks for 60, I would hate for him >>> to go in. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Thank you for your time! 
>>>>>>> >>>>>>> >>>>>>> >>>>>>> -Lance >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> torqueusers mailing list >>>>>> torqueusers at supercluster.org >>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>>> >>>>> ---------------------- >>>>> Steve Crusan >>>>> System Administrator >>>>> Center for Research Computing >>>>> University of Rochester >>>>> https://www.crc.rochester.edu/ >>>>> >>>>> >>>>> -----BEGIN PGP SIGNATURE----- >>>>> Version: GnuPG/MacGPG2 v2.0.17 (Darwin) >>>>> Comment: GPGTools - http://gpgtools.org >>>>> >>>>> iQEcBAEBAgAGBQJOxnAEAAoJENS19LGOpgqK2CEH/Ry+THjmhxdTzcIZ5d5YYCP/ >>>>> bYQY2QthvbaEkUhh+q26m2EWrmPGHRgW9zXOx/fRBE2ejZE+EycpRLMdWDTOxn28 >>>>> cK1qs+ITaiOevNbxufd7pt/P5hhvafQgsDtuy8RPGokgqSuRBEH9i8DZAFfIASQZ >>>>> tQ9YE5MSqEfaoTSwOVP2PXJCgEJh2ZU5GHO2UvmxF4SX4+7HePUgQYzmzIBu2cW8 >>>>> JeeIpaf2AuNIvXjG3ZNA3FjHWQEZefiZhRTQxeE1PHuQCLWPnfTwz0nzquCHZBJv >>>>> Ufc1wOGanDi+LosRldVIUgAyHGcAcOvZzFnxlfNrYa2xfJSCyuC86YB4XNfpO1c= >>>>> =AGW7 >>>>> -----END PGP SIGNATURE----- >>>>> _______________________________________________ >>>>> torqueusers mailing list >>>>> torqueusers at supercluster.org >>>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>>> >>> >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > From knielson at adaptivecomputing.com Thu Feb 23 10:28:21 2012 From: knielson at adaptivecomputing.com (Ken Nielson) Date: Thu, 23 Feb 2012 10:28:21 -0700 (MST) Subject: [torqueusers] PBS_NODEFILE: Undefined variable. In-Reply-To: <07B91FF0-9FC6-491D-AB7E-DAFA8C429119@nyu.edu> Message-ID: Sreedhar, I'm glad it is working again. At least we know that the MOM gets into some state that is cleared up by a restart. Let us know if it happens again. 
Ken ----- Original Message ----- > From: "Sreedhar Manchu" > To: "Torque Users Mailing List" > Sent: Thursday, February 23, 2012 10:02:42 AM > Subject: Re: [torqueusers] PBS_NODEFILE: Undefined variable. > > Ken, > > I have restarted pbs_mom and now it works with #PBS -V in the script. > Not sure what exactly is happening. > > Sreedhar. > > On Feb 22, 2012, at 10:14 PM, Sreedhar Manchu wrote: > > > Hi Ken, > > > > First, thank you for your response. We see the problem with mpi > > jobs. When people use PBS_NODEFILE variable to define the hosts > > for mpiexec to run we are seeing this error. It doesn't happen all > > the time. I tried to test some simple jobs to see whether I can > > see this variable on some nodes and it was ok. Problem is that > > it's happening randomly. For whatever reasons this variable > > PBS_NODEFILE is getting initialized or defined in the job > > environment. > > > > Whenever it happened we restarted pbs_moms and it was ok. At this > > point I'm not sure whether it's happening repeatedly on the same > > nodes. Now I'm waiting to see whether it happens on the same nodes > > again. > > > > We see this error in .err files whenever people try to access > > PBS_NODEFILE. If we restart pbs_mom once the job fails with this > > error, another job with the same script works fine with out any > > errors. > > > > For example, one user runs his job and at the end of his script he > > submits another job with this statement. > > > > ssh -x login-0-0 "cd /home ; qsub run-openmpi.sh" > > > > We don't allow users to submit jobs from compute nodes. We have a > > submit filter in place on login nodes. This user pretty much > > submits the same job from the running job. Which means it ran well > > first time, but failed to do so because of above mentioned error. > > Another user gets exact error as in the subject when she tried to > > access PBS_NODEFILE as var=$PBS_NODEFILE. > > > > First user tries to run his job with this statement. 
> > > > /share/apps/openmpi/1.4.3/intel/bin/mpiexec \ > > -n 144 -hostfile $PBS_NODEFILE \ > > env OMP_NUM_THREADS=1 \ > > /home/mitgcmuv > > > > > > We see this error in the .err file. As you can see, since it > > couldn't find the $PBS_NODEFILE it thinks of env as a hostfile and > > fails. > > -------------------------------------------------------------------------- > > Open RTE was unable to open the hostfile: > > env > > Check to make sure the path and filename are correct. > > -------------------------------------------------------------------------- > > > > Once we restarted the pbs_mom same script worked fine. I'm not sure > > what's causing this. I don't see anything wrong either in torque > > logs or syslogs. > > > > Today I made all nodes offline and plan to restart pbs_mom on all > > nodes hoping this would fix the issue forever, even though I doubt > > it might not. As we're not sure whether it is happening on the > > same nodes, restarting all nodes might give us an idea on this as > > well. > > > > Please let us know if you have any thoughts on what might be > > happening with our case. > > > > Thank you once again for your response and time. > > > > Regards, > > Sreedhar. > > > > On Feb 22, 2012, at 2:58 PM, Ken Nielson wrote: > > > >> ----- Original Message ----- > >>> From: "Sreedhar Manchu" > >>> To: "Torque Users Mailing List" > >>> Sent: Wednesday, February 22, 2012 11:22:16 AM > >>> Subject: [torqueusers] PBS_NODEFILE: Undefined variable. > >>> > >>> Hi, > >>> > >>> Recently, I have upgraded Torque to it's 2.5.10 version. Since > >>> then > >>> we have been seeing this error "PBS_NODEFILE: Undefined > >>> variable.". > >>> If we restart pbs mom then everything works fine. Does anyone > >>> have > >>> any idea what's causing this behavior? > >>> > >>> Please let me know if you need any information that could help in > >>> figuring out the problem. > >>> > >>> Thanks, > >>> Sreedhar. 
> >> > >> Sreedhar, > >> > >> Is the error showing up at the console or in the log file? > >> > >> Ken > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > > > > --- > > Sreedhar Manchu > > HPC Support Specialist > > New York University > > 251 Mercer Street > > New York, NY 10012-1110 > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From Gareth.Williams at csiro.au Thu Feb 23 13:51:09 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Fri, 24 Feb 2012 07:51:09 +1100 Subject: [torqueusers] procs= not working as documented (or understood?) In-Reply-To: <5E24CF12-A4DE-4064-A37B-741E495FE705@quantumbioinc.com> References: <1932F66F-B18D-45F0-9BFE-E99EB7613BDE@quantumbioinc.com> <48B32496-0D68-4355-8FA8-65A71CFEB7BB@quantumbioinc.com> <007DECE986B47F4EABF823C1FBB19C620102CDD74E33@exvic-mbx04.nexus.csiro.au> <15C56CC3-B801-4000-96A6-B778EF0D9384@quantumbioinc.com> <5E24CF12-A4DE-4064-A37B-741E495FE705@quantumbioinc.com> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74E8D@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Lance Westerhoff [mailto:lance at quantumbioinc.com] > Sent: Friday, 24 February 2012 4:26 AM > To: Torque Users Mailing List > Subject: Re: [torqueusers] procs= not working as documented (or > understood?) > > > Hi All- > > Ok, I think I have it. The problem was the setting (or lack thereof) of > the server resources_available.nodect variable. 
> > I set it to the following, and things seem to be working as expected: > > set server resources_available.nodect=144 > > (where 144 is the number of processors in my cluster) Well found. That will be the setting needed to 'fake' the number of nodes. Ugly but effective. Maybe you can switch to procs if maui support improves. Gareth > I have also commented out the following line in maui.cfg. > > # JOBNODEMATCHPOLICY EXACTNODE > > Thanks to everyone for their help. Hopefully this problem is officially > behind us! > > -Lance From Gareth.Williams at csiro.au Thu Feb 23 13:51:55 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Fri, 24 Feb 2012 07:51:55 +1100 Subject: [torqueusers] PBS_NODEFILE:
Undefined variable. In-Reply-To: References: <8f3783b1-33f8-4ee2-bd28-3cc3f016957c@mail> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74E8E@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Sreedhar Manchu [mailto:sm4082 at nyu.edu] > Sent: Friday, 24 February 2012 1:36 AM > To: Torque Users Mailing List > Subject: Re: [torqueusers] PBS_NODEFILE: Undefined variable. > > Hi Ken, > > This morning I did few more test runs and found out it's #PBS -V in > script that's causing problems. Whenever I include it I'm seeing this > error. > > I think it's trying to look for this variable in user's exported > environment rather than looking into variables defined by it self. > Everything works fine if I take -V off the script. > > But the problem is some users need to mention this in their scripts to > export not only user defined variables but also system variables such > as LD_LIBRARY_PATH. > > Is there a way to fix it? Now I'll do some more tests to see we have > same problem with other pbs variables such as PBS_O_WORKDIR etc. > > Please let me know if you have any suggestions. I will send another > email once I do more testing. > > Thanks > Sreedhar. Hi Sreedhar, I see you are not sure that -V is the problem, but nevertheless I'd recommend to use -v with an explicit set of variables to be passed rather than -V. In most cases I'd say it is actually better to embed the variables in the script so it doesn't matter what environment you submit the job from - and use neither -v or -V. Many sites use 'environment modules' to set such variables. Good luck, Gareth > > -- > Sent from my phone. Please excuse my brevity and any typos. > > On Feb 22, 2012, at 14:58, Ken Nielson > wrote: > > > ----- Original Message ----- > >> From: "Sreedhar Manchu" > >> To: "Torque Users Mailing List" > >> Sent: Wednesday, February 22, 2012 11:22:16 AM > >> Subject: [torqueusers] PBS_NODEFILE: Undefined variable. 
> >> > >> Hi, > >> > >> Recently, I have upgraded Torque to it's 2.5.10 version. Since then > >> we have been seeing this error "PBS_NODEFILE: Undefined variable.". > >> If we restart pbs mom then everything works fine. Does anyone have > >> any idea what's causing this behavior? > >> > >> Please let me know if you need any information that could help in > >> figuring out the problem. > >> > >> Thanks, > >> Sreedhar. > > > > Sreedhar, > > > > Is the error showing up at the console or in the log file? > > > > Ken > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers From sm4082 at nyu.edu Thu Feb 23 14:00:53 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Thu, 23 Feb 2012 16:00:53 -0500 Subject: [torqueusers] PBS_NODEFILE: Undefined variable. In-Reply-To: References: Message-ID: <268689C2-5E23-477A-8CC1-07828C32A139@nyu.edu> Hi Ken, Unfortunately, it's happening again. It works for a while after the restart. But it starts happening all again after a while. File (/opt/torque/aux/PBS_JOBID) is there in /aux/ but some how PBS_NODEFILE is not getting initialized to this path. Lot of PBS variables are missing in environment. PBS_NODEFILE is one of them. For now, we wrote a script asking users to source it. It checks for PBS_NODEFILE variable and if it doesn't exist then it assigns the path of /aux/PBS_JOBID to it. Not sure whether we're the only ones facing or there is a problem with 2.5.10 version it self. Please let me know if you need any more information from my side in case you want to look into this problem. Thanks, Sreedhar. On Feb 23, 2012, at 12:28 PM, Ken Nielson wrote: > Sreedhar, > > I'm glad it is working again. At least we know that the MOM gets into some state that is cleared up by a restart. Let us know if it happens again. 
> > Ken > > ----- Original Message ----- >> From: "Sreedhar Manchu" >> To: "Torque Users Mailing List" >> Sent: Thursday, February 23, 2012 10:02:42 AM >> Subject: Re: [torqueusers] PBS_NODEFILE: Undefined variable. >> >> Ken, >> >> I have restarted pbs_mom and now it works with #PBS -V in the script. >> Not sure what exactly is happening. >> >> Sreedhar. >> >> On Feb 22, 2012, at 10:14 PM, Sreedhar Manchu wrote: >> >>> Hi Ken, >>> >>> First, thank you for your response. We see the problem with mpi >>> jobs. When people use PBS_NODEFILE variable to define the hosts >>> for mpiexec to run we are seeing this error. It doesn't happen all >>> the time. I tried to test some simple jobs to see whether I can >>> see this variable on some nodes and it was ok. Problem is that >>> it's happening randomly. For whatever reasons this variable >>> PBS_NODEFILE is getting initialized or defined in the job >>> environment. >>> >>> Whenever it happened we restarted pbs_moms and it was ok. At this >>> point I'm not sure whether it's happening repeatedly on the same >>> nodes. Now I'm waiting to see whether it happens on the same nodes >>> again. >>> >>> We see this error in .err files whenever people try to access >>> PBS_NODEFILE. If we restart pbs_mom once the job fails with this >>> error, another job with the same script works fine with out any >>> errors. >>> >>> For example, one user runs his job and at the end of his script he >>> submits another job with this statement. >>> >>> ssh -x login-0-0 "cd /home ; qsub run-openmpi.sh" >>> >>> We don't allow users to submit jobs from compute nodes. We have a >>> submit filter in place on login nodes. This user pretty much >>> submits the same job from the running job. Which means it ran well >>> first time, but failed to do so because of above mentioned error. >>> Another user gets exact error as in the subject when she tried to >>> access PBS_NODEFILE as var=$PBS_NODEFILE. >>> >>> First user tries to run his job with this statement. 
>>> >>> /share/apps/openmpi/1.4.3/intel/bin/mpiexec \ >>> -n 144 -hostfile $PBS_NODEFILE \ >>> env OMP_NUM_THREADS=1 \ >>> /home/mitgcmuv >>> >>> >>> We see this error in the .err file. As you can see, since it >>> couldn't find the $PBS_NODEFILE it thinks of env as a hostfile and >>> fails. >>> -------------------------------------------------------------------------- >>> Open RTE was unable to open the hostfile: >>> env >>> Check to make sure the path and filename are correct. >>> -------------------------------------------------------------------------- >>> >>> Once we restarted the pbs_mom same script worked fine. I'm not sure >>> what's causing this. I don't see anything wrong either in torque >>> logs or syslogs. >>> >>> Today I made all nodes offline and plan to restart pbs_mom on all >>> nodes hoping this would fix the issue forever, even though I doubt >>> it might not. As we're not sure whether it is happening on the >>> same nodes, restarting all nodes might give us an idea on this as >>> well. >>> >>> Please let us know if you have any thoughts on what might be >>> happening with our case. >>> >>> Thank you once again for your response and time. >>> >>> Regards, >>> Sreedhar. >>> >>> On Feb 22, 2012, at 2:58 PM, Ken Nielson wrote: >>> >>>> ----- Original Message ----- >>>>> From: "Sreedhar Manchu" >>>>> To: "Torque Users Mailing List" >>>>> Sent: Wednesday, February 22, 2012 11:22:16 AM >>>>> Subject: [torqueusers] PBS_NODEFILE: Undefined variable. >>>>> >>>>> Hi, >>>>> >>>>> Recently, I have upgraded Torque to it's 2.5.10 version. Since >>>>> then >>>>> we have been seeing this error "PBS_NODEFILE: Undefined >>>>> variable.". >>>>> If we restart pbs mom then everything works fine. Does anyone >>>>> have >>>>> any idea what's causing this behavior? >>>>> >>>>> Please let me know if you need any information that could help in >>>>> figuring out the problem. >>>>> >>>>> Thanks, >>>>> Sreedhar. 
>>>> >>>> Sreedhar, >>>> >>>> Is the error showing up at the console or in the log file? >>>> >>>> Ken >>>> _______________________________________________ >>>> torqueusers mailing list >>>> torqueusers at supercluster.org >>>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> >>> --- >>> Sreedhar Manchu >>> HPC Support Specialist >>> New York University >>> 251 Mercer Street >>> New York, NY 10012-1110 >>> >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From sm4082 at nyu.edu Thu Feb 23 14:19:33 2012 From: sm4082 at nyu.edu (Sreedhar Manchu) Date: Thu, 23 Feb 2012 16:19:33 -0500 Subject: [torqueusers] PBS_NODEFILE: Undefined variable. In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74E8E@exvic-mbx04.nexus.csiro.au> References: <8f3783b1-33f8-4ee2-bd28-3cc3f016957c@mail> <007DECE986B47F4EABF823C1FBB19C620102CDD74E8E@exvic-mbx04.nexus.csiro.au> Message-ID: <019CBB41-98AA-4873-8823-BF6BC8716175@nyu.edu> Hi Gareth, We also use environment modules. But the problem is that for parallel jobs mentioning "load module " loads modules on mother superior. But for mpiexec compiled with intel compilers, even mentioning load module doesn't load required modules on the other nodes involved in parallel job. So we ask users to add "load module to their .bashrc and doing #PBS -V exports the whole environment onto all the nodes and so everything works well. Overall, I'm not sure whether there is a problem with this version 2.5.10 or there is a problem with the way I installed torque. 
But for now we wrote a script to check for PBS_NODEFILE variable and if it doesn't exist source the script so that the path of the file in /opt/torque/aux/PBS_JOBID to it. I am adding this to qsub wrapper so that it adds a line to source the script to user's pbs script at the end of pbs directives. For now this works ok. Recently, tons of jobs failed because of this error. Hopefully, this fixes the problem. Thanks again for writing. Regards, Sreedhar. On Feb 23, 2012, at 3:51 PM, wrote: >> -----Original Message----- >> From: Sreedhar Manchu [mailto:sm4082 at nyu.edu] >> Sent: Friday, 24 February 2012 1:36 AM >> To: Torque Users Mailing List >> Subject: Re: [torqueusers] PBS_NODEFILE: Undefined variable. >> >> Hi Ken, >> >> This morning I did few more test runs and found out it's #PBS -V in >> script that's causing problems. Whenever I include it I'm seeing this >> error. >> >> I think it's trying to look for this variable in user's exported >> environment rather than looking into variables defined by it self. >> Everything works fine if I take -V off the script. >> >> But the problem is some users need to mention this in their scripts to >> export not only user defined variables but also system variables such >> as LD_LIBRARY_PATH. >> >> Is there a way to fix it? Now I'll do some more tests to see we have >> same problem with other pbs variables such as PBS_O_WORKDIR etc. >> >> Please let me know if you have any suggestions. I will send another >> email once I do more testing. >> >> Thanks >> Sreedhar. > > Hi Sreedhar, > > I see you are not sure that -V is the problem, but nevertheless I'd recommend to use -v with an explicit set of variables to be passed rather than -V. In most cases I'd say it is actually better to embed the variables in the script so it doesn't matter what environment you submit the job from - and use neither -v or -V. Many sites use 'environment modules' to set such variables. > > Good luck, > > Gareth > >> >> -- >> Sent from my phone. 
Please excuse my brevity and any typos. >> >> On Feb 22, 2012, at 14:58, Ken Nielson >> wrote: >> >>> ----- Original Message ----- >>>> From: "Sreedhar Manchu" >>>> To: "Torque Users Mailing List" >>>> Sent: Wednesday, February 22, 2012 11:22:16 AM >>>> Subject: [torqueusers] PBS_NODEFILE: Undefined variable. >>>> >>>> Hi, >>>> >>>> Recently, I have upgraded Torque to it's 2.5.10 version. Since then >>>> we have been seeing this error "PBS_NODEFILE: Undefined variable.". >>>> If we restart pbs mom then everything works fine. Does anyone have >>>> any idea what's causing this behavior? >>>> >>>> Please let me know if you need any information that could help in >>>> figuring out the problem. >>>> >>>> Thanks, >>>> Sreedhar. >>> >>> Sreedhar, >>> >>> Is the error showing up at the console or in the log file? >>> >>> Ken >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From samuel at unimelb.edu.au Thu Feb 23 21:54:25 2012 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 24 Feb 2012 15:54:25 +1100 Subject: [torqueusers] torque tmpdir on Lustre filesystem In-Reply-To: <4F3CD1E6.3050309@cyf-kr.edu.pl> References: <4F3CD1E6.3050309@cyf-kr.edu.pl> Message-ID: <4F471801.6000207@unimelb.edu.au> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 16/02/12 20:52, Lukasz Flis wrote: > I'm just curious if we're the only site in the world facing the > problem. We don't use Lustre (we have Panasas and GPFS), but just wondering does this happen all the time, or only occasionally ? If occasionaly then if the job fails once, will it always fail, or will it work if you try again? 
cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk9HGAEACgkQO2KABBYQAh8GLQCePQRIDEoBH60XPQU3MoPK93oM KIkAnRx1WHTJ2l5hNjGd4Deg8dCxTbDY =QJLk -----END PGP SIGNATURE----- From mailmaverick666 at gmail.com Thu Feb 23 22:10:13 2012 From: mailmaverick666 at gmail.com (rishi pathak) Date: Fri, 24 Feb 2012 10:40:13 +0530 Subject: [torqueusers] torque tmpdir on Lustre filesystem In-Reply-To: <4F3CD1E6.3050309@cyf-kr.edu.pl> References: <4F3CD1E6.3050309@cyf-kr.edu.pl> Message-ID: We use HP Scalable File Share(a variant of Lustre) set as tmpdir and have not seen this error. It is being used widely by users via data staging and so far we have not encountered any such errors. Our server and clients are on RHEL5 though On Thu, Feb 16, 2012 at 3:22 PM, Lukasz Flis wrote: > Hello, > > I would like to ask Lustre users if anyone has pbs_mom variable $tmpdir > set to a directory located on Lustre filesystem? > > Such kind of setup in our case (Lustre 2.1.0 and Lustre 1.8.6) results > in Torque being unable to create temprorary directory. Not all attempts > are unsuccessful but it is still significant amount. 
> > > Feb 14 13:56:02 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in > TMakeTmpDir, Unable to make job transient directory: > /mnt/lustre/scratch/jobs/18555591.batch.grid > Feb 14 13:56:23 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in > TMakeTmpDir, Unable to make job transient directory: > /mnt/lustre/scratch/jobs/18555647.batch.grid > Feb 14 14:37:35 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in > TMakeTmpDir, Unable to make job transient directory: > /mnt/lustre/scratch/jobs/18557701.batch.grid > Feb 14 14:38:17 n6-4-16 pbs_mom: LOG_ERROR::Permission denied (13) in > TMakeTmpDir, Unable to make job transient directory: > /mnt/lustre/scratch/jobs/18557716.batch.grid > > This is not a bug in torque but problem with lustre itself. > > I'm just curious if we're the only site in the world facing the problem. > > Here is a bug report in jira at whamcloud for more details: > > http://jira.whamcloud.com/browse/LU-1101 > > > Cheers, > -- > Lukasz Flis > ACC Cyfronet AGH, > Cracow, POLAND > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- --- Rishi Pathak National PARAM Supercomputing Facility C-DAC, Pune, India -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120224/d477d678/attachment.html From cholam20 at yahoo.co.in Fri Feb 24 02:44:08 2012 From: cholam20 at yahoo.co.in (revathi ganesh) Date: Fri, 24 Feb 2012 15:14:08 +0530 (IST) Subject: [torqueusers] Fwd: Found interesting opportunity... Message-ID: <1330076648.96540.androidMobile@web137306.mail.in.yahoo.com>


-------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120224/4363a325/attachment-0001.html From Gareth.Williams at csiro.au Fri Feb 24 03:03:12 2012 From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au) Date: Fri, 24 Feb 2012 21:03:12 +1100 Subject: [torqueusers] vmem and pvmem In-Reply-To: <20120215231031.GA956@stikine.sfu.ca> References: <20120215231031.GA956@stikine.sfu.ca> Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74E97@exvic-mbx04.nexus.csiro.au> > -----Original Message----- > From: Martin Siegert [mailto:siegert at sfu.ca] > Sent: Thursday, 16 February 2012 10:11 AM > To: torqueusers at supercluster.org > Subject: [torqueusers] vmem and pvmem > > Hi, > > I am struggling with the implmentation of the vmem and pvmem > resources (in principle the exact same concerns apply to mem > and pmem). Let's say I set > > resources_default.pvmem = 1gb > > in qmgr. Now a user submits a job requesting -l procs=1,vmem=2gb > and the job fails because it exceeds the pvmem default. Apparently > torque treats vmem and pvmem as two independent resources, which > (in particular for 1p jobs) is not very reasonable. > Similarly, if I would set resources_default.vmem, jobs that > request pvmem fail even if the specified > (amount of pvmem)*(no. of requested processors) > vmem default > > How do people deal with this issue? > As far as I can tell moab only uses pmem and pvmem, i.e., moab > converts vmem to pvmem = vmem/procct and mem to vmem = mem/procct. > Correct? Shouldn't torque do the same? > I am worried about shared memory jobs though - jobs were pvmem > is not really relevant since all processes share the same memory > and vmem is not simply the sum of the process memory usage (at > least you cannot add up amounts displayed by ps, etc.). But I do > not know whether torque handles this correctly anyway, does it? 
>
> For now I modified the torque_submitfilter posted by Gareth
> http://www.clusterresources.com/pipermail/torquedev/2011-March/003479.html
> (thanks Gareth!) to add a qsub option -l pvmem=... in those
> cases when the user requests vmem, but this appears to be an ugly
> workaround. Shouldn't there be a better way?
>
> Cheers,
> Martin

Hi Martin,

I hoped someone else would bite but no joy!

We expect/advise people to only request vmem and set a modest default vmem which forces them to explicitly specify vmem in most cases. The pbs_resources man page describes vmem as an aggregate limit across the whole job.

I wanted to be sure what actually happened with both vmem and pvmem requested (and with no pvmem specified) so I ran a simple test, starting a multi-cpu job and looking at the imposed 'ulimit'.

Core_req        vmem   pvmem   ulimit-v   RPT
=============================================
nodes=1:ppn=2   1gb    256mb   256mb      512mb
procs=2         1gb    256mb   256mb      1gb
nodes=1:ppn=2   1gb    4gb     1gb        4gb
procs=2         1gb    4gb     1gb        4gb
nodes=1:ppn=2   1gb    -       1gb        512mb
procs=2         1gb    -       1gb        1gb

So the ulimit value that influences whether a task can allocate memory is set as the lower of the vmem and pvmem values. That makes some sense - at least more sense than taking the larger value. What doesn't make sense is allowing pvmem to be higher than vmem in the first place - in that case torque should probably reject the job or 'fix' one of the settings, but leaving it as is might not be so bad, except for moab's behaviour (keep reading).

I've noted in the past that setting the ulimit to vmem (the aggregate limit for the job) is highly conservative, but someone just might want to use most of the memory in a parallel job in one task.

I did not include in the test what torque does if you run multiple tasks and exceed the vmem limit in aggregate. I suspect torque will kill the job in that case when it notices, but the job can get to that state as long as each task stays under the ulimit setting.
Of course the job might fill memory first...

The last column is the Resources Per Task (my shorthand) that moab dedicates in its scheduling (you can see it in the output of checkjob). As you can see this seems wacky:
- with nodes/ppn it is the larger of vmem/procct and pvmem
- with procs it is the larger of vmem and pvmem
In neither case do these limits agree with the ulimit set by torque. Moab might also kill jobs if it thinks that limits are exceeded but seems unlikely to get a chance.

The different treatment of nodes/ppn and procs by moab is a gotcha and I'd consider it a bug. Moab seems to consider vmem to be a per-task setting rather than an aggregate if you specify procs.

I'd be interested to know if maui or pbs_sched dedicates a different amount of vmem per task.

So all up, answering Martin's question further: apart from sticking to vmem only, I'd advise also specifying pvmem only if you want to explicitly state that the memory should be evenly divided (pretty common!) and have the ulimit reflect that - and of course it should be set to vmem/procct. Is that what your new filter does, Martin? Just setting pvmem might be a good option, though it might not allow for the asymmetric memory case...
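
The vmem/procct arithmetic mentioned above can be sketched as follows. This is only an illustration, not the posted submit filter: the names to_kb and even_pvmem are hypothetical, only the kb/mb/gb/tb suffixes are handled, and binary multiples (1gb = 1024mb) are an assumption.

```python
import re

# Size suffixes expressed in kilobytes. Binary multiples are an assumption
# made for this sketch; check how your site interprets the suffixes.
_UNITS_KB = {"kb": 1, "mb": 1024, "gb": 1024 ** 2, "tb": 1024 ** 3}


def to_kb(size):
    """Convert a size string such as '1gb' or '256mb' to kilobytes."""
    match = re.fullmatch(r"(\d+)(kb|mb|gb|tb)", size.strip().lower())
    if match is None:
        raise ValueError("unsupported size string: %r" % size)
    return int(match.group(1)) * _UNITS_KB[match.group(2)]


def even_pvmem(vmem, procct):
    """Return a pvmem value that divides vmem evenly across procct tasks."""
    if procct < 1:
        raise ValueError("procct must be at least 1")
    return "%dkb" % (to_kb(vmem) // procct)


if __name__ == "__main__":
    # Hypothetical request: -l procs=2,vmem=1gb
    print("-l pvmem=" + even_pvmem("1gb", 2))
```

A filter built on this would only add the pvmem request when the user asked for vmem and did not set pvmem, so the asymmetric memory case stays possible by omitting pvmem.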
Gareth > > -- > Martin Siegert > Head, Research Computing > Simon Fraser University > Burnaby, British Columbia From toth at fi.muni.cz Fri Feb 24 03:19:37 2012 From: toth at fi.muni.cz (=?windows-1252?Q?=22Mgr=2E_=8Aimon_T=F3th=22?=) Date: Fri, 24 Feb 2012 11:19:37 +0100 Subject: [torqueusers] vmem and pvmem In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74E97@exvic-mbx04.nexus.csiro.au> References: <20120215231031.GA956@stikine.sfu.ca> <007DECE986B47F4EABF823C1FBB19C620102CDD74E97@exvic-mbx04.nexus.csiro.au> Message-ID: <4F476439.10609@fi.muni.cz> > Core_req vmem pvmem ulimit-v RPT > ========================================= > nodes=1:ppn=2 1gb 256mb 256mb 512mb > procs=2 1gb 256mb 256mb 1gb > nodes=1:ppn=2 1gb 4gb 1gb 4gb > procs=2 1gb 4gb 1gb 4gb > nodes=1:ppn=2 1gb - 1gb 512mb > procs=2 1gb - 1gb 1gb > > So the ulimit value that influences whether a task can allocate > memory, is set as the lower of the vmem and pvmem values. That > makes some sense - at least more sense than taking the larger > value. What doesn't make sense is allowing pvmem to be higher > than vmem in the first place - in that case torque should probably > reject the job or 'fix' one of the settings but leaving it as is > might not be so bad, except for moab's behaviour (keep reading). No. The logic is as follows: * if pvmem (or pmem) is set then set the corresponding ulimit to pvmem (pmem) value * if pvmem (or pmem) isn't set then set the corresponding ulimit to vmem (mem) value Note that using pvmem is mostly pointless. On Linux this represents address space, not virtual memory. You can use vmem as virtual memory, but even that is extremely confusing. -- Mgr. 
Simon Toth

From l.flis at cyf-kr.edu.pl Fri Feb 24 03:24:30 2012
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Fri, 24 Feb 2012 11:24:30 +0100
Subject: [torqueusers] torque tmpdir on Lustre filesystem
In-Reply-To: <4F471801.6000207@unimelb.edu.au>
References: <4F3CD1E6.3050309@cyf-kr.edu.pl> <4F471801.6000207@unimelb.edu.au>
Message-ID: <4F47655E.6060503@cyf-kr.edu.pl>

Hello Christopher, Hi *

>
> We don't use Lustre (we have Panasas and GPFS), but just wondering
> does this happen all the time, or only occasionally ?

It happens occasionally. But as I said - this seems like a bug in Lustre FS, and it has nothing to do with torque code. Torque is using an unlucky sequence of stat/mkdir calls which exposes the lustre misbehaviour.

> If occasionaly then if the job fails once, will it always fail, or
> will it work if you try again?

Another call to the mkdirtree() function should succeed after a few seconds of sleep.

I believe this behaviour in the Lustre client appeared in the 1.8.x line and remains in 2.1.x. HP SFS IIRC is based on 1.4 and 1.6 so it's not affected.

We have observed the BUG in Lustre 1.8.(4,5,6) infrastructure. Then we moved to the 2.1 line, replacing all the components (servers, arrays, fabric), and the bug remained.

The problem with lustre is that a mkdir() call on an EXISTING directory returns an EPERM error instead of EEXIST once in a while, usually when stat() is called before mkdir.
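
A minimal sketch of the retry idea described above, under stated assumptions: this is not Torque's mkdirtree() code (which is C), and the function name make_job_dir, the retry count, and the delay are hypothetical. It treats a permission error from mkdir as success when the path already exists as a directory, and otherwise sleeps and tries again. Python's PermissionError covers both EPERM and EACCES, which matters here because the pbs_mom log lines earlier in the thread show errno 13.

```python
import os
import time


def make_job_dir(path, retries=3, delay=1.0, mode=0o700):
    """Create path, tolerating a spurious permission error on an existing dir.

    Returns True if the directory exists when the function returns,
    False if it could not be created within the given number of attempts.
    """
    for attempt in range(retries):
        try:
            os.mkdir(path, mode)
            return True
        except FileExistsError:
            # The expected answer for an existing path: fine if it is a dir.
            return os.path.isdir(path)
        except PermissionError:
            # Covers EPERM and EACCES. If the directory is really there,
            # the error was spurious and the goal is already met.
            if os.path.isdir(path):
                return True
            if attempt < retries - 1:
                time.sleep(delay)
    return False
```

A wrapper like this only papers over the symptom; whether it is acceptable depends on whether a genuine permission problem should fail the job immediately at your site.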
I belive doing mkdir on a existing path is not very common practice and that's the reason the BUG was unnoticed for a long time Cheers, -- Lukasz Flis From siegert at sfu.ca Fri Feb 24 15:00:09 2012 From: siegert at sfu.ca (Martin Siegert) Date: Fri, 24 Feb 2012 14:00:09 -0800 Subject: [torqueusers] vmem and pvmem In-Reply-To: <4F476439.10609@fi.muni.cz> References: <20120215231031.GA956@stikine.sfu.ca> <007DECE986B47F4EABF823C1FBB19C620102CDD74E97@exvic-mbx04.nexus.csiro.au> <4F476439.10609@fi.muni.cz> Message-ID: <20120224220009.GC29630@stikine.sfu.ca> On Fri, Feb 24, 2012 at 11:19:37AM +0100, "Mgr. ?imon T?th" wrote: > > Core_req vmem pvmem ulimit-v RPT > > ========================================= > > nodes=1:ppn=2 1gb 256mb 256mb 512mb > > procs=2 1gb 256mb 256mb 1gb > > nodes=1:ppn=2 1gb 4gb 1gb 4gb > > procs=2 1gb 4gb 1gb 4gb > > nodes=1:ppn=2 1gb - 1gb 512mb > > procs=2 1gb - 1gb 1gb > > > > So the ulimit value that influences whether a task can allocate > > memory, is set as the lower of the vmem and pvmem values. That > > makes some sense - at least more sense than taking the larger > > value. What doesn't make sense is allowing pvmem to be higher > > than vmem in the first place - in that case torque should probably > > reject the job or 'fix' one of the settings but leaving it as is > > might not be so bad, except for moab's behaviour (keep reading). > > No. The logic is as follows: > > * if pvmem (or pmem) is set > then set the corresponding ulimit to pvmem (pmem) value > > * if pvmem (or pmem) isn't set > then set the corresponding ulimit to vmem (mem) value > > Note that using pvmem is mostly pointless. On Linux this represents > address space, not virtual memory. > > You can use vmem as virtual memory, but even that is extremely confusing. I do not understand this comment. Both pvmem and vmem requests will result in RLIMIT_AS getting set. When I submit a MPI job using, e.g., procs=N, why is requesting pvmem=X mostly pointless? 
Shouldn't it be totally equivalent to requesting vmem=X*N ? Cheers, Martin From David.Singleton at anu.edu.au Fri Feb 24 16:50:00 2012 From: David.Singleton at anu.edu.au (David Singleton) Date: Sat, 25 Feb 2012 10:50:00 +1100 Subject: [torqueusers] vmem and pvmem In-Reply-To: <20120224220009.GC29630@stikine.sfu.ca> References: <20120215231031.GA956@stikine.sfu.ca> <007DECE986B47F4EABF823C1FBB19C620102CDD74E97@exvic-mbx04.nexus.csiro.au> <4F476439.10609@fi.muni.cz> <20120224220009.GC29630@stikine.sfu.ca> Message-ID: <4F482228.8080208@anu.edu.au> On 02/25/2012 09:00 AM, Martin Siegert wrote: > On Fri, Feb 24, 2012 at 11:19:37AM +0100, "Mgr. ?imon T?th" wrote: >>> Core_req vmem pvmem ulimit-v RPT >>> ========================================= >>> nodes=1:ppn=2 1gb 256mb 256mb 512mb >>> procs=2 1gb 256mb 256mb 1gb >>> nodes=1:ppn=2 1gb 4gb 1gb 4gb >>> procs=2 1gb 4gb 1gb 4gb >>> nodes=1:ppn=2 1gb - 1gb 512mb >>> procs=2 1gb - 1gb 1gb >>> >>> So the ulimit value that influences whether a task can allocate >>> memory, is set as the lower of the vmem and pvmem values. That >>> makes some sense - at least more sense than taking the larger >>> value. What doesn't make sense is allowing pvmem to be higher >>> than vmem in the first place - in that case torque should probably >>> reject the job or 'fix' one of the settings but leaving it as is >>> might not be so bad, except for moab's behaviour (keep reading). >> >> No. The logic is as follows: >> >> * if pvmem (or pmem) is set >> then set the corresponding ulimit to pvmem (pmem) value >> >> * if pvmem (or pmem) isn't set >> then set the corresponding ulimit to vmem (mem) value >> >> Note that using pvmem is mostly pointless. On Linux this represents >> address space, not virtual memory. >> >> You can use vmem as virtual memory, but even that is extremely confusing. > > I do not understand this comment. Both pvmem and vmem requests will > result in RLIMIT_AS getting set. 
I disagree with vmem setting RLIMIT_AS if that is what is happening.

> When I submit a MPI job using, e.g., procs=N, why is requesting
> pvmem=X mostly pointless? Shouldn't it be totally equivalent to
> requesting vmem=X*N ?
>

I think we have had the discussion of what procs means on a number of occasions (look for the thread "processes vs processors"). I believe "procs" (now) means (virtual) processORs (most commonly, they are cores). They are not processes. [In OpenPBS they were processes and only the UNICOS MOM supported that limit. At least in torque-3.0.2 procs is still not properly documented in pbs_resources* man pages.]

pvmem sets some sort of memory limit per *process* so vmem should have nothing to do with procs and pvmem.

pvmem and vmem are pretty much orthogonal. One is a voluntary limit the user places on their job processes (useless for actual resource scheduling) and the other is something any well-configured system should require a user to specify so that the resources of the system can be managed. In particular, a job with only a pvmem limit can OOM any size node simply by spawning enough processes. Setting both independently (should a user choose to do so) seems perfectly sensible. But I agree with Gareth that it only makes sense to request vmem.

Now what vmem actually is and how it should be evaluated and limited is a whole other discussion ...
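
One way to see which address-space limit a job actually received is to print it from inside the job. The following is a minimal sketch that assumes a Python interpreter is available on the compute node; resource.getrlimit and RLIMIT_AS are standard library, while the function name describe_as_limit is only illustrative.

```python
import resource


def describe_as_limit():
    """Return the soft and hard RLIMIT_AS of this process as a string."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)

    def fmt(value):
        # RLIM_INFINITY means no limit was imposed.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return "%dkb" % (value // 1024)

    return "RLIMIT_AS soft=%s hard=%s" % (fmt(soft), fmt(hard))


if __name__ == "__main__":
    print(describe_as_limit())
```

Running this in test jobs submitted with different vmem and pvmem requests would show what the MOM imposes on a given installation, without relying on anyone's reading of the source.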
David

From toth at fi.muni.cz Sat Feb 25 03:18:30 2012
From: toth at fi.muni.cz (Mgr. Šimon Tóth)
Date: Sat, 25 Feb 2012 11:18:30 +0100
Subject: [torqueusers] vmem and pvmem
In-Reply-To: <4F482228.8080208@anu.edu.au>
References: <20120215231031.GA956@stikine.sfu.ca> <007DECE986B47F4EABF823C1FBB19C620102CDD74E97@exvic-mbx04.nexus.csiro.au> <4F476439.10609@fi.muni.cz> <20120224220009.GC29630@stikine.sfu.ca> <4F482228.8080208@anu.edu.au>
Message-ID: <4F48B576.6040009@fi.muni.cz>

On 25.2.2012 00:50, David Singleton wrote:
> On 02/25/2012 09:00 AM, Martin Siegert wrote:
>> On Fri, Feb 24, 2012 at 11:19:37AM +0100, "Mgr. Šimon Tóth" wrote:
>>>> Core_req        vmem  pvmem  ulimit-v  RPT
>>>> ==========================================
>>>> nodes=1:ppn=2   1gb   256mb  256mb     512mb
>>>> procs=2         1gb   256mb  256mb     1gb
>>>> nodes=1:ppn=2   1gb   4gb    1gb       4gb
>>>> procs=2         1gb   4gb    1gb       4gb
>>>> nodes=1:ppn=2   1gb   -      1gb       512mb
>>>> procs=2         1gb   -      1gb       1gb
>>>>
>>>> So the ulimit value that influences whether a task can allocate
>>>> memory, is set as the lower of the vmem and pvmem values. That
>>>> makes some sense - at least more sense than taking the larger
>>>> value. What doesn't make sense is allowing pvmem to be higher
>>>> than vmem in the first place - in that case torque should probably
>>>> reject the job or 'fix' one of the settings but leaving it as is
>>>> might not be so bad, except for moab's behaviour (keep reading).
>>>
>>> No. The logic is as follows:
>>>
>>> * if pvmem (or pmem) is set
>>>   then set the corresponding ulimit to pvmem (pmem) value
>>>
>>> * if pvmem (or pmem) isn't set
>>>   then set the corresponding ulimit to vmem (mem) value
>>>
>>> Note that using pvmem is mostly pointless. On Linux this represents
>>> address space, not virtual memory.
>>>
>>> You can use vmem as virtual memory, but even that is extremely confusing.
>>
>> I do not understand this comment. Both pvmem and vmem requests will
>> result in RLIMIT_AS getting set.
>
> I disagree with vmem setting RLIMIT_AS if that is what is happening.

Yes it does. It is logical, but also pointless, since this value is
checked by the node itself. Therefore if the process is not killed by the
ulimit it will be killed by the node daemon.

The issue here is that RLIMIT_AS is limiting address space (on Linux).
That's not a resource. If you map a 1GB file to memory you are not using
up anything.

The correct resource to be limited is mem (pmem) not vmem (pvmem).

--
Mgr. Simon Toth

From dina.mahmoud87 at gmail.com Sat Feb 25 04:26:37 2012
From: dina.mahmoud87 at gmail.com (Dina Mahmoud)
Date: Sat, 25 Feb 2012 13:26:37 +0200
Subject: [torqueusers] [OSC mpiexec] Integrating it with torque

Hey guys,

I'm having a problem integrating OSC mpiexec with Torque. I configured
mpiexec normally using this command:

./configure prefix=/usr/local/mpiexec --with-pbs=/usr/local/torque --with-default-comm=mpich2-pmi

but when I run "make" I get the following error:

/usr/bin/ld: skipping incompatible /usr/local/torque/lib/libtorque.so when searching for -ltorque
/usr/bin/ld: skipping incompatible /usr/local/torque/lib/libtorque.a when searching for -ltorque
/usr/bin/ld: cannot find -ltorque
collect2: ld returned 1 exit status
make: *** [mpiexec] Error 1

Can anyone help with this error?

From akohlmey at cmm.chem.upenn.edu Sat Feb 25 07:57:40 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Sat, 25 Feb 2012 09:57:40 -0500
Subject: [torqueusers] [OSC mpiexec] Integrating it with torque

On Sat, Feb 25, 2012 at 6:26 AM, Dina Mahmoud wrote:
> Hey guys,
>
> i'm having a problem integrating osc mpiexec with torque
> i configured the mpiexec normally using this command:
> ./configure prefix=/usr/local/mpiexec --with-pbs=/usr/local/torque
> --with-default-comm=mpich2-pmi
>
> but when i run "make" i get the following error:
> /usr/bin/ld: skipping incompatible /usr/local/torque/lib/libtorque.so when
> searching for -ltorque
> /usr/bin/ld: skipping incompatible /usr/local/torque/lib/libtorque.a when
> searching for -ltorque
> /usr/bin/ld: cannot find -ltorque
> collect2: ld returned 1 exit status
> make: *** [mpiexec] Error 1
>
> Can Any one help in this error ???

It looks like your Torque installation is 32-bit and you are trying to
compile a 64-bit mpiexec. That won't work, of course.

axel.

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From sm4082 at nyu.edu Sat Feb 25 12:19:42 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Sat, 25 Feb 2012 14:19:42 -0500
Subject: [torqueusers] Step Change in Job Arrays

Hi Ibad,

Even though Torque doesn't support this way of submitting arrays, if you
want users to be able to submit jobs just as people do on SGE, you can
make it happen with a qsub wrapper or filter.
Once users submit with -t 1-20:2 or #PBS -t 1-20:2, the wrapper can
rewrite the request, using Andrew's method, into the form Torque accepts.
In fact, I'm going to add this to our wrapper some time next week. Once I
have it I'll post it here. I guess it makes things easier for users.

Sreedhar.

On Wed, Feb 8, 2012 at 9:47 AM, Ibad Kureshi U0850037 wrote:
> Thanks Glen, Andy
>
> Andy: Nice!
>
> -Ibad
>
> ________________________________________
> From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] On Behalf Of Andrew Caird [acaird at umich.edu]
> Sent: Wednesday, February 08, 2012 2:28 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] Step Change in Job Arrays
>
> On Wed, Feb 8, 2012 at 9:02 AM, Glen Beane <glen.beane at gmail.com> wrote:
> On Wed, Feb 8, 2012 at 8:57 AM, Ibad Kureshi U0850037 wrote:
> > Hello,
> >
> > I was wondering if someone could tell me how to adjust the step size in
> > a job array. We are running Torque 2.5.7 with the PBS_SCHEDD on a small
> > cluster and our users want to submit arrays.
> >
> > On the SGE and the Moab/Torque based systems
> >
> > $ -t 1-20:2
> >
> > or
> >
> > #PBS -t 1-20:2
> >
> > respectively, gives them 10 jobs with even ID numbers.
> >
> > How can this be done with Torque? It throws out "qsub: Bad Job Array
> > Request" error
> >
> > Have not been able to find much literature on this.
> >
> > Thanks
>
> this is not currently supported, but it is a great feature request.
>
> unfortunately the only option would be to explicitly specify each array ID:
>
> #PBS -t 2,4,6,8,10 ...20
>
> Or:
>
> qsub -t `seq -s, 2 2 20` pbsfile.txt
>
> in case you don't want to type all the numbers.
>
> --andy

--
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110

From samuel at unimelb.edu.au Sat Feb 25 18:06:35 2012
From: samuel at unimelb.edu.au (Chris Samuel)
Date: Sun, 26 Feb 2012 12:06:35 +1100
Subject: [torqueusers] vmem and pvmem
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74E97@exvic-mbx04.nexus.csiro.au>
References: <20120215231031.GA956@stikine.sfu.ca> <007DECE986B47F4EABF823C1FBB19C620102CDD74E97@exvic-mbx04.nexus.csiro.au>
Message-ID: <201202261206.35271.samuel@unimelb.edu.au>

On Friday 24 February 2012 21:03:12 Gareth.Williams at csiro.au wrote:

> We expect/advise people to only request vmem and set a
> modest default vmem which forces them to explicitly specify
> vmem in most cases. The pbs_resources man page describes vmem
> as an aggregate limit across the whole job.

We *currently* have two queues, one called "smp" for, well, SMP codes
that automatically get allocated an entire node (our submit filter
rejects jobs that try to ask for nodes or procs in that queue) and the
default "batch" queue for MPI and single CPU jobs.

The batch queue has a default pvmem of 1gb and users then request more if
they need it. This means that, because pvmem sets RLIMIT_AS, codes should
see malloc() fail rather than get killed by pbs_mom.

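The malloc() behaviour described above can be reproduced outside Torque. The following is a minimal Python sketch, assuming a Linux host where RLIMIT_AS is enforced. The 1 GiB figure only mirrors the default pvmem mentioned above, and the helper name allocate_under_rlimit_as is made up for illustration. A child process lowers its own RLIMIT_AS and then asks for more memory than the limit allows, so the allocation fails inside the process instead of the process being killed from outside:

```python
import subprocess
import sys

# Script run in a child process so the limit does not affect the caller.
CHILD = r"""
import resource, sys
limit = int(sys.argv[1])
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
if hard != resource.RLIM_INFINITY:
    limit = min(limit, hard)          # the soft limit cannot exceed the hard limit
resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
try:
    buf = bytearray(2 * limit)        # ask for twice the address-space limit
    print("allocated")
except MemoryError:                   # the allocator was refused by the kernel
    print("MemoryError")
"""


def allocate_under_rlimit_as(limit_bytes):
    """Hypothetical helper: report what a child sees when it exceeds RLIMIT_AS."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(limit_bytes)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(allocate_under_rlimit_as(1 << 30))  # 1 GiB, like the default pvmem
```

A compiled MPI code would see the same condition as a NULL return from malloc(), which it can report or handle itself.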
cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

From samuel at unimelb.edu.au Sat Feb 25 18:09:40 2012
From: samuel at unimelb.edu.au (Chris Samuel)
Date: Sun, 26 Feb 2012 12:09:40 +1100
Subject: [torqueusers] vmem and pvmem
In-Reply-To: <4F482228.8080208@anu.edu.au>
References: <20120215231031.GA956@stikine.sfu.ca> <20120224220009.GC29630@stikine.sfu.ca> <4F482228.8080208@anu.edu.au>
Message-ID: <201202261209.40523.samuel@unimelb.edu.au>

On Saturday 25 February 2012 10:50:00 David Singleton wrote:

> I disagree with vmem setting RLIMIT_AS if that is what is
> happening.

RLIMIT_AS is the only way to limit the amount of memory a process can
allocate in Linux with Torque (unless you have control group support
enabled in both your kernel and in Torque).

Current implementations of malloc() in glibc use mmap() for any
allocation above a trivial value and the kernel only looks at RLIMIT_AS
for mmap().

cheers,
Chris

From mailmaverick666 at gmail.com Mon Feb 27 05:40:13 2012
From: mailmaverick666 at gmail.com (rishi pathak)
Date: Mon, 27 Feb 2012 18:10:13 +0530
Subject: [torqueusers] torque tmpdir on Lustre filesystem
In-Reply-To: <4F47655E.6060503@cyf-kr.edu.pl>
References: <4F3CD1E6.3050309@cyf-kr.edu.pl> <4F471801.6000207@unimelb.edu.au> <4F47655E.6060503@cyf-kr.edu.pl>

Hello Lukasz,

On Fri, Feb 24, 2012 at 3:54 PM, Lukasz Flis wrote:

> Hello Christopher, Hi *
>
> > We don't use Lustre (we have Panasas and GPFS), but just wondering
> > does this happen all the time, or only occasionally ?
>
> It happens occasionally. But as I said - this seems like a bug in Lustre
> FS, and it has nothing to do with torque code. Torque is using an unlucky
> sequence of stat/mkdir functions which exposes lustre misbehaviour.
>
> > If occasionally then if the job fails once, will it always fail, or
> > will it work if you try again?
>
> Another call to the mkdirtree() function should succeed after a few
> seconds of sleep.
>
> I believe this behaviour in Lustre client appeared in 1.8.x line and
> remains in 2.1.X. HP SFS IIRC is based on 1.4 and 1.6 so it's not affected.

HP SFS installed here is on Lustre 1.8.4

> We have observed the BUG in Lustre 1.8.(4,5.6) infrastructure. Then we
> moved to 2.1 line replacing all the components (servers,arrays,fabric)
> and the bug remained.
>
> The problem with lustre is that mkdir() call on EXISTING directory
> returns EPERM error instead of EEXIST once in a while, usually when
> stat() is called before mkdir.

The $tmpdir variable is appended with jobid, so it would be a new path
every time, unless the call is in a way similar to command
"mkdir -p /mnt/lustre/scratch/jobs/"

> I believe doing mkdir on an existing path is not very common practice and
> that's the reason the BUG was unnoticed for a long time
>
> Cheers,
> --
> Lukasz Flis

--
---
Rishi Pathak
National PARAM Supercomputing Facility
C-DAC, Pune, India

From dina.mahmoud87 at gmail.com Mon Feb 27 06:07:27 2012
From: dina.mahmoud87 at gmail.com (Dina Mahmoud)
Date: Mon, 27 Feb 2012 13:07:27 +0000 (UTC)
Subject: [torqueusers] [OSC mpiexec] Integrating it with torque

How do I know whether my Torque is 32 or 64 bit?

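One way to answer this question without knowing how Torque was built is to look at the ELF header of an installed binary or shared library: byte 4 (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit ones. Below is a minimal Python sketch, assuming a Linux ELF installation. The function name elf_class is illustrative and not part of Torque:

```python
def elf_class(path):
    """Return 32 or 64 for an ELF file, or None if the file is not ELF."""
    with open(path, "rb") as f:
        header = f.read(5)
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    # EI_CLASS byte: 1 = ELFCLASS32, 2 = ELFCLASS64
    return {1: 32, 2: 64}.get(header[4])


if __name__ == "__main__":
    import sys
    for p in sys.argv[1:]:
        print(p, elf_class(p))
```

Run against the library named in the linker error earlier in the thread, for example /usr/local/torque/lib/libtorque.so, it shows which word size the installation was built for. A static .a archive is not itself an ELF file, so the sketch returns None for it.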
From l.flis at cyf-kr.edu.pl Mon Feb 27 06:22:30 2012
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Mon, 27 Feb 2012 14:22:30 +0100
Subject: [torqueusers] torque tmpdir on Lustre filesystem
References: <4F3CD1E6.3050309@cyf-kr.edu.pl> <4F471801.6000207@unimelb.edu.au> <4F47655E.6060503@cyf-kr.edu.pl>
Message-ID: <4F4B8396.4010109@cyf-kr.edu.pl>

Hi,

> HP SFS installed here is on Lustre 1.8.4
>
> > We have observed the BUG in Lustre 1.8.(4,5.6) infrastructure. Then we
> > moved to 2.1 line replacing all the components (servers,arrays,fabric)
> > and the bug remained.
> >
> > The problem with lustre is that mkdir() call on EXISTING directory
> > returns EPERM error instead of EEXIST once in a while, usually when
> > stat() is called before mkdir.
>
> The $tmpdir variable is appended with jobid, so it would be a new path
> every time, unless the call is in a way similar to command
> "mkdir -p /mnt/lustre/scratch/jobs/"

The mkdir command from coreutils is a different case because it issues a
stat() call before calling mkdir(). The mkdirtree function from torque
invokes mkdir() all along the path, expecting an EEXIST error when
mkdir()ing existing directories.

If stat is not called before mkdir() and some time has passed since the
last access to the directory, mkdir() will return EPERM instead of
EEXIST. The next mkdir() call with the same arguments will return EEXIST
again.

Today one of our users hit the bug when using QuantumEspresso software.
I'm waiting to see what Whamcloud can say about it.

Cheers
--
Lukasz Flis

From pankaj.dorlikar at gmail.com Sat Feb 25 04:12:05 2012
From: pankaj.dorlikar at gmail.com (pankaj dorlikar)
Date: Sat, 25 Feb 2012 16:42:05 +0530
Subject: [torqueusers] job does not start

Hi,

We have Torque Server Version 2.5.8 and maui version 3.2.6p1 installed on
rhel 5.2 server.

"showstart" for one of the jobs says that the job should start now, i.e.
Earliest start in 00:00:00 on current time.

########################
checkjob -vv says that

checkjob -vv 62235

checking job 62235 (RM job '62235.yc9.cn.yuva.param')

State: Idle
Creds: user:abcd group:pqr account:PQR-PR class:q1 qos:q1-qos
WallTime: 00:00:00 of 2:05:00:00
SubmitTime: Thu Feb 23 18:56:26
  (Time Queued Total: 1:21:27:05 Eligible: 1:21:27:05)

Total Tasks: 2

Req[0] TaskCount: 2 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
NodeCount: 0

IWD: [NONE] Executable: [NONE]
Bypass: 51 StartCount: 0
PartitionMask: [ALL]
Reservation '62235' (00:00:00 -> 2:05:00:00 Duration: 2:05:00:00)
PE: 2.00 StartPriority: 2727
job cannot run in partition DEFAULT (insufficient idle procs available: 0 < 2)
job can run in partition P1 (32 procs available. 2 procs required)
job can run in partition P2 (48 procs available. 2 procs required)

########################
showres -n 62235 says that

reservations on Sat Feb 25 16:28:10

NodeName            Type  ReservationID  JobState  Task  Start     Duration    StartTime

node16.clusternode  Job   62235          Idle      2     00:00:00  2:05:00:00  Sat Feb 25 16:28:10

1 nodes reserved

############################
checknode node16.clusternode says that the node is available for job run,
but somehow the job is not starting and there is no error in the maui,
pbs_server or pbs_mom logs either.

What can be the issue? What can be done to make the job run and avoid the
same in future?

thank you
-pankakjd

--
Pankaj V. Dorlikar

From arkaaloke at gmail.com Sun Feb 26 16:24:17 2012
From: arkaaloke at gmail.com (Arka Aloke Bhattacharya)
Date: Sun, 26 Feb 2012 15:24:17 -0800
Subject: [torqueusers] reducing energy usage of torque

Hi everyone,

I am a PhD student at UC Berkeley, and I wanted to add a "turning off
idle/underutilized servers" feature to our 100 server torque+maui
deployment. However, I want to implement this feature using only existing
torque+maui interfaces and extensions (i.e. *without modifying* the
torque or maui source code in any way).

My proposed way is to
1. monitor the maui queue length, and estimate the number of servers I
   can switch off.
2. I would then use "pbsnodes -o " command to render a certain number of
   servers offline for scheduling.
3. A bash script would turn the servers off.

The servers would be turned back on (and added to the torque nodes list)
when the queue length increases beyond a certain threshold.

I had two questions:

1. Is there any existing open source code which already implements the
   "turning off idle servers" functionality in torque?
2. Are there complications that would arise if I implemented the
   "turning-off idle servers" feature in my proposed way? [e.g. Is it
   possible that after turning off servers, they would lose some state
   and hence would not get added to the torque when turned back on? Are
   there long lived TCP connections which need to be restarted
   separately?, etc.]

It would be great if anyone could help.

Thanks a lot,
Arka.

From akohlmey at cmm.chem.upenn.edu Mon Feb 27 07:01:19 2012
From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer)
Date: Mon, 27 Feb 2012 09:01:19 -0500
Subject: [torqueusers] [OSC mpiexec] Integrating it with torque

On Mon, Feb 27, 2012 at 8:07 AM, Dina Mahmoud wrote:
>
> How to know my torques is 32 or 64 bit ?

file /usr/local/torque/bin/qsub ?

--
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

From pankaj.dorlikar at gmail.com Mon Feb 27 11:20:38 2012
From: pankaj.dorlikar at gmail.com (pankaj dorlikar)
Date: Mon, 27 Feb 2012 23:50:38 +0530
Subject: [torqueusers] job is not running

---------- Forwarded message ----------
From: pankaj dorlikar
Date: Sat, Feb 25, 2012 at 4:42 PM
Subject: job does not start

hi,

We have Torque Server Version 2.5.8 and maui version 3.2.6p1 installed on
rhel 5.2 server.

"showstart" for one of the jobs says that the job should start now, i.e.
Earliest start in 00:00:00 on current time.

########################
checkjob -vv says that

checkjob -vv 62235

checking job 62235 (RM job '62235.pbs_server.clusternode')

State: Idle
Creds: user:abcd group:pqr account:PQR-PR class:q1 qos:q1-qos
WallTime: 00:00:00 of 2:05:00:00
SubmitTime: Thu Feb 23 18:56:26
  (Time Queued Total: 1:21:27:05 Eligible: 1:21:27:05)

Total Tasks: 2

Req[0] TaskCount: 2 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
NodeCount: 0

IWD: [NONE] Executable: [NONE]
Bypass: 51 StartCount: 0
PartitionMask: [ALL]
Reservation '62235' (00:00:00 -> 2:05:00:00 Duration: 2:05:00:00)
PE: 2.00 StartPriority: 2727
job cannot run in partition DEFAULT (insufficient idle procs available: 0 < 2)
job can run in partition P1 (32 procs available. 2 procs required)
job can run in partition P2 (48 procs available. 2 procs required)

########################
showres -n 62235 says that

reservations on Sat Feb 25 16:28:10

NodeName            Type  ReservationID  JobState  Task  Start     Duration    StartTime

node16.clusternode  Job   62235          Idle      2     00:00:00  2:05:00:00  Sat Feb 25 16:28:10

1 nodes reserved

############################
checknode node16.clusternode says that the node is available for job run,
but somehow the job is not starting and there is no error in the maui,
pbs_server or pbs_mom logs either.

What can be the issue? What can be done to make the job run and avoid the
same in future?

thank you
-pankakjd

--
Pankaj V. Dorlikar

--
Pankaj V. Dorlikar

From Gareth.Williams at csiro.au Mon Feb 27 16:35:07 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Tue, 28 Feb 2012 10:35:07 +1100
Subject: [torqueusers] reducing energy usage of torque
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74EA3@exvic-mbx04.nexus.csiro.au>

Hi Arka,

Your approach seems fine. There should be no particular complications,
just take sufficient care to ensure that nodes are actually idle. The
nodes are essentially stateless for your purpose.

Gareth

From samuel at unimelb.edu.au Mon Feb 27 20:45:54 2012
From: samuel at unimelb.edu.au (Christopher Samuel)
Date: Tue, 28 Feb 2012 14:45:54 +1100
Subject: [torqueusers] [OSC mpiexec] Integrating it with torque
Message-ID: <4F4C4DF2.7010605@unimelb.edu.au>

On 28/02/12 00:07, Dina Mahmoud wrote:

> How to know my torques is 32 or 64 bit ?

How did you compile it, on what OS and what architecture ?

--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/

From danielfcoimbra at gmail.com Tue Feb 28 00:43:28 2012
From: danielfcoimbra at gmail.com (Daniel Fernando Coimbra)
Date: Tue, 28 Feb 2012 04:43:28 -0300
Subject: [torqueusers] reducing energy usage of torque
Message-ID: <4F4C85A0.3080004@gmail.com>

I assume that by "turning off" you mean actually powering down the node.
I am just curious about how you intend to power it up again later. I
suppose you could use something like Wake-on-LAN, but I never actually
got to test this kind of thing and don't know how it would behave on a
high traffic network (I suppose the network card doesn't keep its IP once
it's in such a state).

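On the Wake-on-LAN point above: the wake-up does not rely on the sleeping node holding an IP address. The NIC looks for a "magic packet" whose payload is six 0xFF bytes followed by its own MAC address repeated 16 times, usually sent as a broadcast. Below is a minimal Python sketch. The MAC address in the usage note, the broadcast address and UDP port 9 are assumptions for illustration, not values from this thread:

```python
import socket


def build_magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    digits = mac.replace(":", "").replace("-", "")
    if len(digits) != 12:
        raise ValueError("expected a 48-bit MAC address, got %r" % mac)
    return b"\xff" * 6 + bytes.fromhex(digits) * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Send the packet as a UDP broadcast (address and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(build_magic_packet(mac), (broadcast, port))
```

Calling send_magic_packet("00:11:22:33:44:55") from the master node would target a node with that (made-up) MAC, provided Wake-on-LAN is enabled in the node's firmware and NIC settings and the broadcast reaches its network segment.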
On 02/26/2012 08:24 PM, Arka Aloke Bhattacharya wrote: > Hi everyone, > > I am a PhD student at UC Berkeley, and I wanted to add a "turning off > idle/underutilized servers" feature to our 100 server torque+maui > deployment. However, I want to implement this feature using only > existing torque+ maui interfaces and extensions ( i,e _without > modifying_ the torque or maui source code in any way ). > > My proposed way is to > 1. monitor the maui queue length , and estimate the number of servers > I can switch off. > 2. I would then use "pbsnodes -o " command to render a > certain number of servers offline for scheduling. > 3. A bash script would turn the servers off. > > The servers would be turned back on (and added to the torque nodes > list) when the queue length increases beyond a certain threshold. > > I had two questions : > > 1. Is there any existing open source code which already implements the > "turning off idle servers" functionality in torque ? > 2. Are there complications that would arise if I implemented the > "turning-off idle servers" feature in my proposed way ? [ e.g - Is it > possible that after turning off servers, they would lose some state > and hence would not get added to the torque when turned > back on? Are there long lived TCP connections which need to be > restarted separately ? , etc ] > > It would be great if anyone could help. > > Thanks a lot, > Arka. > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From ngsbioinformatics at gmail.com Tue Feb 28 07:08:00 2012 From: ngsbioinformatics at gmail.com (Ryan Golhar) Date: Tue, 28 Feb 2012 09:08:00 -0500 Subject: [torqueusers] reducing energy usage of torque In-Reply-To: <4F4C85A0.3080004@gmail.com> References: <4F4C85A0.3080004@gmail.com> Message-ID: What about cycling the power using a PDU? 
On Tue, Feb 28, 2012 at 2:43 AM, Daniel Fernando Coimbra < danielfcoimbra at gmail.com> wrote: > I assume that by "turning off" you mean actually power down the node. I > am just curious on how do you intend to power it up again later. I > suppose you could use something like WakeUp on Lan, but I never actually > got to test this kind of thing and don't know how it would behave on a > high traffic network (I suppose the network card doesn't keep it's IP > once it's in such state). > > On 02/26/2012 08:24 PM, Arka Aloke Bhattacharya wrote: > > Hi everyone, > > > > I am a PhD student at UC Berkeley, and I wanted to add a "turning off > > idle/underutilized servers" feature to our 100 server torque+maui > > deployment. However, I want to implement this feature using only > > existing torque+ maui interfaces and extensions ( i,e _without > > modifying_ the torque or maui source code in any way ). > > > > My proposed way is to > > 1. monitor the maui queue length , and estimate the number of servers > > I can switch off. > > 2. I would then use "pbsnodes -o " command to render a > > certain number of servers offline for scheduling. > > 3. A bash script would turn the servers off. > > > > The servers would be turned back on (and added to the torque nodes > > list) when the queue length increases beyond a certain threshold. > > > > I had two questions : > > > > 1. Is there any existing open source code which already implements the > > "turning off idle servers" functionality in torque ? > > 2. Are there complications that would arise if I implemented the > > "turning-off idle servers" feature in my proposed way ? [ e.g - Is it > > possible that after turning off servers, they would lose some state > > and hence would not get added to the torque when turned > > back on? Are there long lived TCP connections which need to be > > restarted separately ? , etc ] > > > > It would be great if anyone could help. > > > > Thanks a lot, > > Arka. 
> > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120228/8c134779/attachment-0001.html From raub at uni-duesseldorf.de Tue Feb 28 07:21:30 2012 From: raub at uni-duesseldorf.de (Dr. Stephan Raub) Date: Tue, 28 Feb 2012 15:21:30 +0100 Subject: [torqueusers] reducing energy usage of torque In-Reply-To: References: <4F4C85A0.3080004@gmail.com> Message-ID: <013201ccf624$4438f220$ccaad660$@de> Hi, all of our nodes (compute nodes and service nodes) are equipped with IPMI-capable BMCs (Baseboard Management Controller) so that we can control all aspects of power (including measuring the current power consumption, turning it on or off, power cycles, etc ) from the batch server just by using the ipmitools-package. We have used this for controlling nodes within a torque/maui in the context of a bachelor project. But our clusters is so busy all the time, that we could not find a dramatic reduction of the over-all power consumption of the cluster (including water cooling of the racks). Best regards -- --------------------------------------------------------- | | Dr. rer. nat. Stephan Raub | | Dipl. Chem. | | High-Performance-Computing | | Zentrum f?r Informations- und Medientechnologie | | Heinrich-Heine-Universit?t D?sseldorf | | Universit?tsstr. 1 / Raum 25.41.O2.25-2 | | 40225 D?sseldorf / Germany | | | | Tel: +49-211-811-3911 | | Fax: +49-211-811-2539 --------------------------------------------------------- Wichtiger Hinweis: Diese E-Mail kann Betriebs- oder Gesch?ftsgeheimnisse, bzw. 
sonstige vertrauliche Informationen enthalten. Sollten Sie diese E-Mail irrt?mlich erhalten haben, ist Ihnen eine Kenntnisnahme des Inhalts, eine Vervielf?ltigung oder Weitergabe der E-Mail ausdr?cklich untersagt. Bitte benachrichtigen Sie uns und vernichten Sie die empfangene E-Mail. Vielen Dank. Important Note: This e-mail may contain trade secrets or privileged, undisclosed or otherwise confidential information. If you have received this e-mail in error, you are hereby notified that any review, copying or distribution of it is strictly prohibited. Please inform us immediately and destroy the original transmittal. Thank you for your cooperation. Von: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] Im Auftrag von Ryan Golhar Gesendet: Dienstag, 28. Februar 2012 15:08 An: Torque Users Mailing List Betreff: Re: [torqueusers] reducing energy usage of torque What about cycling the power using a PDU? On Tue, Feb 28, 2012 at 2:43 AM, Daniel Fernando Coimbra wrote: I assume that by "turning off" you mean actually power down the node. I am just curious on how do you intend to power it up again later. I suppose you could use something like WakeUp on Lan, but I never actually got to test this kind of thing and don't know how it would behave on a high traffic network (I suppose the network card doesn't keep it's IP once it's in such state). On 02/26/2012 08:24 PM, Arka Aloke Bhattacharya wrote: > Hi everyone, > > I am a PhD student at UC Berkeley, and I wanted to add a "turning off > idle/underutilized servers" feature to our 100 server torque+maui > deployment. However, I want to implement this feature using only > existing torque+ maui interfaces and extensions ( i,e _without > modifying_ the torque or maui source code in any way ). > > My proposed way is to > 1. monitor the maui queue length , and estimate the number of servers > I can switch off. > 2. 
I would then use "pbsnodes -o " command to render a > certain number of servers offline for scheduling. > 3. A bash script would turn the servers off. > > The servers would be turned back on (and added to the torque nodes > list) when the queue length increases beyond a certain threshold. > > I had two questions : > > 1. Is there any existing open source code which already implements the > "turning off idle servers" functionality in torque ? > 2. Are there complications that would arise if I implemented the > "turning-off idle servers" feature in my proposed way ? [ e.g - Is it > possible that after turning off servers, they would lose some state > and hence would not get added to the torque when turned > back on? Are there long lived TCP connections which need to be > restarted separately ? , etc ] > > It would be great if anyone could help. > > Thanks a lot, > Arka. > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120228/169efa21/attachment.html From L.S.Lowe at bham.ac.uk Tue Feb 28 08:53:47 2012 From: L.S.Lowe at bham.ac.uk (Lawrence Lowe) Date: Tue, 28 Feb 2012 15:53:47 +0000 (GMT) Subject: [torqueusers] reducing energy usage of torque In-Reply-To: <013201ccf624$4438f220$ccaad660$@de> References: <4F4C85A0.3080004@gmail.com> <013201ccf624$4438f220$ccaad660$@de> Message-ID: Hi, we do something similar here on our local cluster, where demand can be peaky. (On our Grid cluster, we don't bother to do this as there is constant demand. 
Our Uni cluster has network-controlled PDUs and uses Moab, which already comes with some green computing support.)

The worker nodes have wake-on-LAN enabled by default. If a monitoring script on a master node detects that there aren't enough free worker nodes to do the work, it wakes up a down, non-offline node using "wol"; a bit of code ensures that it won't try to wake up the same node again for a while. If a worker node detects in its own cron script, by looking at directory time-stamps, that it hasn't run any jobs for a while, it shuts down. Nothing complicated, blatantly asymmetric, and it seems to work in our admittedly simple environment.

When I did this I thought it would be a nice feature (torque developers!) if, instead of a plain poweroff, I was able to use a "momctl -S" which conditionally shut down the mom and cleanly informed the pbs_server to mark it down if it was empty of jobs, but returned an error code if the mom [now] had jobs. I guess the same effect could be programmed as a specific new "terminate if empty" signal handler in pbs_mom. There are probably other ways of avoiding race conditions, using pbsnodes -o, or signalling the scheduler not to schedule jobs while decisions are made, but that's more complicated.

LL

On Tue, 28 Feb 2012, Dr. Stephan Raub wrote:
> Hi,
>
> all of our nodes (compute nodes and service nodes) are equipped with IPMI-capable BMCs (Baseboard Management Controller) so that we can control all aspects of power (including measuring the current power consumption, turning it on or off, power cycles, etc.) from the batch server just by using the ipmitools-package.
>
> We have used this for controlling nodes within a torque/maui in the context of a bachelor project. But our cluster is so busy all the time that we could not find a dramatic reduction of the over-all power consumption of the cluster (including water cooling of the racks).
>
> Best regards
> --
> Dr. rer. nat. Stephan Raub
> Dipl. Chem.
> High-Performance-Computing
> Zentrum für Informations- und Medientechnologie
> Heinrich-Heine-Universität Düsseldorf
> Universitätsstr. 1 / Raum 25.41.O2.25-2
> 40225 Düsseldorf / Germany
>
> Tel: +49-211-811-3911
> Fax: +49-211-811-2539
>
> Important Note: This e-mail may contain trade secrets or privileged, undisclosed or otherwise confidential information. If you have received this e-mail in error, you are hereby notified that any review, copying or distribution of it is strictly prohibited. Please inform us immediately and destroy the original transmittal. Thank you for your cooperation.
>
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ryan Golhar
> Sent: Tuesday, 28 February 2012 15:08
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] reducing energy usage of torque
>
> What about cycling the power using a PDU?
>
> On Tue, Feb 28, 2012 at 2:43 AM, Daniel Fernando Coimbra wrote:
>
> I assume that by "turning off" you mean actually powering down the node. I am just curious how you intend to power it up again later.
> I suppose you could use something like Wake-on-LAN, but I never actually got to test this kind of thing and don't know how it would behave on a high-traffic network (I suppose the network card doesn't keep its IP once it's in such a state).
>
> On 02/26/2012 08:24 PM, Arka Aloke Bhattacharya wrote:
> > Hi everyone,
> >
> > I am a PhD student at UC Berkeley, and I wanted to add a "turning off idle/underutilized servers" feature to our 100-server torque+maui deployment. However, I want to implement this feature using only existing torque+maui interfaces and extensions (i.e. _without modifying_ the torque or maui source code in any way).
> >
> > My proposed way is to
> > 1. monitor the maui queue length, and estimate the number of servers I can switch off.
> > 2. I would then use "pbsnodes -o " command to render a certain number of servers offline for scheduling.
> > 3. A bash script would turn the servers off.
> >
> > The servers would be turned back on (and added to the torque nodes list) when the queue length increases beyond a certain threshold.
> >
> > I had two questions:
> >
> > 1. Is there any existing open source code which already implements the "turning off idle servers" functionality in torque?
> > 2. Are there complications that would arise if I implemented the "turning-off idle servers" feature in my proposed way? [e.g. is it possible that after turning off servers, they would lose some state and hence would not get added to the torque when turned back on? Are there long-lived TCP connections which need to be restarted separately?, etc.]
> >
> > It would be great if anyone could help.
> >
> > Thanks a lot,
> > Arka.

From jonb at lanl.gov Tue Feb 28 15:13:07 2012
From: jonb at lanl.gov (Jon Bringhurst)
Date: Tue, 28 Feb 2012 15:13:07 -0700
Subject: [torqueusers] W3C HPC Web Working Group
Message-ID: <4F4D5173.4050408@lanl.gov>

Since there's a chance this new working group may become relevant to resource managers in the future, I thought it may be a good idea to spread the word to a larger community.

"This [W3C] community group is focused on bringing high performance computing (HPC) to the web."

For the web site:

For the mailing list, send a message to with the word "subscribe" in the subject.

There's also a telecon planned for March 7 at 11:00 am PST. Call-in information will be posted to the mailing list at a later date.

-Jon

From sergey_bulk at list.ru Wed Feb 29 03:17:49 2012
From: sergey_bulk at list.ru (Sergey Bulk)
Date: Wed, 29 Feb 2012 14:17:49 +0400
Subject: [torqueusers] array job and resources
Message-ID:

I have torque 2.5.7-9.el6 from the EPEL repo on SL6. I have 24-core nodes node01-node12. When requesting resources for an array job

    #!/bin/bash
    #PBS -t 1-12
    #PBS -l nodes=node01:ppn=1+node02:ppn=1+node03:ppn=1+node04:ppn=1+....+node12:ppn=1
    #PBS -d .
    for f in `seq 1 1000`; do
        ps aux
    done;

I would expect each job in the array to occupy its own node, but instead the -l option applies to every job, not to the whole array. So all jobs run on node01, because it has enough cores. What is the correct way to request resources for the whole array job rather than for a single job in the array?
Thank you, SN

From sm4082 at nyu.edu Wed Feb 29 20:51:03 2012
From: sm4082 at nyu.edu (Sreedhar Manchu)
Date: Wed, 29 Feb 2012 22:51:03 -0500
Subject: [torqueusers] Step Change in Job Arrays
In-Reply-To:
References:
Message-ID:

Hello,

Putting the code below in a qsub wrapper makes torque recognize a step change like 1-20:2 or 0-19:3. This works only when -t is given on the command line. I made my script recognize the same even when users put it inside the job script, but that version is quite long. I guess once we find #PBS -t 2-23:3, we just have to make the script change it to #PBS -t 2,5,8,11,14,17,20,23, which can easily be done with Andy's method of using seq. Anyway, here is the wrapper code for making it work on the command line.

    #!/bin/bash
    # Uses bash arrays and C-style for loops, which plain POSIX sh does not provide.
    args=("$@")
    for ((arg=0; arg<$#; arg++))
    do
        if [ "${args[$arg]}" = "-t" ]
        then
            # Only rewrite an argument of the form START-END:STEP.
            if echo "${args[$(($arg+1))]}" | egrep "^(^0|^[1-9][0-9]*)-[1-9][0-9]*:[1-9][0-9]*$" > /dev/null 2>&1
            then
                array_arg="${args[$(($arg+1))]}"
                # seq -s, START STEP END prints the comma-separated ID list.
                array_arg="$(seq -s, $(echo $array_arg | cut -f1 -d-) $(echo $array_arg | cut -f2 -d:) $(echo $array_arg | cut -f1 -d: | cut -f2 -d-))"
                # Rebuild the positional parameters with the expanded list in place of START-END:STEP.
                str="set -- \"\${@:1:$(($arg+1))}\" \"${array_arg}\" \"\${@:$(($arg+3))}\""
                eval `echo "$str"`
                break
            fi
        fi
    done

In case you already have a wrapper in place, it doesn't do any harm to keep this at the beginning of it.

Sreedhar.

On Wed, Feb 8, 2012 at 9:47 AM, Ibad Kureshi U0850037 wrote:
> Thanks Glen, Andy
>
> Andy: Nice!
>
> -Ibad
>
> ________________________________________
> From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] On Behalf Of Andrew Caird [acaird at umich.edu]
> Sent: Wednesday, February 08, 2012 2:28 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] Step Change in Job Arrays
>
> On Wed, Feb 8, 2012 at 9:02 AM, Glen Beane <glen.beane at gmail.com> wrote:
> On Wed, Feb 8, 2012 at 8:57 AM, Ibad Kureshi U0850037 wrote:
> > Hello,
> >
> > I was wondering if someone could tell me how to adjust the step size in a job array. We are running Torque 2.5.7 with the PBS_SCHEDD on a small cluster and our users want to submit arrays.
> >
> > On the SGE and the Moab/Torque based systems
> >
> > $ -t 1-20:2
> >
> > or
> >
> > #PBS -t 1-20:2
> >
> > respectively, gives them 10 jobs with even ID numbers.
> >
> > How can this be done with Torque? It throws out a "qsub: Bad Job Array Request" error.
> >
> > I have not been able to find much literature on this.
> >
> > Thanks
>
> this is not currently supported, but it is a great feature request.
>
> unfortunately the only option would be to explicitly specify each array ID:
>
> #PBS -t 2,4,6,8,10 ...20
>
> Or:
>
> qsub -t `seq -s, 2 2 20` pbsfile.txt
>
> in case you don't want to type all the numbers.
>
> --andy
>
> ---
> This transmission is confidential and may be legally privileged. If you receive it in error, please notify us immediately by e-mail and remove it from your system. If the content of this e-mail does not relate to the business of the University of Huddersfield, then we do not endorse it and will accept no liability.

--
Sreedhar Manchu
HPC Support Specialist
New York University
251 Mercer Street
New York, NY 10012-1110
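
For the in-script case mentioned in the message above (finding a line such as #PBS -t 2-23:3 and rewriting it as an explicit ID list), a minimal sketch could look like the following. This is not the longer script referred to above, which was not posted; the function names expand_range and rewrite_pbs_t are hypothetical, and the sketch assumes the directive sits alone on its line in exactly the form "#PBS -t START-END:STEP". It builds the list with a bash loop rather than seq, so it does not depend on which seq implementation is installed.

```shell
#!/bin/bash
# Hypothetical helper names; not part of Torque or of the wrapper posted above.

# expand_range START-END:STEP prints the comma-separated ID list.
# Anything that is not of the form START-END:STEP is printed unchanged.
expand_range() {
    local spec=$1 re='^([0-9]+)-([0-9]+):([1-9][0-9]*)$'
    if [[ $spec =~ $re ]]; then
        # 10# forces base 10 so that IDs such as 08 are not read as octal.
        local first=$((10#${BASH_REMATCH[1]}))
        local last=$((10#${BASH_REMATCH[2]}))
        local step=$((10#${BASH_REMATCH[3]}))
        if (( first <= last )); then
            local out=$first i
            for ((i = first + step; i <= last; i += step)); do
                out+=",$i"
            done
            printf '%s\n' "$out"
            return
        fi
    fi
    printf '%s\n' "$spec"
}

# rewrite_pbs_t reads a job script on stdin and writes it to stdout,
# expanding any "#PBS -t START-END:STEP" line; other lines pass through.
rewrite_pbs_t() {
    local line
    local re='^(#PBS[[:space:]]+-t[[:space:]]+)([0-9]+-[0-9]+:[1-9][0-9]*)[[:space:]]*$'
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $re ]]; then
            printf '%s%s\n' "${BASH_REMATCH[1]}" "$(expand_range "${BASH_REMATCH[2]}")"
        else
            printf '%s\n' "$line"
        fi
    done
}
```

A wrapper could then run something like "rewrite_pbs_t < job.pbs > job.expanded.pbs" and submit the expanded file (both file names are placeholders). Ranges without a step, such as 1-12, are left alone, since Torque already accepts them.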